Hello and welcome to Data Analysis
with Python Zero to Pandas. This is a free certification course
being organized in collaboration by Jovian and freeCodeCamp. My name is Aakash. I'm your instructor for the course, and I'm also the CEO of Jovian; you can find me on Twitter. Jovian is a sharing and collaboration platform for data science. We've been sharing all the course
materials with you using Jovian, and you've been submitting the assignments
using Jovian as well. And freecodecamp.org is a place where you can
learn to code for free by building great projects and interacting
with the worldwide community. So do check it out. Now the course Zero to Pandas is
a coding-first introduction to Data Analysis with Python, and we've had
video lectures every week with live coding on cloud-based Jupyter notebooks. We also had assignments for you
to practice the concepts and the course project that you're
probably currently working on. And once you complete these, you will be able to earn a certificate of accomplishment, all free of cost, and this will
be a verified certificate that you will be able to showcase on your
LinkedIn profile and on your resume. And finally, the topic for today, our final lecture, is exploratory Data Analysis: a case study where we will be
taking everything that we have learned in the entire course and bringing it
together into a single project where we will be analyzing a real world
data set, and we will be asking and answering some interesting questions
about the data, and we will figure out how to use the right tools, techniques, libraries, and functions in the right places, and we will be drawing
some inferences and conclusions. So this is in some sense, a walk
through of a real world project that you might do in Data Analysis. So with that let's get started. So the first thing you should do is go
to the course page zerotopandas.com. And on the course page, you will be
able to find all the course materials. Today we are looking at lesson six, so you can scroll down to lesson six, Exploratory Data Analysis, a case study, and click open. This will take you to the lecture page, where the video that you're watching right now will be available as a recording for you to review. Now for each lecture, we
have been using Jupyter notebooks. So today we will be using the
Jupyter notebook 'EDA on Stack Overflow Developer Survey'. So you can just click on this
link and that will bring you to this Jupyter notebook. Now you can view this Jupyter
notebook, you can read through it. But what we would want to do is we
would want to run this notebook. Now, you need not run it right now; if you're watching live, you can
just watch the lecture. And after the lecture, you can run
this notebook and experiment with it, but I'm going to run it right now. And we are going to work through
it: we click the Run button and select Run on Binder to start the
notebook up on the binder platform. Now, this might take a couple of
minutes and in the meantime, I just want to show you how to ask questions. So if you have questions during
or after the lecture, you can come to the lesson page and here click
on the discussion forum link. And this will take you to a
page on the discussion forum. This is where we've been having all
our discussions, and you can use this thread for asking questions. So all you need to do is click the
reply button and post your question. There's a big blue reply button and you
just post your question and hit reply. And either somebody from the course
team or somebody from the community will answer your questions. Okay. So please ask your questions on the forum, and also answer questions if you already know the answers to some of the questions that have been asked. So returning to the notebook, this
notebook is called exploratory Data Analysis using Python a case
study, and we have clicked the run button and selected run on binder. So finally here we have our
Jupyter interface running, so let's just open up the notebook here
named python-eda-stackoverflow-survey.ipynb. And the first thing that I'm
going to do is I'm going to click on restart and clear output. This is to clear all the outputs so that
only the code remains and we can run the code and see the outputs for ourselves. Then I'm also going to hide
the header and the toolbar. You need not do this, but I'm
just doing this so that we have a little more space to work with. So in this notebook, exploratory Data
Analysis using Python, a case study, we will be analyzing responses from the Stack Overflow annual developer survey, and we will apply all the things that we have learned so far about Python, Jupyter, NumPy, Pandas, Matplotlib, and Seaborn, and bring everything together. Now you can run this code online
just the way that we have, by using the Run button that we just used to run this code on Binder. But you can also run this code on your
computer locally by following the instructions given here. Now, as I mentioned, we will be
analyzing the Stack Overflow Developer Survey dataset. This is an annual survey
conducted by stack overflow. And you can find the raw data
and results on the page insights.stackoverflow.com. In fact, we will be analyzing the
results from 2020, but you can also see the survey from the past years. The first thing that we would want to do
is to download this dataset into Jupyter, and there are many ways to do this; we've seen several over time across different lectures. So the first thing that you could do is
just download the CSV files manually and upload them using Jupyter's graphical interface. So let's see how to do that. You click the download link for the dataset, which points to a Google Drive link, and you download the file from there. So here I'm downloading it to my desktop, and then you can unzip the downloaded archive. Once you unzip it, you will
see this folder developer survey 2020. And if you open up this folder,
you will see here that you have a PDF, a bunch of CSV files, and then a README file. So you can go through the README
to learn more about the dataset, but probably the interesting files for
us are going to be the CSV files. So we might need to upload
the CSV files onto Jupyter. So the way to do that is to come back
to the notebook, go file open, and then you will find an upload button here. So click upload, and just select
the file that you want to upload. So in this case, let's say as
an example, let's try and upload survey_results_schema.csv. So one by one, you can upload
each CSV file that you need. Okay, so that's one way to do it. You can click the upload button
to complete the upload, but I won't do that right now. Then there's another way to do it. If you have a direct
link to the raw CSV file. Okay. So this is very important: you need a link which points directly to the CSV file. A Google Drive link or some other indirect link will not do; it has to point to the raw file. Then you can use the urlretrieve function from the urllib.request module. This is what we've used in our previous lectures, the ones on NumPy and Pandas. So you can review those lectures, lectures three and four, to see how to use this, or just check the documentation.
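For instance, a minimal sketch of that approach (the URL here is just a placeholder; it must point directly to the raw CSV):

    from urllib.request import urlretrieve

    # The URL must point directly to the raw CSV file (placeholder shown)
    urlretrieve('https://example.com/survey_results_public.csv',
                'survey_results_public.csv')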
But today we are going to use a third method. We are going to use a helper library called opendatasets. So this is a library that we have created
to make it easier for you to do Data Analysis, where we are creating
a curated collection of datasets for Data Analysis and machine learning. And these datasets can be downloaded into
Jupyter with a single Python command. So this is how it works. We have to install the library with pip install opendatasets. So let's come back to the Jupyter notebook and run pip install opendatasets; this installs the opendatasets library. Then we import opendatasets as od, and all we need to do is run od.download and pass in a dataset ID. We have currently added about six datasets here, but we will be adding many more; we're trying to make sure that we have at least a hundred datasets here by the end of this week. So you will see more here, and we simply need to pick up the ID of the dataset that we want to use. Here we have the Stack Overflow developer survey, and we need to paste its ID in. So we just call od.download, and under the hood this is going to fetch the list of files in the dataset and download all of them. You can see that the file URLs have been fetched and the files have been downloaded into the folder stackoverflow-developer-survey-2020. Okay. So with that, we seem to have the dataset downloaded.
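Putting those steps together, the whole download is just a few lines:

    !pip install opendatasets --upgrade --quiet
    import opendatasets as od

    # Fetches the file list for this dataset ID and downloads every file
    od.download('stackoverflow-developer-survey-2020')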
To verify, you can go File > Open and check that there is a folder stackoverflow-developer-survey-2020, which contains the README, the public results file, and the schema file. Okay. So those are different ways
to download the dataset. The important point is to have the
files next to your Jupyter notebook, irrespective of how you get them. All right. So let's just import the os module and verify once again that the dataset is actually downloaded. When we run os.listdir on the stackoverflow-developer-survey-2020 folder, we see a README, the survey results public file, and the survey results schema file. So there are three files.
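The check itself is a one-liner:

    import os

    # Should list the README and the two CSV files
    os.listdir('stackoverflow-developer-survey-2020')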
You can go through these three files. The README contains some
information about the dataset and then the survey results schema
contains the list of questions. And let's maybe just load up that file and view it as well. And then there is survey_results_public.csv; this contains the full list
of responses to those questions. Okay. So we will load the CSV files using
pandas: let's import pandas as pd, and let's use the pd.read_csv function, where we will pass in the path to the survey_results_public.csv file, which contains the survey responses. And we are going to call it survey_raw_df; the reason for that name is that this is going to be the raw dataset, the unprocessed dataset that we are loading up, and then we are going to make some modifications to create a prepared dataset for analysis. So here we create survey_raw_df by calling pd.read_csv, and let's take a look at this.
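Here is that step (the path matches the folder we downloaded above):

    import pandas as pd

    survey_raw_df = pd.read_csv(
        'stackoverflow-developer-survey-2020/survey_results_public.csv')
    survey_raw_df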
This is what the data frame looks like: it has a whole bunch of columns. In fact, in total there are 61 columns. Sixty of them correspond to questions from the survey, and the column name here is not the full question, but just a short form. And then there is also one column
which contains a respondent ID. So these survey results are anonymized. So there is no personally identifiable
information, like first name, last name, phone number, email, et cetera. So the results have been anonymized
and every respondent has been given an ID. So there's one column, the respondent ID, which is not really useful for us, and then there are 60 columns, one for each question. Now, if you want to know what the actual question was, this is where you can use the other file, survey_results_schema.csv. And we'll see that in just a moment.
that there are over 64,000 responses and, as we said, 61 columns. Let's just see a list of columns in the data frame; we can use survey_raw_df.columns to see the full list of columns.
the schema file to understand what these columns represent. So here we have this file, survey_results_schema.csv; let me just put the file name on a separate line. Let's load up the entire file first: we can just call pd.read_csv with the schema file name. This file is a CSV file
that contains two columns. One of the columns, the first
column is titled Column. This corresponds to the names of all the columns in the survey data frame, so MainBranch, Hobbyist, et cetera (actually, it contains the Respondent ID as well). So for each of these columns, you
can see the actual question here. For instance, MainBranch means the question was 'Which of the following options best describes you...', et cetera, and Hobbyist means 'Do you code as a hobby?' So that's what the correspondence is. Now, it's okay to have it in this format, but the schema data frame is primarily going to be used to
access the question for each column. So what we might first want to do is
we might just want to set the index by loading the file with the index_col argument set to 'Column', so that when we read it, we just have one proper column and the column names themselves form the index. So now we have Respondent in the index, and then we have this one column, QuestionText. We can access the question text using the column name as a key, right? For example, imagine using .loc and passing in 'Respondent' to access the question text for the Respondent column. Okay. So now let's simplify it a little bit
further since we only have one column. We don't really need a data frame. Data frames are required when we want
to go over multiple columns of data. So we can simply get the question text out of this, and that will give us a series. Okay, it looks like this is stuck; yeah, there we go. So we just read the CSV file into a data frame and, from it, get the QuestionText column; that will give us a series, and that is all we really need. Now the series has an index, which is the column names in the main data frame, and each value in the series corresponds to the full text of the question. Okay. So that is what we have done here: we've created schema_raw by re-reading the schema file with index_col set to 'Column' and creating the QuestionText series.
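That whole step is compact:

    schema_fname = 'stackoverflow-developer-survey-2020/survey_results_schema.csv'

    # Index by column name and keep just the question text as a series
    schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText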
We can now use schema_raw to retrieve the full question text for any column in survey_raw_df. For example, we can check the YearsCodePro column: you can see that the YearsCodePro column in the survey data frame corresponds to the question 'NOT including education, how many years have you coded professionally (as a part of your work)?' All right. So with that, we have loaded up our dataset into data frames and verified that we can now work with it using the tools that we know and understand. So we are now ready to move on to
the next step of pre-processing and cleaning the data for our analysis. And before we do that, it's always a
good idea to keep saving your work from time to time, because we're running
this on an online service, Binder. So we simply select a project name; here I'm selecting the project name python-eda-stackoverflow-survey. We install the jovian library, then import jovian and run jovian.commit(). The first time, this is going to ask us for an API key, which we can get from our Jovian profile, and we just paste it in here. That is then going to commit this notebook to our profile. So now this notebook has been committed, and you can view it on your Jovian profile whenever you want to. Wherever you're coming from, whether from Binder or from your local computer, everything gets saved on your Jovian profile, and then you can take it and run it on Binder whenever you need to continue your work. All right.
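For reference, the save step looks roughly like this (the project name is whatever you choose):

    !pip install jovian --upgrade --quiet
    import jovian

    # Prompts for your API key the first time, then snapshots the notebook
    jovian.commit(project='python-eda-stackoverflow-survey')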
So moving ahead. Now we have our data as data frames, and while the survey contains a wealth of information, about 65,000 responses to 60 questions, we will limit our analysis to a few areas. And this is what you might want
to do for your projects as well. Just pick a theme for your project
and do not try to do a lot of different things with the dataset. So we will limit our
analysis to three areas. The first would be understanding the
demographics of the survey respondents: who it is that has taken the survey, what the global programming community looks like in general, and whether the survey responses are representative of the global community. We will also try and understand the
distribution of different programming skills, experience, and preferences: specifically, things like which programming languages people like and which ones they don't. And we will also look at some employment-related information, preferences, and opinions, something related to the kinds of roles people hold in the programming and data science fields. Okay. So to do that, let us select
a subset of the columns. So here are some columns for
demographics, and then here are some columns for the programming experience. And then here are some
columns for employment. You can use the schema that we created to check the questions, so do check out the full questions for each of these before moving forward. And let us just check how many columns we have selected: we have selected about 20 columns. Okay.
take our survey_raw_df data frame. If we simply pass a list of columns to it as an index, that is going to select a subset of columns; then we take that data and just create a copy of it, and call that survey_df. We are creating a copy so that we can modify it without affecting the original data frame, so that if we make a mistake, we can always recreate it from the original data frame.
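A minimal sketch of this step, assuming the selected column names have been collected into lists as described above (the list names here are illustrative):

    # Hypothetical names for the lists of selected column names
    selected_columns = demographics_columns + experience_columns + employment_columns

    # .copy() so later modifications don't affect survey_raw_df
    survey_df = survey_raw_df[selected_columns].copy()
    schema = schema_raw[selected_columns]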
So we create the survey data frame: we've taken the selected columns and created a copy. We are now going to pick out the same selected columns from the schema as well, so that our schema also contains just the columns that we need, and we can look at it right now. So here we have the survey data frame: this now has only the 20 columns that we selected, but it still has all the rows. And then we can check the schema as well; here we see the schema, and you can see that it has the question for each of the 20 columns. If you check the shape, you will see that it has 20 entries, and the survey data frame has about 64,000 rows and 20 columns. Now we can use the info method of
survey data frame to see the list of columns and different data types. So these are all the different
columns that we had just selected. And you can see that out of the 64,000-plus entries, not all of
the entries are non-empty. For each of the columns, you'll see some null values, values that are empty in the CSV file. Pandas replaces empty values in the CSV file with np.nan, which is the token it uses for empty values. And because there are so many empty values, and because a lot of columns contain mixed types of data, the data type that you see here is detected as object for most of the columns. Now, the object data type is okay
while we are working with strings. It's not really a problem, but when we
are working with numbers and we want to perform some numeric computations and draw graphs that involve some kind of number processing, then we might need to convert some of these columns into numeric data types. Okay. So far we have just Age and WorkWeekHours as the numeric data types: if you see Age here, it has dtype float64, and WorkWeekHours has dtype float64 too. But there are a few more columns which can also have a numeric data type: we have the Age1stCode, YearsCodePro, and YearsCode columns. All three of these are also
numeric, but for some reason they've been classified as object. So let's investigate that a little bit. If we just look at survey_df.Age1stCode and look at the unique values, it turns out that most of the values are numbers. If you're wondering what the question was, let's just check schema.Age1stCode: the question is 'At what age did you write your first line of code or program?' The answers to that are mostly numbers, but there are also the options 'Younger than 5 years' and 'Older than 85', which are strings. We might want to somehow convert these into numbers or ignore them, because keeping them as strings will get in the way of our analysis. What we will
choose to do here is to replace these two options with empty values, and convert the rest into numbers. And the way to do that is to use
the pd.to_numeric function. The pd.to_numeric function, and you can check the documentation, takes a series or a column and converts it into a numeric dtype. It's going to take all of these values and convert them into floats, and wherever it encounters a string, it will throw an error. But what we want is to ignore the errors and simply replace any non-numeric values with NaN, the empty placeholder value; that's why we are passing in errors='coerce'. Okay. So this is one thing that we're doing for Age1stCode. Similarly, we have YearsCode and YearsCodePro. Let's take a quick look at YearsCode; YearsCodePro is going to be similar. So here's the question: 'Including any education, how many years have you been coding in total?' And if we check the unique values, once again we have options like 'Less than 1 year' and 'More than 50 years'. Once again, we are going to convert those values into empty values, and the rest we are going to convert into numbers. So once again we use pd.to_numeric, and we assign the result back to survey_df['YearsCode']. This way we are replacing the column with a new version of the same column where all the elements are either numbers or empty. And then we do the same for the YearsCodePro column. Okay.
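Concretely, the three conversions look like this:

    # Coerce non-numeric entries like 'Younger than 5 years' or
    # 'More than 50 years' to NaN instead of raising an error
    survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
    survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
    survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')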
and see if there are any other columns that are numeric. But for the time being, we will convert
these three into numeric columns. Now we had two columns earlier
age and work week hours, and then we have three more columns. So in total we have five numeric
columns and we can start seeing some basic statistics about the numeric
columns using the describe function. So here we have, we are called sorority
of dot describe, and you can see that for age, there are the mean, the average age
is about 30 of the survey respondents, but the average age of first coding. So the age at which they wrote their
first line of code is around 15 years of coding is about 12 in on average. And so on what we exert around 40, but you
will start to notice some problems here. It seems like that the
minimum age mentioned is one. A bit seems quite unlikely that a one
year old infant has filled out the survey and the maximum age appears to be 279,
which once again is quite unlikely. And this is a common issue with
real world data in general. And with surveys in particular, you have
to understand that surveys are filled in by people, and there is no obligation to enter the right information. So sometimes people may intentionally enter wrong information, while at other times there might be accidental errors while filling it in: maybe somebody was trying to type 17 but did not press the 7, and so it ended up being 1; or maybe
somebody was trying to type 27 and they accidentally pressed nine as well. And that ended up with 279. So we should try and solve
for these for each column. You should go through it and
try to figure out if the values in the column make sense. And a simple fix in this case would
be to simply delete the rows where the value in the age column is higher than
a hundred years or lower than 10 years. So we are basically saying that
those entire responses are invalid. Maybe they're
invalid unintentionally, or maybe that was intentional. To delete the rows, we can use the drop method of the data frame. We just call survey_df.drop, and you can try to understand the syntax here. What's happening is that we are first checking survey_df.Age < 10; that's going to give us a boolean series of True and False for each row, indicating whether the age is less than 10 or not. We then use that to select only those rows where the age is less than 10. And to pass those to survey_df.drop, we need to pass an index, so we just call .index on this selection; we're basically passing the IDs of all the rows that need to be dropped. And then we're saying that we are doing this in place, so it is going to remove the rows from this same survey_df data frame rather than create a new data frame with the rows removed. There's a lot to unpack here, and the way to work through it is to take each expression and run it on a different line, in a different code cell, and see the results and build it up step by step. Or you can follow the link here, which explains this entire line of code, or just check the documentation of survey_df.drop. Okay.
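Here are the two drops for the Age column:

    # Remove rows with unrealistic ages (likely data-entry errors)
    survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
    survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)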
So with that, we've done this for age: we've removed the rows where age is less than 10, and we've removed the rows where age is greater than 100 as well. And the same holds true
for work week hours. So here, if you check the workweek
hours, it seems like the minimum is one which seems reasonable. Some people might be just
working for one hour a week, but the maximum seems to be 475, and that's probably wrong because the number of hours in a week is only 168. All right. So we can take another approximation here: let us remove all the rows where the value of the WorkWeekHours column is higher than 140 hours, which is about 20 hours per day. So now we call survey_df.drop, and here we are dropping the rows where WorkWeekHours is more than 140.
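That drop looks like this:

    # 140 hours/week (20 hours/day) is a generous upper bound
    survey_df.drop(survey_df[survey_df.WorkWeekHours > 140].index, inplace=True)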
Next, there's the Gender column. If we just check schema.Gender, the question was 'Which of the following genders describe you? Please check all that apply.' So this question about gender allows picking multiple options. Here we have Man, Woman, and Non-binary, genderqueer, or gender non-conforming; those are the three options. But there are cases where people
have picked multiple options too. For instance, there are about 120 responses where people have picked both Man and Non-binary, genderqueer, or gender non-conforming. Now, while this is acceptable in
general, it is going to make our analysis a little bit difficult. So we are just going to do a small simplification here: we are going to take all of the responses where multiple values have been selected and simply replace them with empty values. The reason we're doing this
is that it's going to simplify our analysis just a little bit; we'll be looking at one category at a time. Okay, so I'm just going to do this, and you can follow this line of code. We're calling survey_df.where, and .where takes a condition; based on the condition, wherever the condition is not satisfied, it replaces the value with a specific replacement value that we provide. And you can do it in place. Okay.
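One way to express this, operating just on the Gender column (a variant of the notebook's approach, which applies the same kind of condition with survey_df.where):

    import numpy as np

    # Responses that picked multiple options contain ';' as a separator
    multi = survey_df.Gender.str.contains(';', na=False)

    # Keep single-option responses, replace multi-option ones with NaN
    survey_df['Gender'] = survey_df.Gender.where(~multi, np.nan)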
I will leave it as an exercise for you to figure out exactly what this does, but the end result is this: if we take survey_df.Gender and now check the value counts,
we only have a single option being selected, man, woman, or non-binary. Okay. And that's just for simplification, not to
say that any of these options are invalid. So with that, we have now cleaned up the dataset and prepared it for our analysis. We've made a few assumptions and a few simplifications, we've removed certain rows, and we've replaced certain values with empty values. And this is a typical process that you will follow for any real-world dataset. The lesson here is not to jump immediately into analysis: first, go through the values and
see where the missing values are, and see if you need to do something
about the missing values. So for strings, not really, but sometimes
for numbers, you may want to do some kind of replacement for the missing values. Sometimes you may want to simply
remove those rows altogether; we have not done that in this case. Also deal with any invalid values: any values that you feel are outside a normal range you should get rid of, maybe by removing those rows entirely, or maybe by simply clearing out those values with the empty value, and so on. And then finally, now that we've cleaned up our
data frame, we can just check a sample. I'm just calling survey_df.sample(10) to see a random sample of 10 rows, and this is, again, a very good exercise to do. Just go through some sample data
from your dataset, just to get a sense of what the values in each
column and different rows look like. Okay. So here we can see that the countries look like strings, and the age seems to be okay; it's a number, though there are also places where people have not filled in the age. Then we have the gender. Remember, we've done a simplification here
where we reduced it to one answer, and then we have the education level.
undergraduate major and so on. So just going through each
column is going to help you make better inferences about the data. And with that, our data pre-processing
and cleaning is complete. So let's just commit our work once again. So moving forward now, before we can
ask any interesting questions about the survey responses, it would be helpful for
us to understand the demographics, things like country, age, and gender, anything that you can use to pick out groups from the responses: in short, what the demographics of the respondents look like. And it's important to explore these
variables, mainly in order to understand how representative the survey is of
the worldwide programming community and of the worldwide population in general. And this is, again, the reason this is
important, is because a survey of this scale generally tends to have some bias. In the world, there are a certain number of people who are programmers; out of those programmers, a certain fraction use Stack Overflow, and that's not a randomly, uniformly selected fraction. So there's already definitely
some selection happening there. Out of those who use stack
overflow, a certain number of people have taken the survey. So once again, that is not a
uniform selection from the entire user base of stack overflow. There's probably some selection bias here. People who are more likely to
take a survey probably also share, let's say, three or four other qualities. So it's not a completely random sample. And then there's also how Stack
Overflow publicized the survey: what was the outreach process, and beyond that, who saw the survey and who filled it in. There's also the language of the survey, the kind of questions that were asked, the length of the survey; all of these
things make a big difference in terms of who has actually filled the survey. And all of this is called selection
bias where the respondents of the survey do not come from a randomly picked
sample of the overall population that you want to study. So do keep that in mind, and
that's why it's important to first look at the demographics. Okay. So we're going to do what is
called exploratory analysis and visualization, where we don't
really have a question in mind. We are simply looking at
different rows and columns, comparing things, plotting graphs. And since we will be plotting graphs,
we will be using matplotlib and Seaborn. So here, what I'm doing is I am
importing Matplotlib and Seaborn, and I've set some basic styles: I've used the darkgrid style from Seaborn, and I have increased the font size and the figure size for Matplotlib so that we can see the figures a little more easily. Now, if you want to understand what all of these mean, I will refer you back to the previous lecture on data visualization with Matplotlib and Seaborn. Okay.
so let's first look at the number of countries from which there
are responses in the Surrey and maybe plot the 10 countries with
the highest number of responses. So there was a question where do you live? And that, so the column
name for that was Country. We can see from the schema that Country corresponds to where you live, and from the survey data frame, if we just get the Country column and call .nunique() on it, that's going to give us the number of countries in this dataset. So respondents from 183 countries have answered, and that's quite a wide reach,
but it might be better to look at what the distribution of responses
across countries looks like. Now we cannot plot the entire
distribution for 183 countries. So maybe what we'll do is we will simply
look at the top 15 countries with the maximum responses. And the way to do that is to take
the survey_df.Country column and use .value_counts() on it. Well, let's see what that does. You can see here that .value_counts takes each distinct value in Country, each unique value, and counts the number of occurrences of each one. So here you can see the different value counts: for each country, we now get back a count, and this is a series, essentially, sorted by count. You can also decide whether you want it sorted: you can say sort=True, and
you can say ascending=False. By the way, if you want to show
this documentation in line for any function that you're using, all you
need to do is press shift plus tab. So here on pressing shift plus tab,
and that shows me the documentation. Okay. So we have the value counts and then
we pick the top 15 countries out of it. Those are the top countries: you can see the United States at the top, then India, then the UK.
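In code, those two steps are:

    # Number of distinct countries in the survey
    survey_df.Country.nunique()

    # The 15 countries with the most responses
    top_countries = survey_df.Country.value_counts().head(15)
    top_countries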
And it's good to look at it this way as a table, but you cannot really grasp the differences: visually, what does the distribution look like? And that is where a bar chart will probably help. We can use the index of the series as the x-axis, and the values in the series as the y-axis. We are going to create a figure, calling plt.figure with a figsize just to make this a big figure; we are going to set the title to the question that was asked; and then we're going to use Seaborn's bar plot. Seaborn has been imported as sns, and we use the barplot function; for the bar plot, we give an x-axis and a y-axis. For the x-axis we use the index, the names of the countries, and for the y-axis we use the values, which are the numbers of respondents from each of these countries.
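Putting it together, the plotting cell is roughly:

    plt.figure(figsize=(12, 6))
    plt.title(schema.Country)
    plt.xticks(rotation=75)

    # x: country names (the index), y: number of respondents
    sns.barplot(x=top_countries.index, y=top_countries)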
There you go, now you have the graph. Before we analyze this, I just want to say that by default these labels are printed horizontally, and if we did print them horizontally, there would be a lot of overlap. What we've done here is call the plt.xticks function with rotation set to 75: that has taken these labels and rotated them by 75 degrees, and because of this you can see that they're slightly slanted and we can read them all. Okay. So now we have the graph. Looking at the graph, it seems
like a disproportionately high number of responses came from
the United States and India, right? The United States has 12,000 responses, and then India has 8,000, which is only about seventy-five percent of the number of responses from the US; next is the United Kingdom, which is less than half of India. All right. And that already tells you that probably
this survey is not really representative of programmers around the world: 12,000 plus 8,000 plus 4,000 is about 24,000 out of 65,000, so more than a third of the responses have come from these three or four countries. And if you think about
this a little bit, it makes sense because one, this survey is in English,
so it was only conducted in English, and therefore programmers from non-English-speaking countries probably did not get to hear of it. Second, Stack Overflow itself is a platform that is completely in English, so its user base is primarily from countries where English is spoken in professional settings. And those happen to be the top three countries, the United States, India, and the United Kingdom, where English is spoken on a day-to-day basis in professional life. Okay.
survey may not be representative of the entire programming community, especially
the non English speaking countries. And it's something for
stack overflow to consider. Maybe they should try translating their
questions and answers into different languages, and maybe they should also translate this survey into different languages so that they can get a more representative sample. Okay. Now, there's an exercise for you here.
finding the percentage of responses from English speaking versus
non English speaking countries. So here I've linked to a list, a
CSV file, which contains the list of languages spoken in different countries. See if you can combine that data with
this data to create a new column that says English speaking, and
maybe it contains yes or no, or true or false for the English speaking column. And then see if you can see how
many responses are from English speakers, how many are not. Okay. So that was about the different countries. The data came from probably the next thing
that we can study is the distribution of the age of the respondents. So the age is another
important factor to look at. And this time, because age is
numeric, we can probably use a histogram to visualize it, so we are going to use plt.hist. Let's just check what the question about age was: the question was 'What is your age (in years)?' And a lot of people may have preferred not to answer this question; fortunately, when we use these Matplotlib functions, any empty values are automatically ignored. So we are calling plt.hist, and we are passing survey_df.Age; this is the column containing the age values, and these are all numeric, remember. Then we're also going to set the bins. We want to take the ages starting from 10, because we've removed everything below age 10, and go up to 80; you could also go up to a hundred if you wish, so let's change that to a hundred. And we are going to split this entire range of 10 to 100 years into bins of five years. You could split it into bins of 10 years as well, but we'll use five years, and then we will count the number of responses in each age group. Okay.
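The histogram call is something like this:

    import numpy as np

    plt.figure(figsize=(12, 6))
    plt.title(schema.Age)
    plt.xlabel('Age')
    plt.ylabel('Number of respondents')

    # 5-year bins from age 10 up to 100; NaN ages are ignored automatically
    plt.hist(survey_df.Age, bins=np.arange(10, 100, 5))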
This is what the chart looks like. It seems there are very few responses below 15 years of age, and a little over 2,000 just above 15, but the bulk of the responses seems to be in the range of 20 to 45 years; maybe you could say 15 to 50 years. So that seems to be the sort of
the professional lifespan of a programmer to a large extent. But on the other hand, you can still
see thousands of responses above and beyond the ages of 45 and 50. It's common that a lot of people tend to fall into this 20 to 50 age range, but on the other hand, you have people all the way up to close to 80 years of age. Okay, so that's good. Now we understand the distribution of
age and roughly this is representative of the programming community in
general, especially since a lot of young people have taken up computer science as a field of study or a profession in the last 20 years, right? Colleges now have
computer science degrees. The number of jobs in computer
science have increased and most new jobs tend to be taken by younger
people and most new degrees as well. So that's why it is
slightly representative. And you can do some research on how exactly representative it is and which age groups are left out. And here's an exercise for you: you may want to create specific age groups like 10 to 18 years, 18 to 30 years, 30 to 45, 45 to 60, and so on, and create a column called AgeGroup which, based on the age, will contain one of these values. Then you can repeat the analysis in the rest of this notebook for each age group.
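One way to bucket the ages, sketched with pd.cut (the bin edges and labels are just the ones suggested above):

    bins = [10, 18, 30, 45, 60, 100]
    labels = ['10-18', '18-30', '30-45', '45-60', '60+']

    # Each respondent gets the label of the age range their age falls into
    survey_df['AgeGroup'] = pd.cut(survey_df.Age, bins=bins, labels=labels)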
If you just want to pick out your age group, let's say you're in the 30 to 45 age group, and you just want to know what programmers in your age group think, then you can do this analysis just for your age group. And that'll be an
interesting thing to try out; that's a project idea right there. Then let's look at the distribution of gender. We've already done a small simplification here, where we excluded multiple responses. Now, it's a well-known
fact that women and non binary genders are underrepresented
in the programming community. So we might expect to see
a skewed distribution here. If you check schema.Gender, it looks like the question that was asked was 'Which of the following describe you, if any?', so people were free to leave it blank. And then we have the gender counts: we're taking survey_df.Gender and using value_counts here, and already you can see that there is a huge drop. There are 45,000 responses where people have selected Man as the gender, and only about 3,800 where people have selected Woman. And in value_counts, you can also include
dropna=False; what that does is also tell you how many people have picked nothing, no value at all. Now, what we can do is use a pie chart to visualize this distribution.
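Roughly:

    gender_counts = survey_df.Gender.value_counts()

    plt.figure(figsize=(10, 6))
    plt.title(schema.Gender)
    # One slice per option; autopct prints the percentage on each slice
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%')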
So once again, we take these gender counts and we use plt.pie; we can give it some labels as well, and we can give the chart a title. So here you go. Now, it seems like around 71% of people have picked Man, only about 6% have picked Woman, and only about 0.6% have picked Non-binary, genderqueer, or gender non-conforming. And if we exclude the NaN values, so among the people who have answered, it seems like over 91% are men. So that means only about
8% are women or non-binary, and this number is actually far more skewed than the overall percentage: the overall percentage of women and non-binary genders in the programming community is estimated to be about 12%. So this number still shows some skew, and what it tells us in general, along with the overall figure, is that there is a diversity problem in the programming community; after all, 50% of all people are women, and a further percentage (I'm not sure about the exact number) also identify as one of the non-binary genders. So we definitely need to have more
representation in the programming community and we should support people
from underrepresented communities and encourage them to be part of
the programming community. Okay. And an interesting exercise for you
now would be to compare the survey responses and preferences
across genders and repeat this analysis with those breakdowns: for each graph, each bar chart, try to show men versus women side by side and see how things differ. For instance, you could try
and figure out how the relative education levels differ across genders: do women hold similar degrees in terms of percentage, or do they hold higher degrees? You may be surprised. How do the salaries differ? That is another thing to figure out; we definitely know that there is a gender pay gap, so maybe you can discover that, and there's a column which talks about salaries. And then there's this linked analysis on the gender divide in data science, which you may find useful. That is also an exploratory
Data Analysis project, if you want to explore it a little bit. So that was about gender. Now let's talk about the education level. Formal education in computer
science is often considered an important requirement for becoming a programmer, and computer science is one of the most sought-after degrees, both at the bachelor's and the master's level. Let's see if this is indeed the case,
because on the other hand, a lot of you may have learned programming on
your own, and there are many free resources and tutorials available
online to learn programming. So we will use a horizontal
bar chart to compare the education levels of different respondents. So you can just check here
schema.EdLevel, since there's an EdLevel column. The question was which of the following
describes the highest level of formal education that you've completed. So keep that in mind. This is the highest level. So what we are going to
use now is a count plot. What's a count plot? We can check the documentation: a count plot shows the counts of observations in each categorical bin using bars. What that means is, if you check the different values that the education level column contains, say by taking survey_df.EdLevel and checking the unique values, these are all the unique values,
and the count plot is going to tell us for each of these values,
how many observations are there? Like how many times was this particular
option picked, or how many times does this particular value show up in that column? Okay. So we pass survey_df.EdLevel to countplot; you can just do this, but that is going to make vertical bars. If you want horizontal bars, you can instead pass y=survey_df.EdLevel. So this is what the graph looks like; maybe we should increase the size of the figure a little bit, so let's just set plt.figure with a larger figsize. Okay, this is a lot better.
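The plotting cell is along these lines:

    plt.figure(figsize=(12, 6))
    plt.title(schema.EdLevel)

    # Passing the column as y (rather than x) makes the bars horizontal
    sns.countplot(y=survey_df.EdLevel)
    plt.xlabel('Count')
    plt.ylabel(None)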
As a reminder, the question is which of the following describes the highest level of formal education you've completed. And now you can see here that it seems
like, out of the 65,000 respondents, over 25,000 hold a bachelor's degree, and then another close to 12,000 hold a master's degree. And then there are a few more who hold a doctoral degree, probably about 1,500 or so. So all of these combined, it
seems that over half of the respondents hold
a bachelor's or master's degree. So most programmers definitely seem
to have some college education, maybe some kind of STEM education, but it's not clear from
just this graph alone, whether they hold a degree in computer science. Okay. So let's dig a little
bit deeper into that. And one problem with this graph is that
here we are showing the absolute numbers. And probably what we really want
to understand is percentages. So one exercise for you is to convert
this graph, to just show percentages instead of the full numbers. Okay. And that will probably give a clearer idea
because we probably want to know, out of the number of people who have responded to this question, what percentage have mentioned that a bachelor's degree
is the highest degree that they hold. And that is probably the more
relevant question to ask. So you can try and modify this
code to just show percentages. Okay. But that aside, we can tell that over half of the respondents hold a bachelor's or a master's degree. All right. So now let us
plot the undergraduate majors. This time we will look at schema.UndergradMajor, which was 'What was your primary field of study?', and we will then convert the counts into percentages. To convert them into percentages, we take the value counts for each of the values: here we say survey_df.UndergradMajor.value_counts(), so for each major, like the computer science major with around 31,000 responses, we get a count. What we can do then is divide that by the total number of responses given for undergraduate majors. Let's take survey_df.UndergradMajor and call .count() on it; .count() counts the total number of non-empty values. If we do that division, you can see that for each major we now get back a fraction, 0.61 and so on, and if we multiply that by a hundred, we get back a percentage. All right.
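In code, the computation and the plot are roughly:

    # Percentage of each major among respondents who answered the question
    undergrad_pct = (survey_df.UndergradMajor.value_counts()
                     * 100 / survey_df.UndergradMajor.count())

    sns.barplot(x=undergrad_pct, y=undergrad_pct.index)
    plt.title(schema.UndergradMajor)
    plt.xlabel('Percentage')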
So now we see that about 61% picked computer science, another 9.3% picked another engineering discipline, and so on. All right. But it's probably better to look at this using a graph, so we just put the result into a variable called undergrad_pct and use sns.barplot to plot it. Okay, so now here we have it. In terms of the primary field of
study, it seems like out of the people who've responded over 60% say that
computer science, software engineering, or computer engineering was their major. Now, this is like a glass half full, half empty kind of situation. The way I would interpret this is that close to 40% of programmers holding a college degree have a field of study other than computer science, which is very encouraging. A lot
of people after college feel that you cannot switch your field. That is definitely not
true for computer science. If you want to get into computer
science and you have some sort of formal education, some sort of STEM
education, you can absolutely pursue it. There are so many online resources, and close to 40% of the people in the domain are from streams other than computer science. So I think this is a very encouraging sign, and I think this number is only going to go higher, because there are better and better resources available and programming is proliferating into pretty much every domain now. You do a little bit of programming no matter what you study, and that equips you to become a programmer as well. Even data science, for example, is
primarily a lot of programming. So what we understand is that while college education is helpful, you do not need to pursue a major in computer science to become a successful programmer. And one trend that you might have
noticed here while we are doing exploratory analysis is that first,
every time we plot some graph, there is some background to it. There is a reason why we are
exploring a particular column. And this is something that we've
explained here that we want to understand. Like the reason we are looking at
education level is to understand whether formal education is important or not. So have some background,
have something in mind when you are exploring a particular column. And then once you've explored that
column, once you've plotted a graph, try to gain some insight from it, try
to make some kind of an inference or an observation or a hypothesis based on it. Sometimes you may need to then
go and do more research to identify if your hypothesis is correct. And in other cases, the inference is pretty clear: for instance, in this case, it is
pretty clear that a lot of people do not have a computer science degree. All right. And that is the best part of exploratory analysis: you get to find all of these interesting inferences. Each time you draw a graph,
you learn something new. Each time you look at a column,
you learn something new. And then there's an exercise here for you: there's a column called NEWEdImpt. Let's see what that column is; let's check schema.NEWEdImpt. Okay, so the question for that column is 'How important is a formal education, such as a university degree in computer science, to your career?' Now, what you can do is take
this column and analyze the responses, the distribution of responses for
people who hold a computer science degree versus those who don't. Okay. Try and analyze these results. What percentage
of people who hold a computer science degree have selected that
a formal education in computer science is important to their career, and what percentage of people who do not hold a computer science degree have selected that formal education is important for their career, and see if you notice any difference in opinion. My guess is that somebody who holds a college degree may value it highly, but somebody who does not hold a college degree but still became a programmer will probably say it's not that important. So do check it out; there are more
insights to be gained here. And then one last area that
we will look at is employment. Now, especially among programmers, freelancing, contract work, and part-time work are slowly becoming more popular choices. So it would be interesting to see
the breakdown between full-time, part-time, and freelance work. So let's visualize the
data from the employment column. So the employment column was which
of the following best describes your current employment status. And what we've taken here is
once again, we are going to plot a simple horizontal bar chart. And one of the things that I want
you to take away from this is also that simple charts are often
good enough for a good analysis. You do not have to come up with a lot
of fancy charts, although it's good if you can find the best kind of graph for every situation; even simple bar charts, line graphs, and scatter plots can give
you a lot of information, right? So you are very well
equipped at this point if you've worked through this course. So now we look at survey_df.Employment.value_counts, and we're setting normalize=True. When we set normalize=True, it gives fractions instead of counts; we then sort in ascending order and multiply by a hundred to get percentages. And then we use Pandas' own plotting method, .plot, and this is just to show you a variation among the different ways of plotting: we are going to plot a horizontal bar chart in green. Okay.
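That cell is along these lines:

    # Fractions of each employment status, smallest first, as percentages
    employment_pct = (survey_df.Employment
                      .value_counts(normalize=True, ascending=True) * 100)

    # Pandas' own plotting: a horizontal bar chart in green
    employment_pct.plot(kind='barh', color='g')
    plt.title(schema.Employment)
    plt.xlabel('Percentage')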
Now you see that among the people who have replied to this question, 'Employed full-time' is about 70%; so 70% of respondents are employed full-time, but there are a fair number of students as well, about 12%. You might want to break this down further: there are people who are not employed but are looking for work; there are freelancers and part-timers; and there are people who are maybe just hobbyists, not really looking for work, not employed, or retired. So you might want to create a new
column, EmploymentType, that contains values like 'Enthusiast', which could
mean students or people who are not employed but are looking for work; then 'Professional', which includes people who are employed full-time or part-time or are freelancing; and then 'Other', which has
people who are not employed or retired. And then what you can do is you can
see, for each of these graphs, how the preferences differ between students and professionals, or between enthusiasts and professionals, especially for some of the things that we'll do after this, like analyzing programming language preferences. So that's a good exercise to do in
any survey, in any analysis, all of these breakdowns offer a lot of
insights and the best way to do it is identify which group you lie in. Let's say you are parsing by gender or
by age or by your employment status, and then do the analysis just for
yourself and people like yourself. And that is going to give
you a lot more insight. Okay. Now, one interesting observation here
Now, one interesting observation here is that if you take away students, then among people who are employed, at least 10% are working independently, as contractors, freelancers, or self-employed people, for instance people doing startups or running their own companies. And that is pretty encouraging. That's a high number for a technical field like this: being able to work on your own without being formally associated with any company. It's also a way for you to try and break into the field. If you are looking to become a programmer or to break into data science, maybe initially you can try some freelance work, an internship, or some part-time work, and that can help you transition into a full-time role. So that's something to consider as well.
Now, in terms of the actual roles held by the respondents, we can look at the DevType field. The question was: which of the following describes you? And there are a bunch of different values provided for the roles. The problem here is that the question allows selection of multiple values. If you check DevType.value_counts(), or let's just do DevType.unique(), you will see that there are a lot of different possibilities; you can see there's 'Developer, desktop or enterprise applications' and so on. Let's just use value_counts, that's probably a little easier to read. You can see there are some simple responses where a single option was picked, like 'Developer, full-stack' just by itself, or 'Developer, back-end'. But respondents could also select multiple options, and a semicolon indicates that multiple options were picked. About 3,000 people picked three options, back-end, front-end, and full-stack developer, and about 2,000 picked back-end and full-stack. Then there are a lot of people who picked many different combinations. It's not really clear from the data alone how many options were available, but it seems this person picked a whole bunch of options, and so did this person, and so on. So we might need to do some more processing: we might need to take this column, which contains lists of values separated by semicolons, and split it out into multiple columns.
For that, we are going to define a helper function called split_multicolumn that takes a series, that is, a column of data containing list-like values such as survey_df.DevType, and splits the values of that column out into multiple columns.
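Here's a minimal sketch of what such a helper might look like (the notebook's actual implementation may differ in details):

```python
import pandas as pd

def split_multicolumn(col_series):
    # One output column per option; rows are True where that option was picked
    result_df = pd.DataFrame(index=col_series.index)
    options = []
    for idx, value in col_series[col_series.notnull()].items():
        for option in value.split(';'):
            if option not in options:
                options.append(option)
                result_df[option] = False  # initialize the new column
            result_df.at[idx, option] = True
    return result_df
```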
Now, I will not go over the code here in detail, but you can try it yourself: run each line one by one, in a separate cell, and try to understand it. By this point, I hope you are well equipped to understand the code. If you're not, you can ask on the forum, share where you're stuck or which part you don't understand, and have a discussion to figure it out. But let us look at the output. We know what the input looks like: the input to split_multicolumn is a series where people have picked either none, one, or more than one option for their job role.
We can call split_multicolumn on survey_df.DevType, and passing in this column returns a DataFrame. So we get back a DataFrame, dev_type_df, and if we check it, it has one column for each of the options provided for the question. Now we have one column for 'Developer, desktop or enterprise applications', one for 'Developer, full-stack', 'Developer, mobile', 'Designer', and so on. In total there are 23 columns, so about 23 options were given for your job role. And for each respondent, we have either True or False. For instance, this respondent has not selected desktop or enterprise developer but has selected full-stack developer, and this respondent has selected mobile developer but not front-end or back-end.
So this is how, sometimes, we might need to do a little more processing of our data. We might need to break one column out into an entire DataFrame so that we can do our analysis. And now that we have this DataFrame, dev_type_df, we can use it to identify the most common roles. A simple way to do that is to count the number of Trues in each column. You know that True, when converted to a number, becomes 1, and False becomes 0, so we can simply take the column-wise sums with dev_type_df.sum() and then sort those values in descending order. That gives us the dev_type totals.
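In code, roughly:

```python
# True counts as 1 and False as 0, so a column-wise sum counts the selections
dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals
```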
So now you see that 'Developer, back-end', 'Developer, full-stack', 'Developer, front-end', and so on seem to be the most common roles. And that's not surprising: Stack Overflow is primarily a tool used by developers, professional developers, for finding answers to small questions while writing code. So it's not uncommon that developer roles are the most common ones. But one interesting thing for you to figure out would be: what percentage of respondents work in roles related to data science?
You can also try to figure out which role has the highest percentage of women. Now that you have this DataFrame, you can merge it back into the original one: create a new merged DataFrame that contains the columns from survey_df plus these columns for each role, and then find out which role has the maximum percentage of women. That's an interesting thing to figure out; a sketch of the merge is shown below.
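A sketch of how you might set this up; the role column and gender category used here are assumptions for illustration:

```python
import pandas as pd

# Put the role columns alongside the original responses
merged_df = pd.concat([survey_df, dev_type_df], axis=1)

# e.g. percentage of women among full-stack developers
# (the column and category names are assumptions; check the data)
fullstack_df = merged_df[merged_df['Developer, full-stack']]
women_pct = (fullstack_df.Gender == 'Woman').mean() * 100
```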
So with that, we end our exploratory analysis, and we've only explored a handful of columns, about five or six of the 20 we selected. You can explore and visualize the remaining columns here; there are some empty cells below, and you can always add new cells using Insert Cell Below. So please do that: the more you explore, the more you'll learn. It's possible that while working through this notebook you find five or ten interesting columns and just want to do a project using those, and that's perfectly acceptable. You can use this dataset for your project. Just do not repeat the same analysis that is done here; do something a little more interesting. And before we continue, let us upload our work. From time to time, keep running jovian.commit so that you do not lose your work.
All right. Now we come to a slightly more interesting part, although I think the exploratory analysis was pretty interesting as well: we can ask some specific questions and then answer them. We've already gained several insights about the respondents and about the programming community in general simply by exploring individual columns, but now let's pose specific questions and try to answer them using DataFrame operations and interesting visualizations. The first question we'll ask is: what were the most popular programming languages in 2020? This survey was conducted in February of 2020, so technically this is 2019's data.
If we look at schema.LanguageWorkedWith, the question asked was: which programming, scripting, and markup languages have you done extensive development work in over the past year? But this is a two-part question, and the second part is: which do you want to work in over the next year? The respondents were presented with a list of options, and for each option there were two checkboxes: the first to indicate whether they worked with it in the past year, and the second to indicate whether they want to work with it over the next year. The responses were then broken into two columns. We have LanguageWorkedWith, which contains the answers to the first part, which languages have you worked with in the past year, and LanguageDesireNextYear, which shows the exact same question text but contains the responses to the second part. I'm showing you this because it's something you'll notice with real-world data, and especially with how surveys are conducted: without the context, you might not understand the difference, because LanguageWorkedWith and LanguageDesireNextYear appear to have the exact same question. So you may want to go through the README, or take the survey yourself, to understand that there are two parts, and the first part is covered in the first column and the second part in the second.
Putting that aside, let's look at what some values in the LanguageWorkedWith column look like. It looks like, once again, people could select multiple options, multiple languages. You can see that the first person selected C#, then HTML/CSS, then JavaScript, separated by semicolons. So this is similar to the DevType field, and the first thing we might want to do is split this into multiple columns. We just call split_multicolumn on survey_df.LanguageWorkedWith, and we get languages_worked_df, another DataFrame, this time with 25 columns. So it seems 25 languages were presented to the respondents, and for each respondent we have True or False: True indicating that the respondent has used the language and False that they have not. Now, going back to the question: which were the most popular programming languages in 2020? What we can do is identify percentages: what percentage of people selected JavaScript, what percentage selected Swift, Python, and so on, and then plot that as a bar chart.
Once again, to get these percentages, we simply use languages_worked_df.mean(). This works because True becomes 1 and False becomes 0: if we take the column-wise mean, we get back the fraction of True values in each column. The mean is simply the sum of all the values divided by the total number of values, and since the zeros, the Falses, contribute nothing, that is the number of True values divided by the total number of values, which is essentially the fraction of Trues. To convert the fraction into a percentage, we multiply by a hundred, and then we sort the values in descending order. That gives us the percentages for each language.
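A sketch of that computation, reusing the names from above:

```python
# Column-wise mean of a boolean DataFrame = fraction of True values
languages_worked_percentages = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_percentages
```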
It seems JavaScript is the most popular language, followed by HTML/CSS and SQL, and so on. Let's visualize this once again using a horizontal bar chart. Looking at the languages used in the past year, JavaScript was the most popular, followed by HTML/CSS. This is no surprise, because today a lot of software has moved to the web. You probably spend most of your time in the browser; even the Jupyter notebook platform we're using actually runs in the browser. And to write code that runs in the browser, you have to write HTML and CSS, and you have to write JavaScript for interactivity. JavaScript might rank higher than HTML/CSS because you can also use JavaScript on the server side, using a framework called Node.js. For all these reasons, JavaScript is the most popular language: it is the de facto language of the web.
So we've plotted this chart, and based on it we can make some inferences. Next, consider that today all applications need some kind of database, and the most popular kind is the relational, or tabular, database. A lot of data, say the data of your Facebook account, your Twitter, your Instagram, any platform you use, including the data you put into Jovian, is saved in SQL databases, and the way to interact with those databases is the SQL language. That is why SQL is pretty popular as well. But beyond that, if you take away web development and database access, then for actual application development, back-end development, data science, and other non-web-related areas, it seems Python is the most popular language. This is again no surprise, because Python is a general-purpose language, and it has beaten out Java. Java was the de facto language for pretty much all development for about 20 years, but it seems Python has now overtaken it. So it's a good thing you're learning Python; it is definitely an in-demand language.
There is a whole wealth of information you can gather just by exploring this question a little deeper. For example: what are the most common languages used by students, and how does that list compare with the most common languages used by professional developers? Is there a gap between what students learn and what professionals use? You might also ask: what are the most common languages among respondents who do not describe themselves as front-end developers? In front-end development you don't really have a choice, you have to use JavaScript (TypeScript is a choice, but a limited one), so if you exclude front-end developers, what are the most common languages then? Can you find the most common languages used by respondents working in a field related to data science? Maybe also look at age: for developers older than 35, or developers with more than 10 years of programming experience, what are the most common languages, and what are the most common languages among younger people? Do you see a shifting trend? And what are the most common languages used in your home country? There are responses from over 180 countries here, so is there a difference between, say, the US, India, China, and other countries?
Moving ahead, a similar question we could ask is: which languages are people most interested in learning over the next year? For this we can use the LanguageDesireNextYear column, which needs pretty much identical processing: we take the column, split it into multiple columns, and then get percentages for each language, once again by taking the mean, multiplying by a hundred, and sorting the values.
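The processing mirrors what we did for LanguageWorkedWith (a sketch):

```python
languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_percentages = languages_interested_df.mean().sort_values(ascending=False) * 100
```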
You can see the language-interested percentages here; these are the values. Let's jump directly to the visualization, which is once again pretty much identical. It seems Python is the language most people are interested in learning, with JavaScript and HTML/CSS close behind, followed by SQL and TypeScript. And it's no surprise that Python is the most sought-after language: it's an easy-to-learn, general-purpose programming language, well suited to a variety of domains like application development, numerical computing, and data analysis. In fact, we are using Python for this very analysis, so you are in good company. If you're learning Python, you can apply it to a whole host of different domains.
What you can do now is repeat the same exercises we discussed for the most common languages, just rephrased in terms of the languages people are most interested in. Those are some exercises for you. The next question combines these two things: which are the most loved languages? That is, for which languages do we see a higher percentage of people who have used the language and want to continue using it over the next year? This may seem like a somewhat complicated question: people who have used the language in the past year and also want to continue with it, how am I going to figure that out? It may seem a little tricky, but it's actually really easy to solve using pandas DataFrame operations.
Here's what we'll do. We will create a new DataFrame, languages_loved_df, which has the same structure as languages_worked_df and languages_interested_df: a column for every language and a row for every respondent. There is a True value only where the corresponding values in both languages_worked_df and languages_interested_df are True, that is, where somebody has worked with the language in the past year and wants to continue using it. And the way to do that is really simple: you take languages_worked_df, put in an ampersand, and pass languages_interested_df. This performs an element-wise Boolean AND: if the two respective values are both True, you get True, and if you have a True and a False, or two Falses, you get back False, and this happens on a per-element, per-value level.
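In code, that single operation is (a sketch):

```python
# Element-wise AND: True only where the respondent both used the language
# last year and wants to keep using it next year
languages_loved_df = languages_worked_df & languages_interested_df
```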
Now if we look at languages_loved_df, for example, this respondent shows True for C#, because this person has worked in C# and is interested in continuing to work in C#. We're treating that as a proxy for saying they love the language. Let's convert that into percentages. For each language, we want to identify how many people love it out of the number of people who used it in the past year. So we take languages_loved_df.sum(), a column-wise sum, and divide it by languages_worked_df.sum(). For the C# column, that counts how many people love the language (how many Trues there are in that column) divided by how many people have used it. Then we multiply by a hundred to convert it into a percentage and sort in descending order.
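Roughly (a sketch):

```python
languages_loved_percentages = (
    languages_loved_df.sum() / languages_worked_df.sum()
).sort_values(ascending=False) * 100
```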
Let's take a look: we have languages_loved_percentages, and for each language we now have a percentage, so you can see who the winner is. The winner seems to be a language called Rust; let's look at it in a plot. You may not have heard of Rust. It is a low-level language for systems programming that provides the performance of languages like C++ but with many conveniences and a type system inspired by some of the best languages, things like Scala and Java. It's a pretty useful language, a lot of people enjoy using it, and it's interesting to see that a small language with a growing community is one of the most loved. You can see hints of this in the earlier graphs: Rust is used by a very small fraction of people, far smaller than, say, JavaScript, but in the graph of languages people are interested in learning, Rust is way up near the top, close to a third or more of JavaScript's share. It definitely seems to be a language gaining a lot of popularity and interest, so if you're looking for a new language to learn, Rust may be a good choice. This metric we just calculated is something Stack Overflow computes every year based on their survey results, and Rust has been Stack Overflow's most loved language for four years in a row, followed by TypeScript, which again is a language that offers an alternative for web development. These are the kinds of things you should do: once you get an answer, once you get a graph, search online for why the result might be what it is, so you can learn a little more, in this case about Rust and TypeScript.
Now, what I find even more interesting is that Python features at number three despite already being such a widely used language. That's generally not true for widely used languages: JavaScript's love score is fairly low, and Java's is far lower, whereas Python has remained at number three. So it seems people who use Python enjoy Python, and that is because the language has a solid foundation and is really easy to learn and use. I hope you've been able to learn Python and can now say you're comfortable with it in just these six weeks. Python also has a strong ecosystem of libraries for various domains and a massive worldwide community of developers who enjoy using it, and that now includes you and me. I've been using Python for the last 12 years, I definitely want to continue using it for the next 12 as well, and I hope you'll feel the same way. So that's the most loved languages; we now have some insights about that.
Next, a simple exercise you can try is to identify the most dreaded languages: languages people have used in the past year but do not want to use or learn over the next year. There's a small hint here: you can simply invert the languages_interested_df DataFrame, and the way to invert it is using the tilde (~) operator. Just invert that DataFrame, then do the same thing we did here, and you should be able to find the most dreaded languages. A sketch follows.
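A minimal sketch of the hinted approach, where ~ negates the boolean DataFrame:

```python
# Used in the past year AND NOT desired for next year
languages_dreaded_df = languages_worked_df & ~languages_interested_df
languages_dreaded_percentages = (
    languages_dreaded_df.sum() / languages_worked_df.sum()
).sort_values(ascending=False) * 100
```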
Then see whether your results match what the Stack Overflow results present; you can always refer to them. Moving further along, the next question is: in which countries do developers work the highest number of hours per week? To answer this, we need to use the groupby method of a DataFrame. And there's a small caveat: we only want to consider countries with more than 250 responses, because otherwise the average isn't really representative. There are countries with thousands of responses, and we're setting a threshold of 250 so that a country where only 10 people responded is not considered when computing the average number of hours per week. So we group the data by country. What this does is, for each unique value of Country, and there are 184 of them, it takes all the rows belonging to that country and separates them out into groups. So far we've not performed any operation, so you don't see any result yet. Now, from each of these groups of rows, the column we are interested in is WorkWeekHrs, so we select it just as we select columns of a DataFrame, and just as an example, let's select the Age column as well. So now, for each of the groups, we have all the rows, we've selected only the WorkWeekHrs and Age columns, and we need to aggregate them.
One way to aggregate them is using the mean. If we take the mean, we get back a new DataFrame where the index is the names of the countries, all the unique values of Country, and the values for WorkWeekHrs and Age are the averages from those groups. So for all the rows from Afghanistan, we've taken the average of WorkWeekHrs and put that value here, and similarly we've taken the average Age across all the rows from Afghanistan and put it here. That is how groupby works, and you can learn more about it in the pandas lecture, which is lesson four.
Now, what we want to do is look only at countries with more than 250 responses. The way we'll do that is to first create a countries DataFrame, which is exactly what we just did: group by country, keep just WorkWeekHrs, take the mean to aggregate, and then sort by WorkWeekHrs in descending order. That's our countries_df.
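Roughly (a sketch, using the WorkWeekHrs column from the schema):

```python
countries_df = (
    survey_df.groupby('Country')[['WorkWeekHrs']]
             .mean()
             .sort_values('WorkWeekHrs', ascending=False)
)
```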
You can see the countries with the highest averages and the ones with the lowest, but it's possible that for a lot of these countries the number of responses is really low. So we will create a new DataFrame, high_response_countries_df, where we only select the rows where the value counts are greater than 250. We take survey_df.Country, find the value counts, that is, the number of responses from each country, and filter to only those where the count is greater than 250. Then we use the .loc function to pick those rows from countries_df, and we pick the top 15 of those. Let's see.
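A sketch of that filtering step:

```python
# The boolean Series from value_counts() aligns with countries_df's index
high_response_countries_df = countries_df.loc[
    survey_df.Country.value_counts() > 250
].head(15)
```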
Now we have the top 15 countries. Once again, if you do not understand this expression, there are a couple of things you can do. Revise the pandas lecture, that's one. Look at the documentation for .loc. And split the expression into small parts: first take survey_df.Country.value_counts() and run it in a cell, then compare that with 250 and see what the result is, then put that into countries_df.loc and see what that gives you, and then add the .head. With all of these things, it's a question of breaking them down step by step: the more you break them down, the more you understand, and the better you'll be able to use them.
So now we have the high-response countries, and these are the 15 countries with the highest number of working hours. It seems some Asian countries, like Iran, Israel, and China, have the highest working hours, followed by the United States. That's intense. And then we have Greece, so programmers are probably working a lot in Greece too. Once again, a majority of these seem to be Asian countries, then there are a few European countries, and there is the United States. But overall there isn't too much variation: 44 is the highest value, and if you treat the first three as outliers, the rest range from about 41 down to 40. So on average, people are working about 40 hours per week; there's no country averaging 60 hours or one averaging 20 hours, at least among the top 15.
Now, a few exercises you can try: compare how the average working hours differ across continents (you may find a list of countries in each continent useful). Try to find out which role has the highest average number of hours worked per week, out of all the roles we looked at; you may need to merge with the dev_type DataFrame we created, which had one column per role. And try to compare how the hours worked differ between freelancers and full-time developers. It's possible you'll find that the average for full-time developers is around 40, but one of the reasons people take up freelance or even part-time work is that they want free time for other things. So try to verify whether that is true: do freelancers work less or more? That could even help you decide between a freelance and a full-time role. Then let's ask one more question: how important is it to start young to build a career in programming?
This is again something a lot of people wonder about, not just for programming but for data science and, in general, for any field: can you enter the field if you have not studied it in college? I think we've established that even if you didn't take it in college, you can still enter the field. But can you enter it if you've worked in a different domain for a few years?
Can you enter this field in your thirties, your forties, and so on? To answer this question, we will create a scatter plot of age versus years of professional coding experience. The YearsCodePro question asks: not including education, how many years have you coded professionally, as part of your work? So we'll plot Age on the x-axis and years of professional coding experience on the y-axis, and that should give us a hint.
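A sketch of the plot (column names as in the survey schema; the hue choice is explained below):

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.xlabel('Age')
plt.ylabel('Years of professional coding experience');
```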
Here is the chart. Let's look at some values and try to understand it. We've also added a color for each point, and we'll get to that in a moment. If you look at, say, age 40, there are several people at that age with less than 10 years of professional experience. In fact, at every age, all the way from around 15 to close to 50, there are people who have just started working as programmers. What that means is that you can start programming professionally at any age; there is no rule that you have to start early. People are starting in their twenties, thirties, and forties, and everybody's welcome. And remember, this is professional experience, not just programming experience in general. So if you put in the work, if you're open to learning and excited about it, you can definitely get into the domain.
We have also added a color for each of the dots; each dot represents one response. If a person is a hobbyist, we represent them with blue, and if not, with orange. And once again, it turns out that a lot of people who are programmers also say that programming is a hobby for them, especially in the initial years. To get through the initial years, it really helps to have programming as a hobby, something you do just to build things, to solve problems, something you do on the weekend. If you do that, you will probably also have a long and fulfilling career in programming.
So those are some inferences we can draw from the scatter plot. We can also look at the distribution of the Age1stCode column, to see when people tried programming for the first time.
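For instance (a sketch; the notebook may use a different seaborn function, and this assumes Age1stCode was converted to a numeric column earlier):

```python
sns.histplot(survey_df.Age1stCode, kde=True);
```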
As you might suspect, a lot of people have had some exposure to programming: maybe they've written a first line of code, a simple HTML page, or just a hello-world program. And Python's hello world is just this: you type python into a terminal and write a single print statement, and that's your first program.
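The classic first program:

```python
print("Hello, world!")
```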
But that by itself doesn't really tell you what programming is. It seems a lot of people have been exposed to programming in their teens, and that makes sense: in pretty much every field, you end up writing some code. In Excel you write formulas; in different streams of engineering you probably use MATLAB or some kind of numerical computing package; in computer science, of course, you write programs. In pretty much every field there is some code you end up writing, so it's not surprising that people get some exposure to programming at an early age. But there are also people who first encounter it after a certain age: you can see a fair number doing so after the age of 30 and after the age of 40, and a small number becoming exposed to it even much later.
So there are people from all ages and walks of life who are learning to code. Now, here are a few questions you can try to answer. How does experience change opinions and preferences? Maybe you can repeat the entire analysis while comparing the responses of people who have more than 10 years of professional programming experience with those who do not: which languages do they use, which languages do they want to learn? This goes back to students versus professionals; now you would have three categories: students, professionals, and experienced professionals. Do you see any interesting trends? Do you see what one might call a generation gap between coders from the old days and people learning right now? What kinds of roles do they occupy, and what languages do they prefer?
Maybe you can also compare the years of professional coding experience across different genders. My guess is that although women and minorities are underrepresented right now, you will see that there are more women entering the field now than there used to be earlier, so things are definitely improving, and you can try to validate that by comparing years of professional coding experience across genders. So with that, we have barely scratched the surface.
We've been talking for about 90 minutes and we've already gathered a huge number of insights, and hopefully you're thinking of many more questions you would like to ask and answer using the data. We have not even used all 20 columns, only about 12 or 13, and there are another 45 columns in the full dataset to pick from. So use the empty cells below to ask and answer more questions, and try out all of these exercises. There's really no end to this: the more you do, the more you experiment, the more you exercise these skills, the better you get. Now, I've used fairly simple charts and haven't done many breakdowns; I wanted to leave those as exercises for you. But try replacing each chart or graph we've drawn with a different kind of graph. Go through the Matplotlib gallery and the Seaborn gallery and pick out graphs that might be interesting to draw. These are all different exercises for you to try. There's a lot of depth in data analysis, and you could probably spend at least a few months just exploring different ways to slice, dice, analyze, and visualize the data. So please do that; the best way to learn is by doing.
Now we'll summarize some of our inferences and conclusions, and this is always a good thing to do at the end of an analysis. Based on the demographic data, we can infer that the survey is somewhat representative of the overall programming community, although it definitely has fewer responses from programmers in non-English-speaking countries and from women and non-binary genders. We have also learned that the programming community is probably not as diverse as it could be, in terms of gender, age, and perhaps the different languages and countries represented, so we should take more efforts to support and encourage members of underrepresented communities; race is another factor we haven't looked at where there's a lot of disparity. We've learned that most programmers hold a college degree, although a fairly large percentage did not have computer science as their major, so a computer science degree isn't compulsory to learn to code or to build a career in programming, but some STEM education definitely helps. A significant percentage of programmers work part-time or as freelancers: 10% is actually a pretty good number, and this can be a great way to break into the field, not just in programming but also in data science, which are closely related fields. We learned that JavaScript and HTML/CSS were the most popular programming languages used in 2020, and that Python is the language most people are interested in learning. We learned that Rust and TypeScript are the most loved languages, both of which have small but fast-growing communities. It seems programmers around the world work about 40 hours per week on average, with slight variations by country. And finally, we learned that you can learn and start programming professionally at any age, and that you're likely to have a long and fulfilling career if you also enjoy programming as a hobby, which especially helps during the first few years. Alright, so that's our analysis.
As I said, there's a wealth of information to be discovered, and we've barely scratched the surface. Here are a few more ideas I wanted to share. You can repeat the analysis for different age groups and genders and compare the results. Specifically, try to pick a slice of responses that represents you, maybe the country, gender, or age group you're in, see what the preferences of those people are, and see whether that reflects your own opinions. Try choosing a different set of columns: we've chosen 20 out of 65 and used about 12 of them, so you can look at a lot of the other columns; read through the README and go through the survey. Try repeating the analysis with a focus on diversity: identify the areas where underrepresented communities are at par with the majority. You might see that in education there's probably not a big difference in the percentages of degrees held, but there are places where there are differences, like salaries, where you will see there is definitely a big gap, and you can try to validate that. You can also compare this year's survey results with the previous year's and identify interesting trends, because this data is released every year. Once again, you can go back to insights.stackoverflow.com and download the raw data for every year.
Now, one interesting exercise is to look at the official survey results. Stack Overflow has done a pretty long analysis, a whole bunch of analysis on pretty much the same questions we have answered. So look at the survey results and try to replicate them, graph by graph. This is a great way to check whether you are doing the same kind of data cleaning, analysis, and simplification, and to see how those choices affect your results. If you can replicate all of these results, that's a great sign: this is real-world data, a large dataset, and a real analysis, so that's a sign you've done something significant in data analysis, and you can proudly showcase it on your professional profile.
I just want to share a few references now. We've used pandas, Matplotlib, and Seaborn, so you can refer to the previous lectures; just go to zerotopandas.com and you'll find them there. You can watch those lectures, or you can go through the documentation and the user guides for pandas, Matplotlib, and Seaborn. Also go through the galleries on their websites; these galleries show all the different types of charts you can create with these libraries. And finally, as I mentioned, we are building the opendatasets Python library, a curated collection of datasets for data analysis and machine learning. So far we have about six or seven datasets, and we are planning to add about a hundred over the next few days. We released this library just yesterday, and it's something we put together quickly to make it easy for you to download these datasets, so you can use them for your course project as well.
And that is the next step I want to talk to you about. What we saw today, the exploratory data analysis, is basically what you need to do for your course project. You simply need to repeat this process: you can use the same dataset and ask different questions, do a different analysis, pick different columns, or you can pick a different dataset entirely. So let's open up the course project page. About the course project deadline: once again, I want to remind you that it has been extended to October 3rd, 11:59 PM GMT, so you have more than a couple of weeks to work on this.
The objective of the course project is exactly like what we did today. You find a real-world dataset of your choice, you use NumPy and pandas to parse, clean, and analyze the data, and you use Matplotlib and Seaborn to create visualizations. Then you ask and answer interesting questions about the data. An optional but highly recommended step, because you've put in so much work, is to consolidate all of your learning into a blog post to showcase your work. I just want to give you a quick overview of the course project, and then we have a few exciting things to close out. The course project has a starter notebook, and by the way, we've done a walkthrough of the course project in last week's video as well, so you can check that out too. You can take the starter notebook and just run it on Binder, or you can run it on your local computer; you do not have to run it on Binder. So let me show you how to run it on your computer.
Here I have a terminal; this could be a terminal, a command prompt, or an Anaconda prompt. Let me zoom in a bit. If you want to download the notebook to your local computer and run it locally, click the Clone button and copy the command it shows. To run that command, you need the Jovian command-line tool installed, so first I'm going to run pip install jovian --upgrade. That upgrades the Jovian Python library, and once it's installed you have a command-line tool called jovian that you can use. So now I can copy the clone command, come back to the terminal, and run jovian clone followed by the title of the project, that is, the username slash the name of the project, and press Enter. That downloads the files; you can see they got downloaded to my desktop, into a folder named after the starter project. If you wish, you can change the name of this folder. Say I'm going to analyze the State of JavaScript survey: I'll call the folder state-of-javascript-2019, since that's the data I'm going to analyze for my project. Then I go into this folder.
Now you need to install all the different libraries, and we suggest installing them inside a conda environment. You can manually create an environment with conda create -n followed by a name, say course-project, and you can set a Python version for it. These are the same instructions provided in each lecture notebook. So let's create a conda environment: conda create -n course-project. That creates a Python environment where we can install all our libraries. Once the environment is created, we activate it with conda activate course-project. Inside this environment, we install the libraries we want to use: jovian, jupyter, opendatasets (you don't have to, but you might), pandas, numpy, seaborn, and matplotlib. So we install all the libraries after activating the environment. Sometimes this might take a while, and this is one of the reasons we recommend Binder: all of these steps are taken care of for you. Once the libraries are installed, we can start Jupyter by typing jupyter notebook.
Let me come back and open this once again. Once you run jupyter notebook, it prints out a URL that you can open in your browser. So, a quick recap: step one was to install the Jovian Python library, and of course you also need the Anaconda distribution of Python installed. Step two was to clone the notebook using the jovian clone command. Step three was to enter the directory and create a conda environment, using conda create. Step four was to activate the environment with conda activate and install all the libraries using pip install. And step five was to open Jupyter by typing jupyter notebook, take the URL it prints, and open it in a browser.
Now you can see we have the zerotopandas-course-project .ipynb file. At this point, things are pretty similar to where we get when we click the Run button and choose Run on Binder. Run on Binder is a one-click experience, which is why we recommend it, but with a few more steps you can run everything on your local computer, and I think you're now comfortable enough to figure these things out; if not, you can always ask on the forum. So now we have the course project notebook. The first cell is a text cell, a Markdown cell, and you should remove it before submission. This cell describes what you need to do; it gives you the guidelines. The first step is to select a real-world dataset: you have to find and download an interesting real-world dataset.
We have given some recommendations for datasets: there's a forum topic on recommended datasets for the course project, with links to many places you can download data from. For example, there's Kaggle, there's the UCI Machine Learning Repository, there's a GitHub repository with a list of datasets, and of course we're sharing the opendatasets library with you and will keep adding more datasets there. So there are a lot of places to get datasets, and we've also picked out some interesting ones for you. Today we used the Stack Overflow developer survey, but there's also COVID-19 data, which is updated daily, the State of JavaScript survey, stocks data, country-wise COVID data, agriculture data, data science job data, sports data, and video game data. There's a lot of interesting data to analyze, and there are also many places where you can download your own personal information and analyze it. So please go through this list and try to identify something you find interesting. Whatever you pick, make sure the dataset contains tabular data, preferably CSV or Excel files, so that it can be read using pandas.
Then you perform some data preparation and cleaning, just as we did: you load up the data frame, look at the number of rows and columns, decide which columns you want to use, and decide how you're going to handle any missing or invalid data. Maybe you need to parse some dates or numbers, create additional columns, or merge multiple datasets.
Then you perform exploratory analysis and visualization: explore the distributions of numerical columns using histograms, use bar charts to visualize categorical columns, use scatter plots to see relationships across multiple columns, and take note of interesting insights from the analysis. Then ask and answer questions about the data: you have to ask at least five interesting questions and answer them either by computing results using NumPy and pandas or by plotting graphs using Matplotlib and Seaborn. Whenever you use a library function, briefly explain what it does; it's always helpful for the reader. And finally, take your inferences and summarize them in a conclusion. This is a really important part: consolidating everything you've learned into a single paragraph or section, and also sharing ideas for future work on the same topic, using the same dataset or other relevant datasets. Make sure to share links to any resources that might be helpful for people reading your analysis, so definitely share links to documentation, and maybe share a link to the course page so that people who are not familiar with pandas can use it.
The last step is to make a submission and share your work. Whether you're using Binder or running on your local computer, from time to time you need to run jovian.commit. You just set a project name; for instance, my project name would be something like stateofjs-survey-2019-analysis, and then you use the Jovian library to run jovian.commit() and commit the project.
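For example (the project name here is just an illustration):

```python
import jovian

# Commit the notebook to your Jovian profile; the project name is hypothetical
jovian.commit(project='stateofjs-survey-2019-analysis')
```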
That takes the notebook, either from Binder or from your local computer, and puts it onto your Jovian profile. You then take that link, go back to the course project page, paste the link in, and click Submit; once you do that, you will see it show up in your submission history. Make sure it is a Jovian link: don't submit a localhost link, a Binder link, or a Kaggle or Colab link. Please commit to Jovian and submit a Jovian link, and make sure the link is to a notebook hosted on your own profile; otherwise the submission will be rejected.
Then we will evaluate your project. What does evaluation look like? We have shared the evaluation criteria: your dataset must contain at least 3 columns and 150 rows of data; you must ask and answer at least five questions about the dataset; your submission should include at least five visualizations; and your submission should include explanations using Markdown cells, apart from just code. Code alone is not good enough, so please write explanations. That helps you understand your data and gather insights, and it will also help others. If tomorrow you want to showcase this project on your public profile, share it on LinkedIn, or link to it from your resume, you want it to be well presented, because presentation, believe it or not, is a very important part of data science. So do not skip that. It's not just about writing code; it's about gathering inferences, presenting them, making interesting observations, maybe forming hypotheses and digging deeper. The data gives you just some facts, but you have to analyze it and work out what those facts mean, in the context of the dataset itself, or your company, or whatever you're working on. And finally, your work should not be plagiarized, so do not copy-paste from somewhere else. Of course, you can borrow functions, and every dataset has been analyzed by many different people; you can even look at notebooks by other people. But do not plagiarize; I think you can tell best whether you are plagiarizing. The entire project should not be a copy-paste, and the biggest loser in that case would be you yourself. So please don't do that.
Apart from that, do share your work online. You've put a lot of effort into this project, so please share it on your social media. Once you've committed it to Jovian, you can simply use the Share button and share it on any of these platforms. Share your work on the forum as well: there are tens of thousands of course participants, and we would love to see thousands of projects shared on that thread. Browse through projects by other participants and maybe give feedback. That's also a great way to learn: when you see other people creating similar or different visualizations, you get to learn from their code, from their analysis, and so on. So please do that.
is to write a blog post and a blog post is a great way to
present and showcase your work. So you can sign up on medium.com to
write a blog post for your project. It's really simple. You just sign up and then you click new
story and you can just start typing and you can simply, as a starting point, you
can simply copy over the explanations from your Jupyter notebook into your blog post. And in terms of the code and the
As for the code and the graphs, you can actually embed them. You can take code and graphs from the notebook you submitted to Jovian and embed them within your blog post. Just watch this video for a quick tutorial, or follow this guide. You can see here that this is a blog post on Medium, and inside it is an embedded code cell. You can embed some code together with its outputs, like graphs, or you can embed just the graphs if you prefer.
The benefit of writing a blog post over a Jupyter notebook is that a notebook contains a lot of code and a lot of pre-processing steps, but in your blog post you can decide the narrative: you can pick the right code blocks and the right graphs from your Jupyter notebook and embed them within your blog post to tell the story that you want. It doesn't have to follow the same structure as the Jupyter notebook, and it will be a great thing to showcase on your profile, to share on your social media, to put on your resume, or just to mention when you're applying for an internship or things like that. And you can check out our Medium publication, which we've linked to, for examples of how to write a good blog post.
There are many good examples, all written by people from the community during a previous course, and in a lot of cases these were the first or second blog posts those people had ever written. So please do check it out, and don't be afraid; you can write one too. It just takes a little more effort, maybe a few more hours after you finish your project. But do write one if at all possible.
As I mentioned, we've shared some recommended datasets and some example projects. You can go through these projects, and you can keep revisiting this video as well, just to get a sense of how you should analyze your data. You can either use this notebook as a starting point, since we've created this template for you, where you can put in the project title, write some introduction, and fill in the sections for each step. Remember to commit your work at each step so that you do not lose it. Or you can start from a blank notebook; that is perfectly all right. It's all a question of what you feel most comfortable with. And do remember to remove this cell before your submission so that the instructions are not included in your final submission. Okay.
So with that, we've just reviewed the course project as well. You have time till the 3rd of October, which should be sufficient. So please do put in the work; if you've come this far, you should definitely do a course project while you have all of these things in your head, and it will really reinforce all the ideas that we've learned.
Now, I just want to do a quick recap of the course for a couple of minutes. A lot of us started out without any Python programming experience, so we began with an introduction to programming with Python. We took our first steps with Python and Jupyter notebooks, using them like a calculator. We explored data types and variables, and we saw branching with conditional statements and loops. Then, in the second lecture, we looked at writing reusable code with functions and working with the OS and the filesystem. And then you saw the first assignment, where we solved some word problems using variables and arithmetic operations, manipulated data types using methods and operators, and used branching and iteration to translate ideas into code. We also learned how to explore the documentation and how to get help from the community.
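Just to jog your memory, here is the flavor of those first lessons in a few lines (an illustrative sketch, not code from the course notebooks):

```python
# Variables and arithmetic: using Python like a calculator
ticket_price = 25
num_tickets = 120
revenue = ticket_price * num_tickets

# A reusable function with branching
def describe_revenue(amount):
    if amount >= 3000:
        return 'a great show'
    return 'needs more marketing'

# A loop that translates an idea into code
for night in range(1, 4):
    print('Night', night, '-', describe_revenue(revenue))
```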
After learning Python, we looked at NumPy. We saw how to go from Python lists to NumPy arrays, and how to work with multidimensional arrays. We saw the different array operations and matrix operations that you can do. We learned about slicing, and we learned about broadcasting; NumPy by itself is a very powerful library. And we also saw how to work with CSV data files.
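As a quick refresher, here is roughly what those NumPy ideas look like in code (a minimal sketch):

```python
import numpy as np

# From Python lists to a multidimensional NumPy array
climate = np.array([[73, 67, 43],
                    [91, 88, 64]])

# Slicing: first row, last two columns
print(climate[0, 1:])

# Broadcasting: the 1-D weights array is applied to every row
weights = np.array([0.3, 0.2, 0.5])
yields = (climate * weights).sum(axis=1)
print(yields)
```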
Then we did an assignment on NumPy array operations, where you explored the NumPy documentation and demonstrated the usage of five NumPy functions. You created a Jupyter notebook with explanations about those five functions, showing how to use them and how not to use them. We shared hundreds of notebooks with the community and probably learned a lot from each other.
Then we learned how to analyze tabular data with pandas, which covered reading and writing CSV data with pandas. We learned how to query, filter, and sort data frames; pandas data frames are really powerful, and even today we've seen a lot of different functions that we probably did not explore earlier. We also looked at grouping and aggregation for summarizing data, and at merging and joining data from multiple sources. And then we did an assignment on pandas where we applied all of these things that we learned.
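As a refresher, those pandas operations look something like this (a minimal sketch; countries.csv and gdp.csv are hypothetical files):

```python
import pandas as pd

# Reading CSV data into a data frame (hypothetical file)
countries_df = pd.read_csv('countries.csv')

# Querying, filtering and sorting
largest = countries_df[countries_df['population'] > 1e8].sort_values(
    'population', ascending=False)

# Grouping and aggregation to summarize the data
by_continent = countries_df.groupby('continent')['population'].sum()

# Merging and joining data from another source
gdp_df = pd.read_csv('gdp.csv')
combined = countries_df.merge(gdp_df, on='country')
```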
Finally, we had one lecture on visualization with Matplotlib and Seaborn, where we learned how to do basic visualizations with Matplotlib and advanced visualizations with Seaborn: things like line charts, scatter plots, bar charts, heatmaps, and histograms. We also saw how to customize and style charts, how to make them beautiful, how to plot images, and how to plot multiple charts in a grid.
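Here is the flavor of that lecture in a few lines (a minimal sketch using Seaborn's built-in tips dataset, which downloads on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships with small sample datasets that are handy for practice
tips = sns.load_dataset('tips')

# A scatter plot with points colored by a categorical column
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')

# Customizing and styling the chart with Matplotlib
plt.title('Tips vs. total bill')
plt.xlabel('Total bill ($)')
plt.ylabel('Tip ($)')
plt.show()
```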
All of these things we then tied together in today's lecture on exploratory data analysis, where we took a real-world dataset, the Stack Overflow developer survey with 65,000 responses. We loaded the data, cleaned it, pre-processed it, and did exploratory analysis and visualization. Then we asked and answered questions and made a bunch of inferences. And now you're working on the course project, where you will repeat this process on a real-world dataset of your choice. So that was a quick recap of the course. Now, what should you do next? Try out the notebooks yourself. You can revise any of the previous lectures, watch the videos, and run the notebooks; they're just one click away at any point. Definitely try out the Stack Overflow survey analysis and some of the exercises there.
And if you have any questions, you can always ask them on the forum. I've been saying this from the start: people who are active on the forum are the most likely to complete the course. So if you've not checked out the forum till now, just go to zerotopandas.com, and there is a link to open up the forum. Here you can see there's a course community discussion forum, with topics for each lecture. For any question you have, go to the relevant topic and first search through it; it's likely that your question has already been answered. If not, you can always post a new reply on that topic, and somebody will answer it.
We have an amazing course community, and we've been seeing huge contributions from people spending hours just answering questions from other people. So I just want to give a big thank you to them. Please do participate in the forum.
Now, if you complete all the assignments and the course project, then once we evaluate all of that and you get a pass grade in all of them, you will be issued a certificate of accomplishment. This will be a verified link hosted on Jovian: a page on Jovian, part of your profile, where it can be displayed, and it will be available for download as a PDF as well. So if you want to download it and print it out, you can do that too. You will be able to add it to your LinkedIn profile, so that anybody who looks at your profile will see that you have completed a certification on data analysis with Python, and you will also be able to share it online on Twitter or Facebook or wherever.
We've all put in a lot of effort, so we can definitely celebrate once we earn the certificate of accomplishment; do share it, and even encourage your friends to take the course in future sessions. This is what the certificate looks like, and it will be embedded into a webpage from where you will be able to download it and share the page as well.
The thing you should not do is immediately jump to another course. Work on a project, and make your project as large and as interesting as possible, because it's not enough to say that you've done a course, it's not enough to say that you have a certification, and it's not enough to say that you've done a small project. You should have a significant data analysis project under your belt, something that you have documented and presented well, and something that you have put on your public profiles. So do something that you feel proud of.
Then put it up on your public profile, improve your professional profile, and write blog posts, tutorials, and guides. You can do this on Medium or on GitHub Pages; there are a lot of platforms where you can write blogs, and you can use Jovian to share your Jupyter notebooks, which I'll talk about a little bit as well. Do write guides that can help people who are where you once were. Look back at your earlier self and try to write maybe a small tutorial for that person, to encourage them and to demystify data science for them.
Or maybe just tell them that pandas is not as scary as it might seem, or point them to this course and say that they can learn about it here. There are a lot of resources available online that you now know about, so you can curate them and share them with your community. If you're a student, share them with your classmates; if you're a professional, share them with your colleagues and within your company. The more you share your knowledge, the better you get at it as well. Okay.
professional profile. So showcase your certificate, showcase
your project on your LinkedIn profile on your guitar profile on your resume. So do that. And then once you feel like you've
really done a lot of work in on this topic, that is a point at which you
should then take more data science and machine learning courses, right? So do not fall into the trap of
just doing a bunch of courses without any real output out of them. The best way to learn is by
doing and doing good projects. Now you can use Jovian to build your
Now, you can use Jovian to build your professional data science profile, and we are working on some very interesting improvements to the profile, a lot of which we've already added. If you open Jovian and log in, then go to your profile, you will see that you can add more information: there's an edit button where you can add your current designation, your current university or company, and you can also link your GitHub profile. And here you can see a collection of all the notebooks that you have created so far. For any Jupyter notebook that you create, any interesting analysis that you do, just upload it to Jovian: simply run jovian.commit() inside it, and it will get added to your profile.
inside it, and it would get added to your profile. You can also upload notebooks directly. If you have a Jupyter notebook somewhere
on, let's say you have somewhere on GitHub or you have somewhere. You have some air on your
computer, you can upload that to, or you can import it from a URL. So you can do that. You can also create collections so
You can also create collections, where you join interesting notebooks together into a single collection. For instance, I have a collection on deep learning, and I have a collection on data analysis, to which I'm going to add a few more notebooks. So that's an interesting way to organize your notebooks.
You also get access to the forum as part of the profile. So try to answer questions on the forum, because that is also going to reflect on your professional profile; the more questions you've answered, the more knowledgeable you appear.
So do use Jovian. For all the projects hosted on Jovian, you get a really nice, optimized view of the Jupyter notebook by default. These are mobile-friendly views, so even if you open them on mobile, they will load up really fast; we spent a lot of time tuning the performance of these pages. If you're sharing a link to a Jupyter notebook, a lot of people are probably going to open it on mobile, and the notebook might not render well there on other platforms. So please do use Jovian; it's a great way to share your work.
It's also a gateway to showing all the work that you've done, because the version history of a notebook shows how much work you've put in. If you have a notebook with some 20 or 30 or 50 versions and you share it with somebody, they can go through it and see that you've really put a lot of effort into it. A lot of building your professional profile is about just showing what you've done and making it visible for people to discover and learn more about. Okay.
And you can follow us on Twitter; we are Jovian on Twitter. We keep tweeting interesting notebooks and interesting resources for data science. In fact, if you share your notebook online and tag us, we will definitely try to retweet it; we try to retweet four or five interesting notebooks every week at the very least.
So we hope you'll find Jovian useful. On behalf of the entire course team, I just want to say a big thank you to you: for being so active on the forum, for doing all the assignments and working on the project, for going through all the lectures, and for just being awesome overall. We were really excited to run this course, and this is really not the end. We are hoping to continue to have a long association with you, and we have a lot of interesting things planned for you. So I will see you in the forums.
This is the end of our lectures, but we will be able to interact with you via the forums. You can follow freeCodeCamp and Jovian on Twitter, or follow me as well. So with that, we come to the end of Data Analysis with Python. Thanks a lot for joining, and all the best to those of you who are still working on the assignments and the course project. I hope to see you soon in a future course. Thank you and goodbye.