Welcome to our Data Analysis with Python tutorial. My name is Santiago and I will be your instructor. This is a joint initiative between freeCodeCamp and RMOTR. In this tutorial, we'll explore the capabilities of Python and the entire PyData stack to perform data analysis. We'll learn how to read data from multiple sources such as databases, CSV and Excel files, how to clean and transform it by applying statistical functions, and how to create beautiful visualizations. We'll show you all the important tools of the PyData stack: pandas, Matplotlib, Seaborn and many others. This tutorial is going to be useful both for Python beginners who want to learn how to manage data with Python, and also for traditional data analysts coming from Excel, Tableau, etc. You'll learn how programming can power up your day-to-day analysis. So let's get started.

Welcome to our Data Analysis with Python tutorial.
My name is Santiago and I am an instructor at RMOTR, an online data science academy. This tutorial is the result of a joint effort by RMOTR and freeCodeCamp, and it's totally free. It includes slides, Jupyter notebooks and coding exercises. Let me tell you a little bit more about RMOTR. We're an online, hands-on data science academy. We specialize in data science, including data analysis, programming and machine learning. We have a complete course catalog and we're adding more content every month. If you're interested in learning data science or data analysis, check us out. As part of this joint effort between freeCodeCamp and RMOTR, you can get a 10% discount on your first month by using the following discount coupon.

Let's quickly review the contents of this tutorial. In the description of this video, we have included direct links to each section, so you can jump between them. This is the first section, and we are going to discuss what data analysis is. We'll also talk about data analysis with Python and why programming tools like Python, SQL and pandas are important. In the following section, we'll show you a real example of data analysis using Python, so you can see the power of it. We will not explain the tools in detail; it's just a quick demonstration for you to understand what this tutorial is about. The sections after that will be the ones explaining each tool in detail. There are two more sections that I want to especially point out. The first one is section number three, the Jupyter tutorial. This is not mandatory, and you can skip it if you already know how to use Jupyter notebooks. Also, the last section, Python in under 10 minutes, is just a recap of Python. If you're coming from other languages, you might want to take it first. If that's the case, again, you can use the links in the video description to jump straight to it.

All right, now let's define what data analysis is.
I think the Wikipedia article summarizes it perfectly: the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Let's analyze this definition piece by piece. The first part of the process of data analysis is usually tedious: it starts by gathering the data, cleaning it and transforming it for further analysis. This is where Python and the PyData tools excel. We're going to be using pandas to read, clean and transform our data. Modeling data means adapting real-life scenarios to information systems, using inferential statistics to see if any pattern or model arises. For this we're going to be using the statistical analysis features of pandas and visualizations from Matplotlib and Seaborn. Once we have processed the data and created models out of it, we'll try to draw conclusions from it, finding interesting patterns or anomalies that might arise. The word information here is key: we're trying to transform data into information. Our data might be a huge list of all the purchases made in Walmart in the last year; the information will be something like "Pop-Tarts sell better on Tuesdays." This is the final objective of data analysis: we need to provide evidence of our findings, create readable reports and dashboards, and aid other departments with the information we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives, etc. They might each need to see a different view of the same information, and they might all need different reports or levels of detail.
What tools are available today for data analysis? We've broken these down into two main categories. First, there are managed, closed tools: products you can buy and start using right out of the box. Excel is a good example; Tableau and Looker are probably the most popular ones for data analysis. At the other extreme, we have what we call programming languages, or open tools. These are not sold by an individual vendor; they are a combination of languages, open source libraries and products. Python, R and Julia are the most popular ones in this category. Let's explore their advantages and disadvantages. The main advantage of closed tools like Tableau or Excel is that they are generally easy to learn. There is a company writing documentation, providing support and driving the creation of the product. The biggest disadvantage is that the scope of the tool is limited: you can't cross its boundaries. In contrast, using Python and the universe of PyData tools gives you amazing flexibility. Do you need to read data from a closed API using secret-key authentication, for example? You can do it. Do you need to consume data directly from AWS Kinesis? You can do it. A programming language is the most powerful tool you can learn. Another important advantage is the general scope of a programming language. What happens if Tableau, for example, goes out of business? Or if you just get bored with it and feel like you need a career change? Learning how to process data using a programming language gives you freedom. The main disadvantage of a programming language is that it's not as simple to learn as a tool: you need to learn the basics of coding first, and that takes time.

So why are we choosing Python for data analysis?
Python is the best programming language to learn to code. It's simple, intuitive and readable. It includes thousands of libraries to do virtually anything, from cryptography to IoT. Python is free and open source. That means there are thousands of pairs of eyes, very smart people, looking at the internals of the language and its libraries. From Google to Bank of America, major institutions rely on Python every day, which means it's very hard for it to go away. Finally, Python has a great open source spirit. The community is amazing, the documentation is exhaustive, and there are a lot of free tutorials around. Check out the conferences in your area; it's very likely that there is a local group of Python developers in your city. We couldn't be talking about data analysis without mentioning R. R is also a great programming language. We prefer Python because it's easier to get started with and more general in the libraries and tools it includes. R has a huge library of statistical functions, and if you're in a highly technical discipline, you should check it out.

Let's quickly review the data analysis process.
The process starts by getting the data. Where is your data coming from? Usually it's in your own database, but it could also come from files stored in a different format, or from a web API. Once you've collected the data, you will need to clean it. If the source of the data is your own database, then it's probably in decent shape. If you're using more extreme sources like web scraping, then the process will be more tedious. With your data clean, you'll now need to rearrange and reshape it for better analysis: transforming fields, merging tables, combining data from multiple sources, etc. The objective of this process is to get the data ready for the next step. The process of analysis involves extracting patterns from the data that is now clean and in shape, capturing trends or anomalies; statistical analysis will be fundamental in this process. Finally, it's time to do something with that analysis. If this were a data science project, we would be ready to implement machine learning models. If we focus strictly on data analysis, we'll probably need to build reports, communicate our results, and support decision-making. Let's finish by saying that in real life this process isn't so linear; we're usually jumping back and forth between the steps, and it looks more like a cycle than a straight line.
What is the difference between data analysis and data science? The boundaries between data analysis and data science are not very clear. The main differences are that data scientists usually have more programming and math skills, which they then apply in machine learning and ETL processes. Analysts, on the other hand, usually have better communication skills, creating better reports with stronger storytelling abilities. By the way, the Euler diagram you're seeing right here is available in the notes, in case you want to check out the source code.

Let's explore the Python and PyData ecosystem: all the tools and libraries that we will be using. The most important libraries we'll use are pandas for data analysis, and Matplotlib and Seaborn for visualizations. But the ecosystem is large, and there are many useful libraries for specific use cases.
How do Python data analysts think? If you're coming from a traditional data analysis place, using tools like Excel and Tableau, you're probably used to having a constant visual reference of your data. All these tools are point-and-click. This works great for a small amount of data, but it's less useful when the number of records grows. It's just impossible for humans to visually reference too much data, and the processing gets incredibly slow. In contrast, when we work with Python, we don't have a constant visual reference of the data we're working with. We know it's there, we know what it looks like, we know its main statistical properties, but we're not constantly looking at it. This allows us to work with millions of records incredibly fast. It also means you can move your data analysis processes from one computer to another, and for example to the cloud, without much overhead. And finally, why would you want to add Python to your data analysis skills? Aside from the advantages of freedom and power, there is another important reason: according to PayScale, data analysts that know Python and SQL are better paid than the ones that don't know how to use programming tools. So that's it, let's get started. In our following section we'll show you a real-world example of data analysis with Python; we want you to see right away what you will be able to do after this tutorial.
We're going to start this tutorial by working with a real example of data analysis and data processing with Python. We're not going to get into the details yet; the following sections will explain what each one of the tools does, what the best way to apply them is, how to combine them, and all the details. This is just for you to have a quick, high-level reference of the day-to-day processes of data analysts, data managers and data scientists using Python. The first dataset that we're going to use is a CSV file that has this form; you can find it right here, under the data directory. The data we're going to be using is this; I have just transformed it into a spreadsheet so we can look at it from a more visual perspective. But remember, as we said in the introduction, as data analysts we are not constantly looking at the data; we don't have a constant visual reference. We are driven by an understanding of the data in the back of our heads: we understand what the data looks like and what its shape is, and that's what guides our analysis. So the first thing we're going to do is read this CSV into Python, and you can see how simple it is: just one line of code gets the CSV read into Python.
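As a rough sketch, that one-liner looks something like this; the file path and the Date column are assumptions based on the dataset shown in the video, so adjust them to your copy:

```python
import pandas as pd

# Read the CSV into a pandas DataFrame; parse_dates converts the
# (assumed) Date column into proper datetime values on the way in.
sales = pd.read_csv('data/sales_data.csv', parse_dates=['Date'])

sales.head()  # peek at the first five rows
```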
Then we'll take a quick look at it, and this is what the DataFrame that we have created looks like. DataFrame is a special word for a special data structure we use in pandas, and again, we're going to see it in detail in the pandas part of this tutorial. The DataFrame is pretty much the CSV representation, but it has a few more things enforced: for example, each column has a strict data type, and we will not be able to change it arbitrarily. It's a better way to conduct our analysis. The shape of our DataFrame tells us how many rows and how many columns we have. You can imagine that with this number of rows it's not so simple to follow a visual representation of the data; it's pretty much impossible to scroll through, at this point, 100,000 rows. The way we work is that immediately after we load our data, we want to find some sort of reference for the shape and properties of the data we're working with. For that, we'll first run an info to quickly understand the columns we're working with.
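Continuing with the hypothetical `sales` DataFrame from the sketch above, that first structural check is roughly:

```python
sales.shape  # a tuple: (number of rows, number of columns)

# Column names, data types, and non-null counts in one call:
sales.info()
```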
In this case we have Date, which is a datetime field; Day, Month and Year, which are just complementary to Date; Customer_Age, which is an integer, which makes sense; Age_Group, right here; Customer_Gender; and so on. We get an idea of the entire dataset: we know the columns we have, but we also know how large it is. And we don't care what's in between; we will probably be cleaning it, but we don't need to start looking at it row by row with our very limited eyes. We get a better understanding of the structure of our data this way. Going one step further, we can also get a better understanding of the statistical properties of this DataFrame with the describe method. For all the numeric fields, it gives us an idea of their statistical properties.
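In code, that's a single method call:

```python
# Summary statistics (count, mean, std, min, quartiles, max)
# for every numeric column in the DataFrame.
sales.describe()
```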
For example, I know that the average age in this dataset (this is sales data) is 35 years old. I also know that the maximum age is 87 years old, and the minimum is 17 years old. And again, I can start building my understanding of the statistical properties of the data. In this case, the median of the age is very close to the mean, so that is telling me something, and the same thing is going to happen for each one of the columns we're examining. For example, we have a negative profit here, and we have very large values here. Are these correct, or is there maybe a mistake? Again, by having a quick statistical view of our data, we're going to be driving the process of analysis without the need to constantly look at all the rows that we have. It's a more general, holistic overview.

So we're going to start with unit cost.
Let's see what it looks like. We're going to run a describe on only Unit_Cost, which is pretty much what we had right here. In the previous line, what we did was for the entire DataFrame; in this case, we're just focusing on the Unit_Cost column. The mean, the median, all the fields, we pretty much know already from this. And we're going to quickly plot them; we're going to use these tools to visualize them. It's the same tool: it's pandas using Matplotlib on top, so the visualization is created with Matplotlib, but we're doing it directly from pandas. And again, don't worry, this is all explained in the pandas lessons.
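A minimal sketch of what's running here, assuming the column is named Unit_Cost as in the dataset shown on screen:

```python
import matplotlib.pyplot as plt

unit_cost = sales['Unit_Cost']

unit_cost.describe()                    # stats for just this column
print(unit_cost.mean(), unit_cost.median())

# Plotting straight from pandas (Matplotlib under the hood):
unit_cost.plot(kind='box')              # box plot: quartiles, whiskers, outliers
plt.show()
unit_cost.plot(kind='density')          # kernel density estimate
plt.show()
unit_cost.plot(kind='hist')             # histogram of product costs
plt.show()
```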
So this is unit cost: this is the box plot we have just created. We have the whiskers and the box that show us the first and third quartiles and the median, and then we see all the outliers that we have right here. So we see that products around $500 are considered to be outliers. We see the same thing if we do a density plot; this is what it looks like. We're going to draw two more charts, in which we pretty much point out the mean and the median on the distribution chart, and we're going to do a quick histogram of the costs of our products.
Moving forward, we're going to talk about age groups, the age of the customers. At any moment we can always do a quick sort here to get a reference. We know that the age of the customer is expressed in the actual years old they were, but they have also been categorized into four age groups: seniors, youth, young adults and adults. These categories were created to better understand the groups. With value_counts we can quickly get a pie chart out of it, or we can get a bar chart out of it, as sketched below.
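Something like this, assuming the column is named Age_Group:

```python
import matplotlib.pyplot as plt

# How many sales fall into each age group?
counts = sales['Age_Group'].value_counts()

counts.plot(kind='pie', figsize=(6, 6))  # pie chart...
plt.show()
counts.plot(kind='bar')                  # ...or a bar chart
plt.show()
```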
Doing this analysis of our data, we see that adults are the largest group, at least for our data. Moving forward, what about a correlation analysis? What is the correlation between some of our properties? We will probably have high correlation, for example, between profit and unit cost, or order quantity; that's kind of expected, but it's all something we can check right here. This is a correlation matrix. The diagonal, which is blue, is where correlation equals one (each column with itself), so high positive correlation is blue, and negative correlation is dark red. We see that profit has a lot of positive correlation with unit cost and unit price. And look, for example, here: profit has a negative correlation with order quantity, which is interesting; we would dig deeper into that. Of course, profit has a high positive correlation with revenue. Again, it's just a quick correlation analysis.
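A sketch of that correlation matrix; the colormap choice here is an assumption, not necessarily the one on screen:

```python
import matplotlib.pyplot as plt

# Pairwise correlation of all numeric columns; the diagonal is always 1.
corr = sales.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.matshow(corr, cmap='RdBu')
fig.colorbar(im)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
plt.show()
```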
We can also do a quick scatter plot to analyze customer age and revenue, to see if there is any correlation there, and the same thing for revenue and profit. This last one is obvious: we can quickly draw a diagonal here, so there is a clear linear dependency between these variables. Then a few more box plots, in this case to understand the profit per age group, so we can see how the profit changes depending on the customer's age; and a few more box plots, creating this grid for Year, Customer_Age, Unit_Cost, etc.,
for multiple columns. Moving forward: something we can quickly do when we're working with Python, and especially with pandas, is reshape our data or derive new columns from other columns. This is pretty common in Excel; we can create this revenue-per-age column. If you were here in Google Sheets, you would create a revenue_per_age column and type something like "equals revenue divided by age" (I don't remember if that's the exact formula we're using, but just for you to have a reference), and then extend it down the whole sheet. There we go... well, it's processing, and I have 100,000 rows, so you can see how slow it is. Let's compare that to the way Python works: I'm going to execute this thing... it was instant, extremely fast. And it was all calculated; it seems we have the same results as expected. We can quickly plot it both in a density plot and in a histogram, as you can see right there. Not that this revenue-per-age column is going to be relevant; in any case, it's just to show you the capabilities of what we can do.
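The pandas version of that spreadsheet formula is a single vectorized line (column names assumed from the dataset on screen):

```python
import matplotlib.pyplot as plt

# One vectorized operation computes the value for all 100,000 rows at once:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

sales['Revenue_per_Age'].plot(kind='density')
plt.show()
sales['Revenue_per_Age'].plot(kind='hist')
plt.show()
```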
Next we're going to create a new column, calculated cost, which is the quantity of the order times the unit cost: an extremely simple formula, a very fast process. And we're going to check right here how many rows had a different value than what was provided by the cost column. What we're doing is quickly checking whether the cost provided by the dataset at some point doesn't align with the actual cost we are calculating. Were there any mistakes made by, I don't know, the original system, or by people doing data entry? If this new column is different from cost, we want to know about it. And that doesn't happen.
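Roughly, assuming Order_Quantity, Unit_Cost and Cost are the column names:

```python
# Recompute the cost from quantity and unit cost...
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

# ...and count how many rows disagree with the provided Cost column.
# 0 means the dataset is internally consistent.
(sales['Calculated_Cost'] != sales['Cost']).sum()
```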
Then, again, a quick regression plot: in this case it's very obvious that there is some linear dependency between calculated cost and profit. Then more formulas; in this case, calculated revenue is cost plus profit. So we're adding a little bit more: there is no difference between the revenue and the calculated revenue we're getting, so it all makes sense. We're going to do a quick histogram of the revenue. We can, for example, apply a 3% increase to all the prices we are using; say we need to increase prices. How are we going to do that? Well, it's very simple with Python: we just multiply everything by 1.03, and now all the prices have changed.
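That across-the-board increase is one line:

```python
# Bump every unit price by 3%, in place, for all rows at once:
sales['Unit_Price'] *= 1.03
```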
What else? We can do quick filtering. Let's get all the sales from the state of Kentucky; so these are all the sales from the state of Kentucky. We can get only the average revenue of the sales for a given age group. All these filtering options are extremely simple to get with Python. In this case, we say: give me all the sales from this age group and also from this country, and we get the average revenue of the group we are selecting. And again, to modify the data, we can make just a few quick modifications, like in this case we're going to say: for all the sales from this country, we're going to increase the revenue by 1.1. I don't know why; I'm doing it arbitrarily, just to show you how it works. So far, so good.
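A sketch of those selections and that modification; the specific values (Kentucky, Adults, United States, France) are illustrative assumptions, so use whatever your data contains:

```python
# Filter with a boolean mask: all sales from Kentucky.
sales.loc[sales['State'] == 'Kentucky']

# Average revenue for one age group in one country:
mask = (sales['Age_Group'] == 'Adults') & (sales['Country'] == 'United States')
sales.loc[mask, 'Revenue'].mean()

# Modify a selection in place: a 10% revenue bump for one country.
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
```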
Again, we've done a couple of things, and you don't need to know all the details yet; we will actually go through them in the NumPy and pandas sections of this tutorial. This was just for you to have a quick reference. There are exercises associated with these lectures, so if you want to pause right now and get into the exercises, that's going to be very helpful. We're going to move forward now with the second lecture, in which we will be using a database, the Sakila database, and we're going to be reading data from a database instead of from a CSV file, as we did before. Reading data from a SQL database is as simple as reading it from an Excel file or a CSV file, as we were doing in our previous example.
And once you've read the data, which is what we're going to do now, the process is the same. What we have right here is a SQL query; if you don't know SQL, you can check our courses or other courses online. Basically, we're pulling the data from the database. This is one of the advantages of Python: there are connectors for pretty much every database provider out there, Oracle, Postgres, MySQL, SQL Server, etc. In this particular example, we're going to be using MySQL. Once you construct the query and pull the data from the database, the process is the same: we have just converted this external data into a DataFrame that we can use with our Python skills.
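As a sketch, using SQLite so the example is self-contained (with MySQL you'd build the connection with a driver such as pymysql instead; the query itself is just an example against a Sakila-style table):

```python
import sqlite3
import pandas as pd

# Any DB-API connection works here: MySQL, Postgres, Oracle, SQL Server...
conn = sqlite3.connect('sakila.db')

# read_sql runs the query and hands back a regular DataFrame.
df = pd.read_sql('SELECT * FROM film;', conn, index_col='film_id')
```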
The first step, as usual, is to check the shape, info and description of our DataFrame. In this case, we want to understand its structure again. We want to know how many rows we have: 16,000. We want to know a little bit more about our columns: how many records we have for each one of them, and the type of each one of these columns. And we also want a better statistical understanding of our data, so we do a quick describe and we get more details about it. If we want to focus on individual columns, we can do that too. In this case, we're going to focus on the film rental rate, which is pretty much
how much you pay to rent a film. We're going to see the kind of distribution it has; we can barely call it a distribution, since it's pretty much a categorical field in this case. Basically, the rentals are divided into three main price categories: 0.99, 2.99 and 4.99. That's why this box plot is pretty much perfect, a never-seen-in-real-life box plot: it gives you those three prices. Moving forward, we can also very quickly do a categorical analysis, understanding the distribution of rentals between cities. We have two cities, and it's pretty much even, as you can see right here.
Creating new columns and reshaping the data for further analysis is also relatively simple. In this case, we're going to analyze the return on rentals: which films are more profitable for the company, dividing the rental rate (how much we charge) by the cost (how much it costs us to acquire the film). We can see the distribution of that: most rentals are here at the beginning, and then we have more profitable rentals, making up to 60% above the cost. And we can quickly check the mean and the median of it, to get a quick idea of all that.
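Sketched out, assuming rental_rate and replacement_cost are the relevant Sakila columns:

```python
# Return on each rental: what we charge divided by what the film cost us.
df['rental_gain_return'] = df['rental_rate'] / df['replacement_cost']

df['rental_gain_return'].mean(), df['rental_gain_return'].median()
df['rental_gain_return'].plot(kind='density')
```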
Finally, selection and indexing. If you want to start focusing, if you want to zoom in on the data to get a better understanding, you start filtering. In this case we can filter by customer, but you could do it per city, per state, per film, per price category, etc. It's very simple to filter and zoom in on one particular characteristic of your data so you can perform a more detailed analysis. In this case, we have all the films rented by customers with the last name Hansen, which doesn't mean it's the same person; but again, it's very simple to filter that. And here we can very quickly see which films have the highest replacement cost; basically, we're isolating those films that have the highest replacement cost. And also we can see right here, just for you to have an idea, all the films that are in the category PG or PG-13. It's very simple to filter that data.
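These selections look roughly like this; the column names and values are assumptions based on what's on screen:

```python
# All rentals by customers with a given last name:
df.loc[df['customer_lastname'] == 'HANSEN']

# Films with the highest replacement cost:
df.loc[df['replacement_cost'] == df['replacement_cost'].max(), 'film_title']

# Films rated PG or PG-13:
df.loc[df['rating'].isin(['PG', 'PG-13'])]
```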
So this is the process we usually follow: we import the data, we reshape it somehow, we create columns. There is also an important process of cleaning that we're not highlighting in this part of the tutorial; we're going to talk about it in the tutorial itself. So: cleaning, then reshaping, creating new columns, combining data and creating visualizations. This is the process we're following here with our Python skills, but there is a ton more to add, as you might imagine, from creating reports to running machine learning processes, creating linear regressions, etc. For now, this is just a quick understanding of the process we follow. Starting now, we're going to move forward with more details of each one of the individual tools. We're going to talk about Jupyter notebooks, about NumPy, about pandas, about Matplotlib, Seaborn, etc. The first thing we're going to see is what this whole thing I've been using is: the Jupyter Notebook. If you don't have experience with it, I want you to get an idea of how it works, and then we're going to move forward to the individual tools: NumPy, pandas, etc. Remember, there are exercises associated with this particular lecture too, so you can always go back and work on them once you have a better understanding of the tools we are using.
Before we jump into the actual data analysis course, and we start talking about Python, pandas, all the tools we're going to use to import files, read data from databases, etc., I want to show you the environment that we work with. It's our primary environment, the tool we use 99% of the time, and it's the Jupyter Notebook. There are going to be different terms here; I'm going to be referring to it as Jupyter Notebook, but as you are going to see in this part of our tutorial, Jupyter is actually a whole ecosystem of tools, and it's a very interesting project. Jupyter is a free and open source ecosystem of multiple tools. We're going to talk first about what a Jupyter Notebook is, which is what you're seeing right here (and you're going to see it live in a second). We are also going to talk about JupyterLab, which is the evolution of the regular Jupyter Notebook. I think this could be familiar to you already. The usual question is: what's the difference between Jupyter Notebook and JupyterLab? Well, the difference is that JupyterLab is a nicer interface on top of Jupyter notebooks. It's not just the plain notebook (this is a notebook I'm scrolling through right now); it's also the addition of a tree view, git tools, a command palette, and multiple other things. You can open some files with a nice preview in it, etc. So JupyterLab and Jupyter Notebook are similar; JupyterLab is, again, the evolution of the Jupyter Notebook, and that's what we're
using. Again, Jupyter is a free and open source project, so anybody can install it, anybody can download it; it's very simple to get it set up on your local computer. In this case, we're using something called Notebooks AI, a project that provides Jupyter environments for free in the cloud. So you don't need to install things locally, and you don't need to keep things in sync on your own hard drive; that means you don't need to back it up, for example, because it's a service and it all works in the cloud. That said, I want to tell you that we have compiled a very quick list of everything we're going to cover in this part of the tutorial; it's just a thread with multiple hints on how to use Jupyter notebooks. So after the video, after the course, if you forget some of these concepts, you can always go back to it; it's a quick reference for you to have.

So let's get started. Why do we use a Jupyter Notebook?
Because it's an interactive, real-time environment to explore our data and do our data analysis. It's a tool where you fire commands and it immediately responds with something back: a very interactive tool for working with data analysis. And this is the main difference with some other tools like, for example, Excel or Tableau: we are not constantly looking at the data; there is no visual reference like you have in Excel. In Excel you're constantly looking at the data; you have it in front of you, there are 100,000 cells and you can scroll and see them. The problem is that that's not scalable: nobody can hold 100,000 rows in their mind; we will always forget something. So the way we work with Python in data analysis is by always having a reference of what our data looks like, but at the back of our heads; we're not constantly looking at it. We're like the operator in The Matrix, the one who gets people in and out: we're basically asking questions of the data and keeping a picture in our mind of what it looks like. We're not constantly looking at it; we just have a reference in the back of our heads. So that's why this tool
is very useful. This tool is also useful if you're just training your Python skills, or your programming skills in general, because what you're going to see is just a regular Python interpreter. In this case, I can execute some code: two times... actually one plus three, there we go, and the result is four. So this is a fully featured Python interpreter. The good thing is that, again, it responds pretty much immediately: I issue a command and I immediately get a response. I can do a print here, "hello world", and I immediately get a response; I can do "hello world" times three. Again, it's a fully featured Python interpreter, but it's not being accessed from a terminal. You could do that, and this is the good thing about JupyterLab: it has a terminal, so I can run python, type two times three, and get an answer back. But that is not convenient for working with our data; we need something a little bit more interactive, where we can also mix in documents. That's going to be the advantage of a Jupyter Notebook.
So what's the way we work with Jupyter notebooks? There are a few very important concepts that we are going to follow. A Jupyter Notebook is just a sequence of multiple cells; everything is a cell. As you can see, when I click on these cells, even if something doesn't look like a cell, it is: you will see that this blue marker right here is pretty much following me, because I'm clicking on and selecting that particular cell. Everything happens within a cell. If I want to execute some code I can do, again, one plus five, and I get a result back; that's how it works. I'm creating a cell, I'm deleting a cell, I create another cell again. Everything happens within a cell, and I'm going to tell you how to add cells, how to remove them, how to execute code, etc. The interesting thing about a cell is that it can either be Python code (or any other programming language you're using; in this case it's a Python data analysis course, so it will be Python code, as we were doing before with one plus three), or it can be what we call Markdown.
Markdown is a formatting language: the text you write gets rendered, more or less as HTML, in the output. In this case, this is what the source code of the Markdown looks like. In Markdown, any line that starts with a pound sign is going to be a title; with just one pound sign it's the biggest title you can have, and as you keep adding pound signs the size is reduced, so in this case this is a level-three title. And then you can have, for example: this is a quote, this is bold, this is italics, this is a link. Let me actually copy the cell and open the source code. There we go: this is a link, and you can see it's rendered as a link. So what Markdown is, is a text-formatting convention: we have some rules to use in our text, and Markdown knows how to interpret them and return a formatted document. For example, here we have a green divider, which is a picture, and we know it's a picture because it starts with an exclamation mark; that's what you're seeing right here. So again, a cell can be either Python code, or it can be Markdown.
Markdown is an entire thing on its own. You can find plenty of free tutorials online; it's fairly simple to get started with. And it's also very important, because when you're creating your reports, you want them to look pretty, and you can use Markdown for that. As we're going to see later, you can export these notebooks and they will generate PDFs or HTML pages. So after you're done with your data analysis, you can hand whoever asked for the analysis a PDF report, which is pretty neat. So, moving forward: again, any cell is going to be either Markdown or code. This one is code, and you can switch the types: I can say this cell is code, or actually, let's make it Markdown. So right now, even though it contains code, it doesn't execute anything, because the cell is interpreted as Markdown. Now I switch it back to code, and it works again. A cell can also be raw, but to be honest, we don't use raw very often. So again, every cell has a type: is it code, is it Markdown? You can switch it with this selector right here.
A few more things that I have to tell you right away, so you can start internalizing them. It's going to take some time to get used to, but once you do, you're going to move very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as you're seeing right here, every cell is given an execution number. Cells will be moved around (you will be moving them around), but you will always know which one executed before another, because every execution you run is assigned an execution number. In this case, this is the seventh time I have executed code. If I execute code again, for example two times two, this is the eighth time I've executed code. And if I move this thing right here, and you're reading the notebook top-down, you will not be fooled: you will understand what happened. The cell was moved, the structure of the notebook changed, but this cell was executed after this other cell: this one is eight, and this one is seven. So the execution order is always preserved. That's an important thing.
Something else: you're seeing me change the structure and do things with the notebook without using any menu. That's because I know the keyboard shortcuts to run most of these commands. So, for example, how can I add a new cell? Here I have a Markdown cell and a code cell; if I need a cell before this one, what's the command I issue to create it? In this case, the command is the letter A: I just press A, and there is a new cell created. How can I delete a cell? It's the D key two times. And again, this is all in the reference we've built. For example, right here, you can press A to create a new cell above, and you can press B to create a new cell, what we call below. So let me put something here, a reference cell, and I'm going to press the letter B: it's going to create a cell below the currently selected one. The selection here is the blue one; let me delete this one. I press B, and again, it creates a cell below the previously selected one. If I press A, it creates a cell above it. So those are the mnemonics of cell creation: A for above, B for below.
Something else, and it's very important: why, when I'm in this cell and I hit the letter A (literally just the A key, no Control, no Command, just A), does it create a new cell instead of typing an "a" inside the document? Because right here, if I type A, it adds an actual "a" character to the cell. Why didn't that happen before? You're going to notice that when I change to what I'll call command mode in a second, the content of the cell is grayed out, so that when I press the letter A it actually creates a cell and doesn't add content to the cell itself. If I go back to the other mode (and I'm going to give you a better explanation in a second) and I type anything, in this case "a", it's actually appended to the text within the cell. So this is my introduction to cell modes, and it's very important.
The Jupyter Notebook is a mode-based editor. There are other mode-based editors, for example vim or vi; in those, the behavior of your keystrokes changes depending on which mode is currently active. In this case, I am in edit mode, because any character that I type is appended to the cell: a, b, c, d, etc. If I switch out of edit mode into what we're going to call command mode, the cell is grayed out, and any key that I hit does something different associated with that key: A creates a new cell above, B creates a new cell below, double-D deletes the cell. That's one of the most important things to understand in order to work with Jupyter notebooks: the mode you're currently working in. And there are only two modes, so it's fairly simple. This is command mode, and we recognize command mode because the cell is grayed out. When we get into edit mode, there is a regular prompt, as you saw before, and the content of the cell is actually subject to editing. That's the way we can recognize it. So how are you going to switch between modes?
In this case, I'm in edit mode. If I'm using my mouse, I can click outside the cell to get out of edit mode into command mode, and click inside to go back to edit mode. But let me tell you something right away: we don't like to use our mouse; we don't like to point and click, because that's very slow. We like to use our keyboard; we move very fast with our keyboard. So how are you going to switch from edit mode back to command mode? That's going to be the Escape key: Escape switches you out of edit mode into command mode. And if you actually want to make modifications to the cell, basically to get into edit mode, you're going to hit the Return key; that gets you into edit mode again. So we have tackled multiple things already.
Again, we said that in Jupyter notebooks we're going to use Python code to interact with our data very quickly; we need a real-time, I-ask-you-answer type of editor, and that's what the Jupyter Notebook is. The Jupyter Notebook has these two modes, edit and command mode. And then the cell, which is pretty much everything, is the fundamental part of the notebook; a cell can have two types, either code or Markdown. Now I'm going to start showing you more features and the most important commands, and of course what the keyboard shortcuts for those commands are, so you can move freely and work with Jupyter notebooks in the most efficient way. So let's get started.
First of all, one of the most important commands is moving around. It's very simple to navigate: just use your arrow keys, up and down, and you'll move around your notebook. If you want to switch the cell type, going from Markdown to code and back, you can use this dropdown, or you can press a specific key: for Markdown, you hit the M key; for Python code, you hit the Y key. So M and Y switch you back and forth. Keep an eye on the selector: if you hit Y, M, Y, M, it switches between code and Markdown. What else? How can you execute code?
Once you've typed your code and you want to execute it, there are two types of execution you can run. The first one keeps the selection in place: the currently selected, active cell stays the same. That's done by keeping the Ctrl key pressed and hitting Return; it runs the code in the cell, and the currently selected cell remains the same. I've run this thing a couple of times already, and the selection, the currently highlighted cell, stays the same. I can change that by using Shift+Return: I keep the Shift key pressed and hit Return, and it executes the code but immediately moves the selection to the following cell. That's useful when you have multiple cells you want to execute one after the other: you can keep hitting Shift+Return, Return, Return, and it keeps you moving from top to bottom. All right, so: Ctrl+Return or Shift+Return. The execution is the same; the difference is just what happens to the currently selected cell. We already saw how to create cells: with the A key we create a cell above, with the B key we create a cell below. To delete a cell, you hit the D key two times, one after the other, very quickly; dd deletes the cell. What happens if you made a mistake and you want to undo the previously issued command?
Well, the mnemonic here is Ctrl+Z; you know the mnemonic, but it's not the actual command. You only need to press the Z key, no Ctrl, and it undoes whatever you did in your previous command. All right: A, B, double-D for deletion, and then Z to undo. All the commands we're mentioning have a correspondence in this toolbar, or in this command palette. For example, right here, I could run this code by pressing this play button; you see it, the execution number is changing. There are multiple commands, and you can search for them if you don't remember them, right here. And the neat thing about it is that it actually shows you the shortcuts to issue the same command. So let's say you don't remember how to "execute and stay in the same cell", or "execute and move": you can search for "run", and you can see what the name is and what the actual shortcut for the command is, right there. At least for your first days or month working with Jupyter notebooks, you will usually need to go back to these commands and try to remember the quick shortcuts; with time and practice, they will just come naturally. So, moving forward: what else?
We have a few other commands; in this case, something to cut and paste a cell somewhere else. That's X to cut it (you can also use the scissors icon here), and to paste it, you can use this button, or you can just press the V key: V pastes it wherever you're currently standing. So I'm going to cut it, removing it from here, and I'm going to paste it below, there. Or you can copy it: instead of cutting, you press the C key to copy, and then you can say where you want to paste it. In this case, we have duplicated the same cell. And look at something interesting here: the execution count remains the same. Again, there is this unique identifier for your executions, which means that you know when and where something was executed. Moving forward, we're going to run some code here; we're going to import some tools so you can see some characteristics and advantages of Jupyter notebooks, and why we use them so often compared to, for example, the regular Python terminal. One very important thing is visualizations.
As data analysts, we're constantly taking data and expressing it through images, or sometimes animations, but most commonly images. The main library we use in Python is Matplotlib, and Matplotlib is a first-class citizen in Jupyter notebooks, which means that the figures you create with Matplotlib show up directly in your notebook without you needing to do anything crazy. Can you imagine showing this beautiful picture in a terminal? That's very hard, of course. So again, that's one of the main advantages of a Jupyter Notebook.
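For example, a cell like the following renders its figure right below itself. The %matplotlib inline magic is needed in classic notebooks; recent JupyterLab versions usually do this by default:

```python
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x))  # the figure appears inline, under the cell
plt.show()
```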
Moving forward, what we're going to do first is get some data from a public API. There is this Cryptowatch service, which basically has crypto information: Bitcoin, Ether, etc. You can check the docs; we can actually open them. It's going to give you market data. In this case it's BTC (Bitcoin) to euro; let's see if we can change it to the USD price. There we go: this is the current price of Bitcoin. And we're actually going to check the markets: do we have Kraken, BTC/USD? Let's actually issue the same query we're going to use, which is open, high, low, close: OHLC. Don't worry, this looks ugly, but it is actually what we're using: a list of results for all the different candles, as we call them, where we get the open price, close price, high price and low price. So we're going to issue these requests to this API, the Cryptowatch API, to get information about Bitcoin and do some analysis; you can also get it for Ether and other cryptocurrencies. The function we're defining is get_historic_price. It's a very simple function that uses pandas, one of the most important tools we're going to be using in this course, and the requests library, which is also a very famous Python library.
right? So depending on when I'm shooting this video, and we're gonna get a quick reference
of the prices open, high, low, close. So in this case, we have four information per hour.
Okay, so this is something you can actually change in the in the, in the request you're
making to the API, you can reuse the candles eyes. In this case, we're keeping it per hour.
So we have by the hour information about Bitcoin, in this particular market, which is bitstamp.
Here, we have these day these day, and these are right, when I'm in the morning, open,
close, highest price and lowest price, and also the volume that was operated within this
time period. And we're gonna immediately plot the price. So we see that in these time, which
I think is an entire day, we the price dropped, it's actually a few days, like an entire week,
the price dropped from $9,600 below, right 9000. So it was a pretty significant drop.
Let's see ether highperformance. We have here all the records, and how it moved. So this
is what I tell you that when you're doing data analysis with programming tool like Python
rar, you're not constantly looking at the data. So what I'm showing you right here are
the first five records, we actually have. Let's do that. We actually have 169. Records,
okay, 169 Records. And this is per hour. So if we do 169 hours divided by 24 hours, we
have seven days, right? So we have seven days of data 169 Records, and then we have a little
bit more information keeps this to go. I'm gonna get to that in a second. But basically,
Basically, with 169 records, to be honest, this is something you could be seeing in a spreadsheet, but I want you to get the concept here. We're not just looking at our data: we have it in our brain; we know what shape it has, we know how many records it has, we know its standard deviation and its mean. What was the standard deviation of the close price, the average, the mean, the median? We have that information about our data sitting in the back of our brain, but we're not looking at it. And this is a very simple example, with only 169 records; in real life we're dealing with millions of records, so it's impossible to see it all. Have you ever tried scrolling through millions of records in an Excel spreadsheet? It's crazy; it's not possible, just unusable. So that's, again, the way we work in data analysis with Python and R and other tools: we don't constantly keep an eye on the data. We know the shape of it, and we just take these quick references: show me the first five records, show me the last five records, show me this chunk down there, and that's it. So again, these are the visualizations we're creating in Jupyter notebooks; it's just very simple to get the plot done right there.
We're also going to see a few other pretty neat things in Jupyter notebooks. The first one is that we can use another library, called Bokeh, and the difference is that Bokeh produces charts that are interactive. I'm moving it right here; it uses JavaScript, and it's interactive. If you look back at what we had before, that was a static chart: it's just a PNG (you can actually export it as a PNG), and there is nothing you can do with it. With Bokeh, it's a dynamically generated, interactive chart. I can zoom in on a piece of data, I can move it around, I can do whatever I want with it, and I can refresh and reset it to whatever it was. The difference is: if you're working with data dynamically, in your exploration, then Bokeh is a fine tool, because you can zoom in. What's going on here? Let's look at these things. If we're working on a mean-reverting strategy, for example, we see a high volume, we see a low volume, the mean is going to be here, so we see some mean reversion in there; it's very interesting. If you need to, for example, export a PDF or a huge HTML file, then static images are probably better. That's the difference between them. To be honest, Matplotlib is a lot more popular than Bokeh; we use Matplotlib a lot more, because we actually have a few other tools, like Seaborn, that make it very easy to use.
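A minimal Bokeh sketch, reusing the hypothetical btc DataFrame from before; Bokeh's sizing arguments have changed names across versions, so treat this as approximate:

```python
from bokeh.io import output_notebook
from bokeh.plotting import figure, show

output_notebook()  # tell Bokeh to render inside the notebook

p = figure(x_axis_type='datetime', title='Bitcoin close price',
           width=800, height=300)
p.line(btc.index, btc['ClosePrice'], line_width=2)
show(p)  # interactive: pan, zoom, and reset from the toolbar
```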
What else? Jupyter notebooks work very well with all the usual file formats: CSVs, XML, Excel files, etc. And that's also an asset of JupyterLab: JupyterLab can immediately interpret and open CSV files, and with some extensions, XLS files, XML files, JSON files; it has a very nice editor and tree view for JSON. So the JupyterLab environment, combined with Python and Jupyter notebooks, gives you a good idea of Jupyter in general. In this case, we have just saved... I'm not going to execute this, but you can try it out: you can execute and run what we have just done, and export this crypto file as an Excel spreadsheet. You can just click on it here, and you can basically download it, open it and see what it has.
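The export itself is a couple of lines; a sketch, with the file and sheet names as assumptions:

```python
import pandas as pd

# Write both DataFrames into one Excel file, one sheet per currency.
with pd.ExcelWriter('cryptos.xlsx') as writer:
    btc.to_excel(writer, sheet_name='Bitcoin')
    eth.to_excel(writer, sheet_name='Ether')
```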
There we go; let me reduce the size of this thing. You can see that we have just exported two sheets, in this case Bitcoin and Ether, with the data that we had in our previous notebook. So that's, again, the combination of Jupyter, Python and JupyterLab; the tools just work very well together. We're going to keep moving forward in this tutorial, talking about more data analysis in general. We're going to do a quick review of Python; maybe when I was running these commands, you felt a little bit lost about what I was doing, so we're going to review Python and all that. And of course, we're going to get deep into data analysis with pandas and the other tools. I want to tell you something before we finish this chapter, and it's very important: get familiar with Jupyter notebooks, because you're going to spend a ton of time with them. It's a very, very valuable skill if you get proficient and comfortable with Jupyter notebooks: creating cells, deleting cells, cutting, pasting, moving things around, etc. For generating reports, Jupyter notebooks are going to be excellent. So keep an eye on it; keep practicing, it's the only way to learn it. Keep the command palette open, so if you forgot, say, how to cut a cell, well, there it is: X; it tells you upfront. Keep working with it and practicing it, and once you get familiar with Jupyter notebooks, you're going to move very, very fast. Remember, there is this nice compiled list of commands and references you can always access if you need extra help. And we're going to keep moving forward now with more data analysis.
with more data analysis. Now it's time to talk about NumPy, one of
In general, it's the one that got pretty much everything started: if you trace it back, NumPy is a very old library, around 20 years old. It's an extremely important library, more important than it is popular, and I'm going to explain what I mean by that in just a second. NumPy is a numeric computing library; its job is to process numbers and calculate things with numbers, and that's it. So NumPy has a very limited scope, we could say, and that's on purpose: it's a very simple library when you look at it, with a very consistent API, by the way. Why is NumPy so important, then? Well, in pure Python, numeric processing is very slow. Python is not slow in itself compared to other programming languages, but when you go down to very deep levels of performance, when you're processing large amounts of data and you need to squeeze every FLOP out of your CPU, then pure Python is not the right tool. NumPy solves that: it's a very efficient numeric processing library that sits on top of Python and gives you practically the same API as writing plain Python code, but at a low level it uses high-performance numeric computations and efficient representations of arrays of numbers. That's it; NumPy is extremely simple from an API perspective, but extremely powerful. Why did I say that it's not so popular, yet so important? Because in reality we don't usually employ NumPy directly. You will not see yourself using NumPy all that often; instead you'll be using other Python tools, like pandas and Matplotlib, which are all built on top of NumPy, all relying on NumPy for their numeric processing. That's why NumPy is so important.
For this part of the tutorial, I'm going to divide NumPy into two pieces. The first is a very detailed, low-level explanation of how NumPy works, why we need it, and what the differences are between different bit sizes for numbers. We're going to talk about integers, but this all applies to decimals and other data types as well, and to why you need numbers with a very low-level, optimized representation. You can skip this part: you'll find in the description of this tutorial the precise moment in time, so you can jump directly to the second part, which is when we actually start using NumPy, and I show you how to create arrays, how to make computations, etc. So we're going to start with the low-level explanation, which you can skip if you want, because it's not crucial; you can easily use NumPy without it. We have found that for some of our students it's important to understand the low-level basics, especially if you don't have a computer science background: it can help raise your level of understanding of computers and of how to make your computations more efficient. But don't worry if you don't want to go through that now; it's fine, you can skip this part and come back at any other moment. You don't need it to use NumPy, seriously; it's going to be beneficial, but it's not absolutely necessary. With that said, let's go into a deep understanding and explanation of how computers store integers in memory, and what bytes and bits are. In order to understand why NumPy is so important, we have to go back to the basics: what numbers are, how they are represented in computers, etc. As you might know already, a computer can only process ones and zeros, bits; it can't process decimal numbers directly. A computer is always storing and processing ones and zeros; it's a binary machine.
The random access memory (RAM) in your computer is the central place where it stores the data it's actively processing. You also have, for example, a hard drive, which stores long-term data, but the computer can't process data directly from the hard drive; before doing that, it has to load it into RAM. A computer will usually have, what, eight gigabytes of memory, sixteen, thirty-two; it doesn't matter. Let's say you have eight gigabytes of memory: that translates to a fixed number of bits your computer can store. If you follow the math we have right here, you can see the total number of bits available in a regular computer with eight gigabytes of memory. Why is this important? Because the objective of this part, at least, is to explain how you can squeeze out every single bit you can in your computer. How can you make it more efficient for your numeric processing, both in storage, using less memory for the same data, and in speed for your calculations? To optimize toward the least amount of memory for a given problem, we need to understand how integers in the decimal numeric system are represented in binary. The table right here shows the first few numbers, 0, 1, 2, 3, 4, etc., and their binary representations. Let's say you want to store the age of a user, which is 32. You can't store the decimal "32" directly, because your computer doesn't know about decimals; it only knows binary. To do that, you need to find the correct representation of 32 in ones and zeros, which is not the one I'm pointing at, to be honest; I'm making it up as we go. But again, you need to know the correct binary representation of this number in order to store it. How can you know that?
Well, there is a whole part of math dedicated to binary arithmetic. It doesn't matter for now, but I'm going to give you the intuition so you have a better understanding, and if you're interested, you can dig deeper later. Basically, any decimal number needs to be stored in a binary format, which of course only takes ones and zeros, and what we do is keep adding positions of zeros and ones. In this case, we have the number zero and the number one; that's fine. Once we need to store the number two, we have to add a position: two becomes one-zero. The number three is one-one, and for the number four we need to add a position again, because we only have two symbols, zero and one. So as you can see right here, up to this level we need only one position; up to this level, two positions; at this level, three positions; and this level is going to need four positions. You can see how the range covered at each size keeps increasing, and there's an explanation behind that which we're going to see in a second. So the question is: how many decimal numbers can you store with n bits? Let's say n equals three; that means you only have three positions, three bits. How many total decimal numbers can you store with them? Well, we can store 000, which is zero; we can store 001; we can store 010; and so on. At this size, we can store up to 111, which equals seven: once we've filled all the positions, we've reached the limit, the largest binary number for this number of positions, and that's the number seven. This means that with three bits you can count from zero up to 111: in total you can store eight decimal numbers, 0, 1, 2, 3, 4, 5, 6, 7. The equation behind this, if you want, is as follows: if you have n bits, the number of decimal values you can store is two to the power of n. So if we go back to our drawing, we said that with three bits we can store up to eight decimal numbers, and the equation two to the power of n gives you exactly that. You can always do the opposite process using the logarithm and get how many bits you need to store a given decimal number. I'm not going to get into that, so we don't complicate things, but the math behind it is extremely simple.
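To make that arithmetic concrete, here is a small Python sketch of the same ideas (not from the video, just illustrative):

```python
import math

bin(32)           # '0b100000' -- 32 takes six binary digits
int("100000", 2)  # 32 -- and back again

n = 3
2 ** n            # 8 -- with n bits you can store 2**n values: 0 .. 2**n - 1

x = 120
math.ceil(math.log2(x + 1))  # 7 -- bits needed to store values up to 120
```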
So now, let's erase this whole drawing and move forward. Why is this important? When you're doing your data analysis, you know what type of data you're working with. They're all numbers, but numbers usually have a connotation behind them. Let's say you have a table of people, with the total net worth of each person and also their age. Age is a value that will range between, what, zero, just born, and, I don't know, 120; I don't know the maximum age registered for the oldest human being, but zero to 120 seems reasonable. For the other column, net worth, the range is completely different: we can go from something like $0 up to, I don't know, $60 billion, for Mark Zuckerberg or Jeff Bezos or one of those. So we go from zero to tens of billions, and that's if these are dollars; what happens if it's a highly devalued currency? We'd have to go to trillions. So even though these are both plain numbers, both integers, they have different connotations, and they have different requirements in terms of storage size. If we say that age goes from zero to 120, we don't need that many bits to store it in memory. We can actually do the math: how many bits do we need to store up to 120? If you do the math, you'll see that two to the power of seven is 128. So if you have seven bits, you can store from zero up to 1111111, which is 127: that number, seven ones in binary, equals 127 in decimal. In total we can store 128 numbers, from 0 up to 127. That means that for our age column, the minimum size we need is seven bits per user, or customer, or person, whatever. What about the other number, if we have to go up to a couple of billions? In that case it's a little more complicated: we're going to need, say, 32 or 64 bits; with 32 bits you can store values from zero up to about four billion. I don't know the currency we're using, so let's just assume 32 bits is enough to store it. And now you can do the math on how much memory you need to process this data: how many records do you have? If you have only 1,000 records, that's not significant; you can use 64 bits to store the age and you're not going to have a problem. But what happens if you have more? What if you have the entire population of the Earth, 7 billion records? Then every bit you save in these columns is going to be important, because it adds up to a ton of data, and of course you'll have many more columns. What happens if you're processing trillions of records of financial transactions? You want to be very efficient and optimize every single bit you can. And that means, again, selecting the correct number of bits for the columns you're processing. So far, so good: the decimal numbers we need to store have a correspondence in bits (eight bits is one byte), and the more we optimize that, the less memory we're going to use for our applications.
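A quick sketch of how that adds up with NumPy dtypes; the record count is illustrative:

```python
import numpy as np

n_records = 7_000_000_000  # roughly the population of the Earth

np.dtype(np.int8).itemsize   # 1 byte per element -- enough for ages 0..127
np.dtype(np.int64).itemsize  # 8 bytes per element -- the 64-bit default

n_records * np.dtype(np.int8).itemsize / 1e9   # ~7 GB for the age column
n_records * np.dtype(np.int64).itemsize / 1e9  # ~56 GB for the same data
```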
Where does NumPy come into play? Why are we talking about data sizes in these NumPy lessons? Well, the idea is that NumPy is a library with very advanced numeric processing that lets you select the number of bits you want an integer to take. Even more: forget about NumPy for a second, and say you want to do this with pure Python, x = 5; you're storing an age of five. How many bits do you think that simple variable takes in memory? Well, in reality, even though we might think it should be around three bits, or eight to keep it simple, in Python this is going to take around 28 bytes. So we're wasting a ton of memory to store this number. Why is that? Because Python is a high-level, object-oriented programming language. The reasoning behind it is that Python is simple to write, simple to read, and simple to build on top of, but in order to create that simplicity, it wraps all numbers in objects, which carry a bunch of attributes that, if you know advanced Python, you'll recognize are not necessary here. So this takes a ton of memory: a regular, very simple number in Python ends up consuming many times more memory than it needs to. This is where NumPy comes in: in NumPy, you can create numbers whose size in bits you control. You can say, I want to create a number that has only eight bits, and that's it: you're going to create a one-byte integer, and you're very precise about how much memory it takes. If you need a number with a little more room, you can write np.int and hit Tab, and you'll get autocompletion: 16-bit, or 8, or 32, or 64. So we can be a lot more precise about the number of bits we need, and this is extremely important for high-performance processing.
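As a minimal sketch of that size difference (exact getsizeof figures can vary slightly by Python build):

```python
import sys
import numpy as np

sys.getsizeof(1)    # 28 -- bytes for a plain Python int, object overhead included

np.int8(5).nbytes   # 1 -- one signed byte, values -128..127
np.int16(5).nbytes  # 2
np.int64(5).nbytes  # 8
```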
On top of that, NumPy is an array processing library; NumPy is 99% about processing arrays. The built-in data structures we have in Python, for example the list and the dictionary, are not optimized for high-performance computing. If you have a list of numbers in Python, let's say l = [3, 2, 4], three numbers in your list, Python does not guarantee that the list will keep the numbers 3, 2, 4 in contiguous positions; it might put them in separate locations in memory. On top of that, you can't rely on advanced CPU directives and instructions for processing matrices, because Python, again, is wrapping these things in objects, so there's no access to those high-performance, low-level instructions. With NumPy, that changes: when you create an array in NumPy, you say, I want an array of three numbers, and they are all int8. (Forget about the exact boxes in this drawing; these are not literal bytes, I'm using the drawing as a general representation of memory.) In that case, when you create this three-element int8 array, NumPy creates the three elements in contiguous positions in memory, 3, 2, 4, and they take only the amount of memory you said they would take. On top of that, we can rely on a bunch of very efficient, low-level instructions from your CPU for matrix calculations. This is something a little more advanced, and it's something that has exploded in the past ten years: CPUs with richer instruction sets, and the same thing for GPUs, which you might have heard about especially with machine learning, where we need fast array processing for storing features and weights and all that. But that's a topic for a different story. The idea is that we can use all these important and very efficient low-level directives from our CPU, which makes our computations a lot faster. So again, as a recap: you don't need to know all of this to work with NumPy; that's the first thing.
Second, you don't need to be obsessive about every number type you use. At the beginning, you're just going to use NumPy as it is, with the default types it picks, int32 or int64, and that's fine. Then, when you hit bottlenecks, when you're working with larger amounts of data, you might need to get into the details of the size of the integers you're using. And this all applies to floats too; I'm just using integers because they're simpler. So again, NumPy's main advantages are that it has very fast built-in arrays that take advantage of CPU instructions for matrices, and that it has very efficient representations of numbers, not the regular objects of Python. Again, as a recap, you don't need all of this. If you want to get into more detail, I recommend getting a bit more understanding of binary arithmetic and of how numbers are stored in memory on your computer's architecture, especially for floats, which have a completely different representation. With that said, we're now going to see how we actually use NumPy, without worrying so much about the low-level details; that's the beauty of NumPy. So, we've already done our low-level explanation of binary arithmetic and why NumPy is important; if you skipped it, that's perfectly fine, you won't need it. The reason to include it was that if you're in this tutorial, you're probably looking for fast and efficient options for processing large volumes of data, and that's when all those things come into play. So, without further ado, let's get started using NumPy as a library. As I told you, NumPy is a very simple library for array processing and numeric processing. It has a few objects, integers, floats, arrays, and that's it; it's very simple, but it's extremely powerful.
In NumPy, we're going to create these arrays, which look a lot like Python lists, but with significant differences. The first one, of course, is performance: if you go back to the previous part, where we discussed the binary representation of an array of numbers in Python and in NumPy, you'll see the difference between them. In this case, we're creating two arrays, and you'll see that creation is extremely simple: the only thing that changes is we add np.array, and then we pass, in this case, a list of numbers. (In practice, this is data we'll usually be reading from external sources.) How can you access individual elements of a NumPy array? It works the same way as with a Python list: give me the first element, give me the second element, and it's zero-indexed, again like a Python list. Slicing works the same way, so a[0:2], a[1:3], and so on, with lower and upper limits on the index; negative indexing and steps all work the same way as with a Python list. So if you know how to use a Python list, you know how to use a NumPy array. There is one new thing here that's different from a Python list, and it's what's called multi-indexing. Say you have an array, in this case b, and you need to extract three elements out of it: the element in the first position, the third position, and the last position. You could type b[0], b[2] and b[-1] separately, and that also works for a list. Or you can use multi-indexing: from b, I want to select the elements at indices 0, 2 and -1, the first element, the third element and the last element; you pass another list containing the indices of the elements you want to select. And in this case, the important part is the result: it's another NumPy array, not just individual elements. You're creating another NumPy array which, again, if you keep processing it, is going to be a lot faster.
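A small sketch of the list-style indexing and the multi-indexing just described (illustrative arrays, not the notebook's exact cells):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([0, 0.5, 1, 1.5, 2])

a[0]           # 1 -- zero-indexed, just like a Python list
a[1:3]         # array([2, 3]) -- slicing, upper limit excluded
a[::2]         # array([1, 3]) -- steps work too
a[-1]          # 4 -- negative indexing

b[[0, 2, -1]]  # array([0., 1., 2.]) -- multi-indexing; the result is a new array
```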
Arrays have types associated with them, and this is related to what we were discussing before. Since a NumPy array is assigned contiguously in memory, the NumPy library needs to know the type of the objects you're storing; you can't just throw anything into it, a string here and a number there, because then it couldn't provide performance optimizations for arrays of inconsistent sizes. For example, when we created an array containing only integers, NumPy selected int64 by default; that's because this is a 64-bit platform. You can tune this and select other sizes, as we'll see in a second. When we created the array b, which contained decimals, it assigned a different type, float64. Again, the default type, at least on this 64-bit platform, is float64 for decimals and int64 for integers. You can always change that: you can say, even though these are all integers, I want you to create them as floats; or, as we saw in our previous video, you can say they should be small integers, like int8, for better performance. Moving forward, we're also going to see a few other types, for example strings and regular objects. But as you'll see, there's little point in storing those in NumPy: NumPy stores numbers, dates and Booleans, but not regular individual objects like the ones we're seeing here. There is a way to store strings, it's perfectly valid, and it has its own type, related to the Unicode representation in memory, but again, NumPy is usually used for numeric processing.
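A minimal sketch of those dtype defaults and overrides:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
a.dtype                             # dtype('int64') -- the default on a 64-bit platform

b = np.array([0.5, 1, 1.5])
b.dtype                             # dtype('float64')

np.array([1, 2, 3], dtype=float)    # force floats: array([1., 2., 3.])
np.array([1, 2, 3], dtype=np.int8)  # smaller integers, one byte each
```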
The idea of NumPy arrays is that we can create multi-dimensional arrays. What we created before was a one-dimensional array, just one dimension; we can also create matrices, which in this case are two-dimensional, with two rows and three columns. And NumPy has a ton of attributes and functions for working with multi-dimensional arrays. The first thing we'll look at is the shape of an array, which here is two rows by three columns; then how many dimensions it has, one vertical and one horizontal, so two dimensions; and the total size of the array, in this case six, the total number of elements. Let's go one dimension further and create a three-dimensional array, which is basically a cube. In this case, for B, the shape is two by two by three, the number of dimensions is three, and the size, the total count of elements, is twelve. You always have to be careful when creating these multi-dimensional arrays: if the dimensions don't match, like in this case right here, where the second list has one element fewer in it, then NumPy will just tell you that the array is of type object, and the shape only has two elements: this is one element, and that's the other. In that case, we've done it wrong, basically, and you have to be careful when creating these objects by hand.
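A sketch of those attributes:

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
])
A.shape  # (2, 3) -- two rows, three columns
A.ndim   # 2
A.size   # 6

B = np.array([
    [[12, 11, 10], [9, 8, 7]],
    [[6, 5, 4], [3, 2, 1]],
])
B.shape  # (2, 2, 3)
B.ndim   # 3
B.size   # 12
```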
So how can you index and slice matrices? We've done it for a one-dimensional array, selecting individual elements: give me the first element, give me the second element, etc. With a matrix, what we do is very similar; the difference is that now we have to account for multiple dimensions. When I write A[1], is it the column at 1, or the row at 1? As you can see, it's the row. So the rows are indexed 0, 1, 2 along one dimension, and there's another dimension, also indexed 0, 1, 2, for our slicing. Now, how can you get the first element of the second row? You could first select the second row, and then select its first element, and you get the number 4. But there is a better way, using NumPy's multi-dimensional selection. In this case, you say: from this matrix, I want to select, and you pass one selector per dimension: dimension one, dimension two, dimension three, and so on. Here we say: at the row level, the element in position one, the second row, and at the column level, the first element in it. It's the same result as before. The advantage of this indexing, and it's worth keeping in mind and remembering, is that it also lets you add slicing. So you can say: I want to select everything from dimension one, which is rows, from zero up to two, these two rows; the two is not included, an exclusive upper limit, the same as in Python. And then you can also pass selectors for the other dimensions: I want every row, that's fine, but at the column level I only want the elements up to index two. So these two, and these two, and these two: 1, 2, 4, 5, 7, 8. This all works as intuitively as it gets; just remember this syntax, it's the important thing to keep in mind. Moving on to modification: you can assign a new array to an entire row, and if the dimensions match, that works; the new values replace the second row. Or you can use what we usually call an expand operation: you just say, for row number two, I want to assign the number 99, and NumPy takes care of expanding it into the corresponding array, given the dimensions you have. So far, selection is simple.
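A sketch of the matrix selection and modification just described:

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

A[1]       # array([4, 5, 6]) -- row at index 1
A[1][0]    # 4 -- works, but the multi-dimensional form is preferred:
A[1, 0]    # 4 -- row 1, column 0
A[:2, :2]  # upper-left 2x2 corner; upper limits excluded, as in Python

A[1] = np.array([10, 10, 10])  # replace an entire row
A[2] = 99                      # the scalar is expanded to fill the row
```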
What we'll see now is that NumPy has the huge advantage of containing a ton of operations you can perform on top of your arrays and matrices, your multi-dimensional arrays in general. The first group is the basic summary methods. Given an array, these methods are all built in: the sum, the mean (the average), the standard deviation, the variance, etc. And that also works for matrices: we can get the sum, the mean, the standard deviation, or we can do it per axis, which is very useful. Let's compare these two: we can get the sum of the first column, the second column and the third column, or of the first row, the second row and the third row. So it's either one dimension, axis zero, or the other dimension, axis one: per column or per row. And if you have more dimensions, you can keep increasing the axis number, and it will work as expected.
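A sketch of those summary methods and the axis argument:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
a.sum()   # 10
a.mean()  # 2.5
a.std()   # standard deviation
a.var()   # variance

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
A.sum()        # 45 -- over all elements
A.sum(axis=0)  # array([12, 15, 18]) -- one sum per column
A.sum(axis=1)  # array([ 6, 15, 24]) -- one sum per row
```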
Broadcasting and vectorized operations are a fundamental topic we're going to talk about now, and they're going to be closely related to Boolean arrays; these are a few new things you have to keep in mind when working with NumPy. Vectorized operations and broadcasting can be a counterintuitive topic at the beginning, but then you're going to understand how much sense it makes; it's one of the fundamental pieces of NumPy. We've seen how NumPy works in a very general way, the multi-dimensional arrays and all those advantages, but you might be thinking: I don't need another library just to compute a sum or a mean. When I show you the vectorized operations and broadcasting part, it's going to make a lot more sense why NumPy is so important. To get started, we have this array a, just a very simple array. Vectorized operations are operations performed between arrays and arrays, or between arrays and scalars, like in this case right here, and they're optimized to be extremely fast. In this case, we're going to sum the entire array plus 10. What that means, and let me show you the result, is that the same operation is applied to each one of the elements within the array. That's the concept of vectorizing an operation: you have the operation, and it's applied to each one of the elements, here and here and here and here, to produce a new array. The operation is expressed at the array level, we just say a + 10, but internally it is broadcast to each individual element within the array. The same goes for a * 10, for example: the times-10 operation is applied to each element in the array, resulting in a new array with the result of that operation. And this "resulting in a new array" part is very important, because, as we're going to see, NumPy is an immutability-first library: any operation you perform on an array will not modify it, but will return a new array. If we check the state of a, you'll see its elements are the same; it never changed; we created a new array and returned it. There are ways to override this behavior if you want: all the operations we're performing this way also have the interface of plus-equals, minus-equals, times-equals, etc., which will indeed modify the array. In this case, we're making a broadcasting operation that adds 100 to each element of the array, and this operation was a mutation: a was modified, and it didn't return a new array.
If you remember your pure Python skills, the correspondence of vectorized operations is the list comprehension, in which you express an operation for each one of the elements in your collection. A list comprehension is pretty similar to what we're doing with NumPy; the main difference is that NumPy is all optimized, and it's extremely fast. Also, these vectorized, broadcast operations don't have to be only between arrays and scalars; they can also be between arrays and arrays. In this case, we have a and we have b, shown right here, and we can do something like a + b. What happens is that there's an element-wise correspondence: 0 plus 10, 1 plus 10, 2 plus 10, 3 plus 10, and that's the result we get right here. For this to work, you of course need the arrays to be aligned, to have the same shape; but when they are, the operation is extremely fast in memory. These are the vectorized operations we've seen so far.
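A small sketch of the behavior just described (illustrative, not the notebook's exact cells):

```python
import numpy as np

a = np.arange(4)               # array([0, 1, 2, 3])
b = np.array([10, 10, 10, 10])

a + 10   # array([10, 11, 12, 13]) -- the scalar is broadcast to every element
a * 10   # array([ 0, 10, 20, 30]) -- a itself is unchanged
a + b    # array([10, 11, 12, 13]) -- element-wise; shapes must be aligned

a += 100                        # the in-place form does modify a
[x + 10 for x in [0, 1, 2, 3]]  # the pure-Python equivalent, much slower
```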
Why is this topic of vectorized operations so important? Because of what follows: Boolean arrays. And this is a very, very important thing. If you don't completely get it now, I ask you, please, to go and check the exercises we have for this lesson, because we're going to use it a ton, and we're going to see that in pandas the same syntax, the same primitives of Boolean arrays, apply; we're going to use the same things. So why are Boolean arrays related to vectorized operations? Well, all the operations we've performed so far were arithmetic, mathematical operations: plus something, times something, etc. But if you look at the operators available in your programming language, it's not only mathematical operators like plus, minus or times; you also have Boolean operators. And the question now is: what happens when you apply Boolean operators to an array? Given our array a, what ways did we have to select particular numbers? For example, in this case we need the first and last elements. We saw the traditional Python way, a[0] and a[-1]; that's the first way of selecting these elements. We know there's a second way, multi-index selection, a[[0, -1]]. And there is a third way, and this is new: Boolean arrays. In this case, you select elements by passing either True or False for each position, depending on whether you actually want to select the element or not. So if you have four elements, you have to pass four Boolean values, saying: I want to select this element, I don't want this one, I don't want this one, and I do want this element right here. So I get the first one and the last one, and the result will be the same, 0 and 3. So far it's nothing terribly complicated; it's a brand-new way of selecting data, alongside regular Python indexing and multi-indexing. Now, you might be thinking: will I manually write True, False, False, True for, I don't know how many records you have, a million records? That's not scalable; you're not going to sit down and write all those Trues and Falses. But this is actually very important, because these arrays are the ones that result from broadcasting Boolean operations. We saw regular arithmetic operations broadcast like this, but we also have it for Boolean operations. What happens if we ask for a >= 2? Array a is 0, 1, 2, 3, so the result is False for zero, False for one, because they're not greater than or equal to two, True for the number two, of course, and True for the number three. All the individual elements that match this condition get True, and the others False. This is the power of Boolean arrays: we can now combine these operations and write a >= 2 inside the selection, that is, a indexed by the Boolean array that results from a >= 2.
The advantage of this is filtering: we're filtering numeric arrays very quickly with a very familiar syntax, a >= 2, and we just provide that as the index of the selection. It's exactly what's happening right here: we're saying, use this Boolean array, this Boolean list, to select elements. But the question is: how do we construct that list of Booleans? In this case, we've constructed it by providing a predicate, a condition that needs to be matched. The result, again, is filtering; it's a query method: you're looking up data, saying, give me all the elements that match this condition. These values can of course be calculated: you can say, give me all the elements that are greater than the mean. Or you can apply other Boolean operators, for example all the elements that are not greater than the mean, which means less than or equal to the mean. You can also combine conditions with the Boolean operators OR and AND, which in NumPy are expressed with a pipe (|) and an ampersand (&), because we can't overload the regular `or` and `and` keywords in Python; it's a good choice that they selected these. So again, this is the concept of Boolean arrays: we construct arrays of Booleans based on conditions. Given this matrix, we say, I want to select this one, and this one, and this one, etc., and this is the result right here. And we generate the Boolean array dynamically; we never manually type all these values, we don't sit and write True, False, False, True. We just run a query, a filtering operation, a Boolean operation, which results in a Boolean array, and then we use it for filtering. So the idea here is that the broadcast operations we saw before, like a times ten, are also defined for Boolean operators; Boolean operators return Boolean arrays, which can be used for filtering. That's the idea of all of it. And you can even combine these queries: you can say a equals zero OR a equals one, a less than or equal to two AND a is even; you can combine all these conditions. It now looks a lot more powerful than what we were doing before.
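Putting the Boolean-array ideas from this section together in one sketch:

```python
import numpy as np

a = np.arange(4)               # array([0, 1, 2, 3])

a[[True, False, False, True]]  # array([0, 3]) -- explicit Boolean mask
a >= 2                         # array([False, False,  True,  True])
a[a >= 2]                      # array([2, 3]) -- filter with the mask
a[a > a.mean()]                # the condition can itself be computed
a[~(a > a.mean())]             # ~ is NOT
a[(a == 0) | (a == 1)]         # | is OR
a[(a <= 2) & (a % 2 == 0)]     # & is AND; note the parentheses
```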
Moving forward, let's talk very quickly about linear algebra; we're approaching the end of the NumPy lesson. The important part is that NumPy already contains all the most important linear algebra operations, already optimized with low-level routines, so it's going to be extremely fast: dot products, cross products, transposing matrices, it all works as expected. And these can be very important, for example in machine learning, where they're extremely important.
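A minimal sketch of those linear algebra operations:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

A.dot(B)                        # matrix product
A @ B                           # the same product, using the @ operator
A.T                             # transpose
np.cross([1, 0, 0], [0, 1, 0])  # cross product: array([0, 0, 1])
```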
Finally, to wrap up what we saw in the binary explanation at the beginning (which you might have skipped): the difference in sizes between NumPy and Python, and the difference in performance between them. In Python, a regular number, just a regular int, has a total size of 28 bytes. Let that sink in for a second: the total number of bytes, not bits, bytes, that you need in Python to store a number as simple as 1 is 28. Using 28 bytes to store just the number one is super space-consuming; it's not very efficient, and larger numbers will take even more bytes. What's the size of NumPy's integers? Well, we've seen it: we can create integers with eight bytes, and we can create integers with one byte. Here we have np.int8, which as we already know takes only one byte; you have control over how many bytes, or bits, your numbers will take. You can see here the difference between the size of an integer in Python, which at 28 bytes is extremely large, and in NumPy, and also the difference in performance. There's also a dramatic difference in the size of lists, which is significant too, but I want to focus on performance. We have two objects: a Python list with the first thousand numbers, and a NumPy array with the first thousand numbers, and we're going to perform the same operation on both. Let's do the Python one first: we're squaring all the elements in the list and then summing them, expressed as: create a new list, x squared for x in l, then sum everything. How much time does it take? About 321 microseconds. We'll do the same thing with NumPy, np.sum(a ** 2), and you'll see it's a lot faster on the NumPy side than on the Python side. And these are very tiny operations with small numbers. What happens if we make the collections much larger and run the same two operations? As you see here, the units have even changed: we're still at the microsecond level with NumPy, but we've gone to the millisecond level in Python. So as the size of your objects increases, NumPy proves to be extremely fast compared to pure Python.
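A rough sketch of that comparison; exact timings vary by machine, and the %timeit magics only work inside a notebook or IPython:

```python
import sys
import numpy as np

l = list(range(100_000))
a = np.arange(100_000)

# In a notebook, you can time both approaches:
#   %timeit sum(x ** 2 for x in l)   # pure Python: typically milliseconds
#   %timeit np.sum(a ** 2)           # NumPy: typically microseconds

sys.getsizeof(1)            # 28 -- bytes for a single Python int
np.dtype(np.int8).itemsize  # 1  -- bytes for a NumPy int8 element
```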
There are a few other functions you can see here, for example for generating random numbers from a normal distribution, etc. I'm going to leave those for you to explore if you're interested in them. Remember you have the exercises, which can help you solidify all the concepts we've discussed. We're going to move forward now to work with pandas; we'll also see visualizations as we keep moving through this data analysis with Python tutorial. Now it's finally time to talk about pandas, the most important library that we use for data analysis on a day-to-day basis with Python.
It's a library that will aid you through the entire process of your data analysis project. You start by getting the data, step one, from multiple sources: databases, Excel files, CSV files, etc.; that all gets into pandas. Then you process the data: combining, merging, doing different types of analysis. You'll visualize the data, for example with a bar chart; you'll create reports; you'll do simple statistical analysis; and you can even do machine learning close by, with the help of other libraries, but everything from the platform that the pandas library provides. It's, again, one of the most important libraries in the data analysis and data science ecosystem with Python. pandas has recently released version 1.0, so we're talking about a very mature library; it's been around for a long time now. And again, it's the primary library we use in Python for data analysis and data science. I'm going to do a quick introduction to the data structures pandas has, and we're going to understand how they work, so we can start building the foundations. I need you to be very familiar with the way the data structures from pandas work; then we'll move on to other things like reading files, grouping data, etc. To get things started, we're going to talk about the first data structure pandas has, which is the Series. In reality, pandas has two main data structures that it uses all the time: the Series and the DataFrame. The DataFrame is the one you'll probably be more familiar with; it looks just like an Excel table. But we're going to start first with the Series. So just stay with me here.
We're going to talk about the Series for a second. In this case, we have imported pandas, and we've also imported NumPy. As I told you in the NumPy part of this tutorial, NumPy is fundamental for data analysis because every other library, pandas, Matplotlib, sits on top of NumPy, and you can see it right here; we'll be using some features from NumPy within this lesson too. So this is a Series in pandas, what you see right here: the concept of a Series is an ordered sequence of elements, all indexed by a given index, of course. And you might think this looks a lot like a Python list. In this case, we're storing the population of countries, in millions of inhabitants; the variable is g7_pop because we're storing the population of the Group of Seven (you can consult the Wikipedia page). Again, it looks a lot like a list, but we're going to find a ton of differences here. The first one is that the Series has an associated data type. This is something we saw in NumPy: a NumPy array can't hold different types of objects, there's only one type of object, and in this case it's float64, so all the numbers of the series will be of type float64. The underlying data structure pandas uses to store these objects is a NumPy array. A second difference we see very quickly is that a Series can have a name: now, when we display the series, we see that it has a name. It might not make a ton of sense yet, but once the Series is part of a DataFrame, in the form of a column, the name is going to make a lot more sense. Moving forward: again, we saw that the Series has a type, and that's because the data is backed by a NumPy array, which you can always consult. If you check the values of a Series, you get the array that is backing that pandas Series; you can see it's a NumPy array.
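A minimal sketch of such a Series; the population figures are approximate values in millions:

```python
import pandas as pd

# Population of the G7 countries, in millions (approximate figures)
g7_pop = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    name="G7 Population in millions",
)
g7_pop.dtype   # dtype('float64')
g7_pop.values  # the backing NumPy array
```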
Once you have the Series, here g7_pop, you can select elements just as you would in a regular list: give me the first element, the second element, the last element, etc. That's because a Series inherently has an index, similar to a list. When you create a list in Python, say l = ['a', 'b', 'c'], we don't write the index explicitly, but the idea is that there is one: this is index 0, this is 1, and this is 2. In a pandas Series, this is a lot more explicit: each element has an associated index value. And you might think that's pretty much the same thing: the list and the Series are both ordered sequences of elements. But we're going to see that there is a fundamental difference, and it's that we can arbitrarily change the index of a Series. By default, when we created it, we didn't assign any indices, so it got a range index from zero up to n minus one. But you can actually set the index of your series arbitrarily, and in this case this data structure, this Series, now has the indices you're seeing right here. Why is this important? Because now we'll be referring to these values not by a sequential position but by a name, a label, an index that has a meaningful name for us humans. So now this thing looks a little more like a dictionary, we could say, than like a list: we started thinking that a Series was similar to a list, but now we can think of it as similar to a dictionary. But wait, don't get me wrong here: the Series keeps a fundamental trait, which is that it's still ordered, something that doesn't hold for dictionaries. Dictionaries in Python are not ordered (actually, from Python 3.7 on they preserve insertion order, but we usually shouldn't rely on thinking of them as ordered data structures). A Series, in contrast, is ordered, so it has both advantages: Canada always comes before France, as we decided when we created it, but it also has names, labels, keys associated with the values, like a dictionary. That was creating the Series from scratch step by step; with all these methods, you can also create a Series passing the index directly. It doesn't have to be a two-step process where you first create the Series and then add the index; you can do everything at once. And the indexing is now done by those indices: the labels that make up the index are used to access specific data. g7_pop, we see, has these countries with these populations. Before the index, if we wanted to get the population of Canada, we had to remember what Canada's position was: oh, it's the first of the countries, so g7_pop[0]. With the index, we can just consult: what's the population of Canada, what's the population of Japan? And as you can see, the syntax is the same as with a Python dictionary: you pass the key and you get the value. So, in summary, the advantage of a Series is that it's an ordered sequence of elements, backed by a NumPy array, very efficient and very fast, but it also has an index that can take any labels we pass, which makes it a lot better for indexing.
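A sketch of creating the Series with its index in one step, using the same approximate figures:

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 Population in millions",
)

g7_pop["Canada"]  # 35.467 -- dictionary-style access by label
g7_pop["Japan"]   # 127.061
```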
You can still get elements by their sequential position: after all, it's a sequential data structure, and it doesn't matter that you have an index. If you have an index but you want the last element, or the first element, or the second element, you do that with the iloc attribute: from this series, locate by sequential position the element at position zero, or the last element. That still works as expected. Series also support multiple indices, as we saw with NumPy: you can pass several labels and get several elements out, and the same works with sequential multi-indexing. Series also support ranges, selection by slices, but there is a fundamental difference here; this is very important, pay attention. It's different from Python: in Python, the upper limit of a slice is not included. From the list we created before, if I do l[:2], I don't get the element 'c': this is index 0, this is 1, this is 2, and 2 is not included. In a pandas Series, when slicing by labels, the upper limit is indeed included: if you ask for Canada up to Italy, Italy is in the result. This is something to keep in mind when using index selection in pandas. I think it's still valid, and I understand the reasoning behind it; it's just different from Python.
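Continuing with the g7_pop Series sketched above:

```python
g7_pop.iloc[0]               # 35.467 -- first element, by sequential position
g7_pop.iloc[-1]              # 318.523 -- last element
g7_pop[["Italy", "France"]]  # multi-index selection; returns a Series

# Label slicing: unlike Python lists, the upper limit IS included
g7_pop["Canada":"Italy"]     # Canada, France, Germany AND Italy
```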
You should remember Boolean arrays, a topic we discussed in our previous lesson on NumPy. Boolean arrays are still a thing in pandas; the difference is that instead of Boolean arrays, we should say Boolean Series. The idea is that we can perform operations on top of Series. Right here, for example, we have mathematical operations on a Series: we have g7_pop, which as I told you at the beginning is in millions of inhabitants. If we want the series in plain units, we do g7_pop * 1_000_000, and there we go, now it's in units. These vectorized, broadcast operations can also be performed with Boolean operands. So instead of a multiplication, an addition, a subtraction, etc., we can use a Boolean operator. In this case, we ask: which countries have more than 70 million inhabitants? The result we receive is a Boolean Series; it's basically the same concept as a NumPy Boolean array. Canada and France do not have more than 70 million inhabitants; Germany does have more than 70 million (it's around 80); the same for Japan, and the same for the US, which is also past 70 million. So again, the Boolean array, or Boolean Series in this case, works the same way as with NumPy. And selection also applies: I can now say, give me, from this series g7_pop, all the countries that have more than 70 million inhabitants, all values above 70. So again, we are building filtering, a query language if you want, on top of pandas: we're selecting data based on a condition. If you ever have trouble remembering how all this works, the idea is that you can always track down the way the index is being built. It's not that the selection knows anything about how to pick countries with more than 70 million: the comparison was performed first, which resulted in this Boolean Series, and then the original series is indexed by that Boolean array; the result is what you see. And again, these conditions can use calculated values and all the operators we saw in our previous lesson: NOT (the tilde, ~), OR (the pipe, |) and AND (the ampersand, &), and they can all be applied in any order you want. If we read this expression, which is complicated on purpose, it says: give me all the elements that are above the mean minus the standard deviation, or below the mean (the exact expression doesn't matter; it's just an OR between two conditions). So the OR, the AND and the NOT all work with Boolean selection as well. The operations we saw from a mathematical and statistical perspective in NumPy, sum, mean, standard deviation (we actually just used standard deviation), are all still relevant here, and you can also use traditional NumPy functions on a pandas Series, because again, a pandas Series is internally backed by a NumPy array; it's all the same. Here is an example that's a little clearer: we get all the countries that have more than 80 million inhabitants AND fewer than 200 million inhabitants; it has to be above 80, but it also has to be below 200. Or, in this case, we say either above 80 OR below 40. That's with the OR and NOT operators.
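Continuing with the same g7_pop sketch:

```python
g7_pop * 1_000_000                      # broadcast: population in plain units
g7_pop > 70                             # a Boolean Series, indexed by country
g7_pop[g7_pop > 70]                     # Germany, Japan, United States

g7_pop[(g7_pop > 80) & (g7_pop < 200)]  # AND: between 80 and 200 million
g7_pop[(g7_pop > 80) | (g7_pop < 40)]   # OR
g7_pop[g7_pop > g7_pop.mean()]          # conditions can be computed
```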
Modifying a Series is relatively simple. Whenever you have a value, you can just assign it outright. In this case, we're saying Canada is now 40.5; I don't know why, we just wanted to do it. That's by index label. You can also do it by sequential position: here we say the last country should now have 500, and you can see right there that the last country now has 500. Or you can modify elements based on Boolean selection: all the countries that have fewer than 70 million inhabitants, all the ones from our previous query, will now be 99.99, and as you can see, it has changed all those countries. So assignment works by direct indexing, and it also works by Boolean indexing, and this is going to be extremely important when we are cleaning data.
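A sketch of those three assignment styles, again on the g7_pop sketch:

```python
g7_pop["Canada"] = 40.5      # assign by index label
g7_pop.iloc[-1] = 500        # assign by sequential position
g7_pop[g7_pop < 70] = 99.99  # assign via Boolean selection
```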
Let's move forward and start talking about DataFrames now. Before that: you have exercises both for Series and for DataFrames, so I recommend you check them out.
to look like. It's pretty much the same thing. us an Excel table. So this was our series
and this is going to be our data frame. It's a table. So it looks a lot like an Excel spreadsheet.
And actually, it's very common to create pandas data frames out of CSV files, which are tables
basically. And I'm going to create it we created with these data frame object I created. There
you go, these are data frame. And as you can see, right, it has columns that we have assigned.
In this case, we were designing the columns, and we have rows of values right below each
one of these columns. Why? What's the similarity with with series, and it's not a data frame
column will be basically a series. So we can think a data frame is a combination of multiple
series one per column, we're going to assign an index to the data frame the same way that
we did with our series. So in this case, this is our data frame. Sorry, right here. This
is our data frame that has the index, right? And it has the columns as we had before, what
columns Do we have, what's the index of the data frame, these are all attributes that
you can consult, there are a couple of very interesting methods from data frames that
There are a couple of very interesting methods on data frames that we use all the time. The first one is the info method, which is going to give you quick information about the structure of your data frame. It's going to tell you what columns you have (Population, GDP, Surface Area, HDI, Continent), and it's also going to tell you the types and how many null values you have. Actually, it's telling you how many non-null values you have; we use this when we're cleaning data, to quickly identify the columns that have missing values. We can check the size of the data frame, and we can check the shape; this is similar to a matrix, since a two-dimensional array in NumPy is pretty much a data frame. And also, similar to info, which gives a summary of the structure of the data frame, we can use describe, which is going to give you a summary of the statistics of the data frame. In this case, what we see is that it works per numeric column. Only the numeric columns appear: Continent is not here, for example, because its type is object, basically a string. For all the numeric columns we're going to have summary statistics. So for example, for Population: how many elements we have, what's the mean (the average), what's the standard deviation, the minimum, the maximum, and in between, a couple of percentiles: the 25th, 50th and 75th. So these are quick summary statistics, and we do this a lot. So keep in mind, the describe method is very popular.
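In code, those inspection methods look like this:

    df.info()      # columns, dtypes, non-null counts
    df.shape       # (rows, columns), like a NumPy 2-D array
    df.size        # total number of cells
    df.describe()  # count, mean, std, min, max and percentiles per numeric column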
As you could see in the info method, the columns have associated types, and this is very important. Continent is an object, which means it's basically a string; HDI is a float; and Surface Area is an integer. That's because pandas, through NumPy, is automatically recognizing the correct type to assign to each one of the columns. This is similar to what we saw with a series, in which the series had an associated data type; a series is bound to a given data type, and that's something you cannot change per element. And in this case, checking dtypes gives you a quick reference of the types of your columns. So moving forward: how will we be selecting data from data frames? Well, there are a couple of methods, and this might be a little bit confusing, so what I'm going to do is skip ahead and give you a quick reference first, and then you can read, if you want, through the process we follow here. Given a data frame, there are just two quick rules. First, you're going to select rows by index using the loc attribute. The loc attribute will let you select individual rows: for example, we loc 'Canada', and that's the row for Canada. Second, the iloc attribute will let you select rows, similar to a series, by sequential position. Let's say we want to select the last row; in this case, it's the United States of America. So again: loc lets you select rows by index (give me the row under this index), and iloc lets you select rows by sequential position (give me the last row, the first row, the second row, etc.). And finally, without using loc or iloc, just by saying df of something, you are selecting that column: give me the entire column, Population, right here, the entire Population column. So what you're seeing here, first of all, is a quick reference: loc will give you a row by index, iloc will give you a row by position, and just doing df of something is going to give you the column you are passing. So it's like both loc and iloc work in a horizontal manner (give me this row), while df of whatever works in a vertical manner, getting you a given column. But something more interesting here is that all the results, this one and this one and this one, are all series; what's being returned are series. That's what we saw before. And the way it works: if we focus on this last example first, we're going to see that it's pretty standard; this series right here is the one returned, and it has a type and everything, so that's fine. If we ask for a row, like in this case where we get, for example, Italy, the result is also a series. But what you can see here is that this thing is kind of transposed in a way: what was a column name is now the index of the series. Population is here, GDP is here, Surface Area is here, HDI and Continent are here, and you have the values. So it's being transposed, right, from vertical to horizontal, into our regular series form, and the name of the resulting series is the value of the index that the row had. So you can read more about it right here, but I just want you to remember these rules: with loc you select by index, with iloc you select by sequential position, and df of something gets you the column. There are times when these might not apply, or you might not want to apply them, and there will be some issues; for example, if your index is numeric, you might have issues with this form. But for now, respecting these three rules is going to get you any element you want, either by row or by column.
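Here are the three rules in code form, using the countries data frame from above:

    df.loc['Canada']     # row by index label, returned as a series
    df.iloc[-1]          # row by sequential position (the last row)
    df['Population']     # no loc/iloc: the Population column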
So, from what we've seen, all the slicing also works as expected. We can get, for example, up to Italy, or we can get from France up to Italy; the upper limit is included. But again, this is with loc, and we're selecting by indices, from France to Italy. We can also use the second dimension, similar to the way we worked with NumPy: we can add a second dimension here, and we can get all the countries from France to Italy, including Italy, but only the Population column, or Population and GDP. So here you can see the second dimension, the concept of multiple dimensions in selection, being applied to data frames. For iloc, it works in the same way, with the multi-index and the slicing: we get, for example, from 1 to 3 in sequential positions. In this case, the upper limit is not included, so that's another difference from loc. And we can also do multiple dimensions: we can say give me the countries from 1 to 3, and the column should be the one at index 3 (counting 0, 1, 2, 3), which is HDI. So that also works as expected. And again, the recommendation: always use loc and iloc to select rows, and just use the naked data frame to select columns, as we saw before.
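A sketch of that slicing, in both dimensions:

    df.loc['France':'Italy']                         # rows by label; upper limit included
    df.loc['France':'Italy', 'Population']           # rows plus one column
    df.loc['France':'Italy', ['Population', 'GDP']]  # rows plus several columns
    df.iloc[1:3]                                     # rows by position; upper limit excluded
    df.iloc[1:3, 3]                                  # plus the column at index 3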
Now moving forward: conditional selection, boolean arrays, boolean series, whatever you want to call it. This also works for data frames, and it's very important: it's a way to filter data, a way for us to query the data. In this case, what we have is that we want to select all those countries for which the population is greater than 70; so, all the countries that have more than 70 million inhabitants, similar to what we did with a series, but in this case we want to do it with a data frame. What you're going to see here is that we're going to construct a boolean series, as we did in our previous video (every country with more than 70: False, False, True, False), and we're going to inject that result, that boolean series, into a dot-loc selection: give me all the countries which match a True value in it. And remember, this is kind of a mnemonic, a way to remember it: the way pandas knows how to filter things is by matching the index of the resulting boolean series with the index of the data frame. These are two different objects, completely different objects, but their indexes match. Here Japan matches, Germany matches; Germany and Japan are the same in both, and that's why this thing is working as expected. This is just the first dimension (give me these rows), but you can also add the second dimension, saying give me this column, or these columns. So that still works as we desire.
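In code, the whole thing reads like this:

    df['Population'] > 70                          # a boolean series aligned on the index
    df.loc[df['Population'] > 70]                  # rows where the condition is True
    df.loc[df['Population'] > 70, 'Population']    # the same rows, one column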
So what about dropping stuff? Whenever you have a data frame, you can say give me just these pieces, or you can say drop the others; it's pretty much the same. Dropping is very simple: you can drop by index (drop this value, drop Canada altogether), or drop several indices, Canada and Japan. Or you can also drop columns: drop Population and HDI as columns. These methods also have a more advanced usage, which is with axis, similar to NumPy. I don't recommend it so much, but you can still use it, and you can see it here.
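A sketch of those drop variants:

    df.drop('Canada')                       # drop one row by index
    df.drop(['Canada', 'Japan'])            # drop several rows
    df.drop(columns=['Population', 'HDI'])  # drop columns instead
    df.drop(['Population', 'HDI'], axis=1)  # the more NumPy-like axis form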
All the operations we've seen so far keep working. The most important part here is the broadcasting operation that we're going to do between series. We're going to create a new series, crisis, and I'm going to show you what it looks like: so we have here crisis, and we're going to perform a broadcasting operation between the two, this data frame and the crisis series. The result will be that we subtract 1 million from each value in the GDP column, and we subtract 0.3 of HDI from each one of the HDI values. So what you can see here is, again, this alignment between columns and indices: the GDP here is matched with this GDP, and the HDI is matched with this HDI. They are two different objects, two independent objects, this series and this data frame, but when we combine them with an operation like this, the columns are aligned, GDP with GDP and HDI with HDI, and they work together. So it's going to subtract this value from all the values in this column, and it's going to subtract this other value from all the values in that column. That's the way it's going to work.
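A sketch of the crisis example, with the numbers described above:

    crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
    df[['GDP', 'HDI']] + crisis   # columns align by name: every GDP loses 1M, every HDI loses 0.3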
So moving forward, what about modifying data frames? I want to show you something first. When we were dropping stuff before, we were not actually modifying the data frame. Here we did df.drop('Canada'), but df still has Canada in it. That's because, similar to what happened with NumPy, these operations are all immutable: they are not changing the underlying data frame. We are creating new data frames that store the result of the given operation. So in this case, when you drop Canada, the result is this new data frame, but the underlying data frame is not changed, because again, these are immutable operations. 99.9% of operations in pandas are immutable. There are ways to change that, ways to make the changes permanent, but for now I want you to just think that everything is immutable: whenever you perform an operation, it's going to create a new series. If you want to keep track of the result, you will just need to do something like df2 = ..., or even df = ..., to replace the current data frame. Again, there will be a way to avoid that, but we're going to see it in a sec. So, modifying data frames more explicitly: how can you create a new column? Well, very simple: assign a column. Let's say, in this column right here, it's similar to saying: here, Language. Oh, this is just read-only; but if I could say language equals, I could just write whatever I want. In this case, what we've done is assign the langs series. Let me show you what langs had: it was a tiny series, it didn't have elements for all the indices in the data frame, but that doesn't matter. pandas will match all the indices that actually exist, and it will leave the rest blank. This NaN is what we use for a blank; it's the not-a-number value from NumPy, and we're going to talk more about it when we start doing data cleaning. So again, langs had France, Germany, Italy; you can see the values are all up there. What happens if you want to change a value? The Language column already exists, and you want to change it. In this case, we're going to say df Language equals 'English': we're changing it altogether, df will be affected, and all the values of Language will be English.
of language will be English. How can you relate How can you realize when there is an operation
that is changing the underlying data from the underlying series or than the line NumPy
array, it's usually when you have an equal symbol, remember, NumPy, we saw something
plus equals, in this case, whenever you have a plus and equals symbol is you're modifying
the underlying data frame. So for example, check this out, the Rename
function or method of a data frame will let you pass columns and indices to rename. So
in this case, we want to change the United States to USA, the EU, United Kingdom to UK
and Argentina to AR, Argentina doesn't exist in this data frame. But that doesn't cause
a problem. And that's why we want to show you, the US, UK were modified correctly, and
HDI was modified correctly. And a PC which doesn't exist, didn't cause any problems.
Now, why am I showing you this because remember, these operations are immutable. If I check
what's the state of the data frame, we see that the original data frame has not been
changed HDI a steel HDI, it doesn't matter if we renamed it before, it's still the same
data from the same thing for days, indices, all these operations are immutable. A few
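A sketch of that rename call; the new HDI column name is an assumption for illustration, and keys that don't exist, like Argentina here, are silently ignored:

    df.rename(index={'United States': 'USA',
                     'United Kingdom': 'UK',
                     'Argentina': 'AR'},
              columns={'HDI': 'Human Development Index'})

This returns a renamed copy; df itself is untouched.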
A few more examples of modifying data, just for you to look at. Something that is very common for us is creating columns that are combinations of other columns. Again, this is read-only, but you can imagine what I could do here: something like, for example, GDP per capita. If I go here and create GDP per capita, I'd say it's equal to the GDP column divided by the Population column; in a spreadsheet I'd do something like C3 divided by B3, and then we would extend the values all the way down. In pandas, we can do something very similar: we can take any columns and just perform broadcasting operations between them, in this case GDP divided by Population, and we can assign the resulting series to a new column. So GDP per Capita, there you go, is now a column of our data frame. Again, all these broadcasting operations are extremely fast, they are backed by the underlying NumPy arrays, and they result in a series.
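That column combination is one line of code:

    df['GDP Per Capita'] = df['GDP'] / df['Population']

The division is a broadcast operation between two series, and the resulting series becomes the new column.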
So, very quick statistical information: there are a few methods to do summary statistics. We saw them with the describe method, but minimum, maximum, mean, median, all of that works as expected. Something that I want you to note here, if possible (I'm going to change colors here, we're going to use red), is that with pandas you have this concept of a data frame, a data frame that has multiple columns and multiple rows, and these statistical operations result in just one series. So in pandas you have your data frame, you have your series, and we could say you have individual numbers. The data frame keeps resolving down: some operations will just return a series, and the series can in turn be used in a data frame. In this case, an operation resulted in a series, but then we immediately used that series to set the value of a column. That's why understanding series is so important. There are a few more assignment exercises for you here, so you can check them out and complete them; it's going to make a little bit more sense once you're working with it.
Finally, I want to give you a very quick introduction to reading external data and plotting. To do that, we're going to use a few functions that are very popular; maybe we can look them up very quickly here. We can search for read_csv: the read_csv function from pandas. And just as we have read_csv, we actually have a few others: read_sql, read_excel, read_json, there are multiple of them; read_html will be able to automatically parse an HTML page and read it. A few functions like these will let us import data from an external source into our pandas workflow. In this case, what we're going to read is this BTC market price file. It's right here; if I open the CSV, this is what it looks like: the date the price was taken, and the value; the timestamp, and the value, the price of Bitcoin, in 2017. Now it's close to $9,000, I think, but that's just a note aside. Again, this is a CSV, and this is the CSV that we're going to be reading. To do that, again, we're going to use the read_csv function; it will automatically parse the CSV, as expected, and there you go. The process now will be for us to start tuning it to get to the right point. So I'm going to show you a few customizations we can do with the read_csv function. First, let me tell you: we have a ton of parameters here, so there is a ton of customization to do with read_csv. You will not remember all of this, you will not remember everything off the top of your head, so don't worry: you can always go back to the documentation and just practice, it's going to come naturally. So the first thing: the first row of the CSV was considered to be the column names. In this case, this file doesn't have column names; let's say I add them, typing timestamp and price, save the file, and re-read it. There you go: by default, pandas is assuming that the first line of the CSV holds the column names. I'm going to revert the file to what it was, and show you again; that's the assumption that pandas is making. We're going to change that assumption, of course, because in this case our CSV file does not have column names: we're going to say header equals None. And this is where we start seeing the parameters that we're going to use with the read_csv function. When I pass header=None (for us it's going to be None), that means don't read a header, and don't try to infer a header from the CSV file. And now the columns are 0 and 1. So next I'm going to change the columns, and set them to be Timestamp and Price. Now what I'm going to do is show you the first rows; you're seeing here this df.head method that I'm using, and that's because this is a significantly large file. Well, not that long, but at least it doesn't fit on my screen. What's the shape of the data frame? It has 365 rows, and we have two columns. We can do df.info, for example, to have a little bit more reference: we have 365 values, there are no null values, and Price is actually a float, but Timestamp is an object, and we're going to fix that in a second. By the way, df.head and df.tail are the methods we use to get either the first n rows or the last n rows, which are five rows by default; you can change that and say show me the last three rows, for example. And again, the types: the Timestamp column was not properly parsed as a date, it was parsed as an object, a string, which we don't want. So we're going to use the function pd.to_datetime, something we're going to explore in more detail in the data cleaning part of this tutorial, to turn the df Timestamp column into an actual date. And then we say df Timestamp equals the result of that function, and now everything looks as expected. There is one more change that we want to make: we want to set the index of the data frame to be the Timestamp, because by doing so we can quickly access price information. Let me see what the price of Bitcoin was on 2017-09-29; and I made a mistake here, I forgot to use loc. There you go, we have the value of Bitcoin on this particular date. I forgot loc: remember that to get the value from a particular row, you have to use dot loc. There we go, we are getting that particular value. Because we've made the Timestamp the index, we can get that value directly from the index.
So what happens if you want to turn this thing into an automated script? For example, you want to run this process every day at 5am, whatever: you want to read the CSV, set the columns, rename them, turn them into timestamps, etc. This is what we've done so far: read the CSV without a header, create the columns, turn the timestamp into a datetime, and assign it to the index. And that's the result. Well, actually, the read_csv function is so powerful that it will let us do all these actions in just one call. There are parameters that will let you customize the behavior to achieve the same results that we got with four lines of code right here. So in this case, we're going to say: read this CSV; don't assign a header (or rather, don't infer a header from the first line); these are the column names, so we don't need an extra line, we can just pass them directly; oh, and by the way, the first column is going to be the index of the data frame; oh, and also parse the dates, since the index is a date. And we have the same result as before. So now I'm going to try the same thing: there we go, you can see it works.
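The same four steps collapsed into one call, roughly (same assumed file path as above):

    df = pd.read_csv('data/btc-market-price.csv',
                     header=None,                   # don't infer a header
                     names=['Timestamp', 'Price'],  # assign the column names
                     index_col=0,                   # first column as the index
                     parse_dates=True)              # parse the index as dates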
So, very quickly: pandas plotting. What we're going to be doing here (I don't know what this vertical scrolling thing is) is showing you very quickly that creating plots with pandas is a breeze; it's so simple to create a plot. In this case, given a data frame, you can always invoke the plot method. What the plot method is doing is using the matplotlib library, something that you can check in the docs if you want, but for now it's not necessary; this will be more than enough. It's just using the regular matplotlib library, which is part of the standard PI Data stack. And again, accessing it from pandas is extremely simple: just df.plot, and you're done. You can customize the plot as you want; we're going to see more details of matplotlib later, so don't worry too much about that. There is a more challenging example here that I can just run very quickly; you can inspect the process we followed to fix the data, but this is what we have, there we go.
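The one-liner behind all of this, plus a hedged example of customization:

    df.plot()                                        # every numeric column against the index
    df.plot(figsize=(16, 9), title='Bitcoin Price')  # optional size and title tweaks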
And what you can see right here is the difference between Bitcoin and Ether in this period of time, both loaded in the same chart. That's because this is the resulting data frame: we have Bitcoin on one side and Ether on the other side, and we are plotting it right here, creating one plot with all of it. And we notice this empty value right here. So what we can do is go from December 1st up to January 1st; we can select that period using loc, and just go ahead and plot it again. And this is what you see right here: the gap that we were seeing. So again, this was the introduction to pandas. We have a real life example of pandas following up, and also a little bit more on data cleaning, and on reading all the interesting files and sources of data, getting more data into the pipeline. The idea is going to be showing you how you can import data from Excel, from SQL, and then do the actual processing and analysis.
Now it's time to talk about data cleaning. We have arrived at that point in our tutorial where we have pulled the data, and I've shown you how to manipulate it with pandas, at least as an introduction to data manipulation with pandas; now it's time to properly fix it. For the sake of brevity, we are skipping a few parts of the data cleaning process. In particular, in this first notebook you're going to find the basics, the conceptual part of missing data in Python and NumPy, and we're going to skip a few other things; I'm just going to mention them in a pretty general form, and then you can of course dig deeper. You can check our courses if you want to know more about it. Usually, when we talk about data cleaning from a more conceptual level, we talk about a multi-step process. The first step is usually finding missing data, which is the simplest problem to identify in a data set: something is missing. Say you have car sales data, and there is a car that has no name, or a car that has no price; there is a number missing, or a category missing, or a string missing, and of course each one of those is going to have a different meaning. Fixing a data set that is missing data can be very simple: you can just drop the record, for example, or you can fill in the value (say the price is missing, you can fill it with the average value of the sales data, or something like that). Or it can be very complicated, if the value is important and you can't move forward until you actually find that missing value. It can involve something like picking up the phone, calling your ETL team, and asking what's going on with the missing data; or, if you're buying the data, you may have to call the vendor and ask them why the data is missing, given that you're paying for it, etc. So it can be a very political process; it depends on your use case. But again, from a technical perspective, identifying missing data and fixing it is going to be extremely simple. Once you have fixed the missing values, you keep looking at the data, assuming the data is not clean yet in this data cleaning process. The second step is when there are invalid values. For example, you have a column that is price, and there is a string within it; we're expecting only numbers, and there are strings in it. That's not going to be complicated to identify, and it's not going to be too complicated to fix. But as we keep increasing the complexity, towards the end of this data cleaning process we're going to reach problems that have to do with the domain of the data you're looking at. For example: you have a column that is customer age, and there is a value that is 170. That is not an invalid value; it's a perfectly valid integer. The problem is the domain: speaking about customer age, it's highly unlikely that a customer is 170 years old. So in that case the value is completely valid, there is no missing data, there are no invalid values, etc.; it's just about the domain. And this is where things get very complicated, because this example of age is something that resonates with all of us; we know about the ages of humans. But if you're working as a data analyst in a domain that you don't know much about, then you might not be able to judge whether a value is invalid or not. If I am working in a biology lab and I have something like white cell count per milliliter of blood, I don't know what's a good value or what's an invalid value; it's something for which you need to know the domain. So that's usually the most complicated part of data cleaning: when you reach the point where everything is valid, everything checks out, and now you need to make sure each value is valid for the domain you're working in. So again, this is the spectrum that we're going to be revisiting today.
To get things started: the way pandas works with null values is with four functions, which are actually synonyms; it's going to be relatively simple, just trust me on that. A few things first. Everything that pandas does in the process of handling missing values is related to the way NumPy works; again, we're skipping that part, but you can go to the notebook and check it out by yourself, it's extremely simple. NumPy has this object, nan (not a number), to identify a missing or null value, and in the Python world we have the None value. But again, in pandas and NumPy we're going to use nan and None. At the beginning we have these two functions, isnull and isna, which are complete synonyms; we also find notnull and notna, and they're also complete synonyms. So null and na, for pandas, are the same; you can use the one you prefer. Sadly, I like isna because it's the way I learned it, but for my students I usually recommend isnull, because it feels more correct, more self-explanatory. So use the one you prefer: if you can use isnull, I think that's going to be better, and if you get used to isna, then you're going to be on my side; just do whatever you prefer. So again, isnull is going to say True or False, depending on whether the value is null or not, and of course notnull (or notna) is going to be the opposite: notnull of nan is False, and notnull of 3 is True. In that first notebook, you're going to see all the falsy values and the truthy values in detail. In terms of Python, anything that is not empty or None, etc., is going to be considered truthy; so anything you pass here that, again, is not an empty string or a null is going to be considered a truthy value. Now, isnull and notnull (or isna and notna) also work with entire series, or entire data frames; it's not just for one value. You can pass an entire series, and the result you get back tells you which values in the series are null or not null, depending on the question you're asking. In this case, we ask which values of the series are null: this is not null, this is not null, this is null; so only this one is True. And the opposite for the following function we are applying.
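A minimal sketch of those four synonyms:

    import numpy as np

    pd.isnull(np.nan)   # True
    pd.isnull(None)     # True
    pd.isna(np.nan)     # True (same function, different name)
    pd.notnull(3)       # True

    pd.isnull(pd.Series([1, np.nan, 7]))   # element-wise: False, True, False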
And again, the same thing works with an entire data frame. Something we usually do when looking into null and not-null values, a few hacks we usually apply, is the count, or actually the sum, of all the null or not-null values. We have this entire series, and we can ask how many not-null values we have: if we sum those not-null values, we get as a result the total count of the not-null values we have. And the same thing is going to happen if we use isnull: if I do isnull dot sum, we're going to get how many nulls we have, which is pretty much the opposite question. The way it works is that in Python, booleans are pretty much integers: they're ones and zeros, so every True value is going to count as one, and every False is going to count as zero. If you ask for the sum of a boolean series, you're going to get back the number of Trues available in that series. So in this case, we have two null values; we ask how many null values we have with isnull dot sum, and we get two. You can use these tricks to filter the data in a series: in this case, we can say give me all the values that are not null. Right? Just the notnull ones.
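In code, the counting trick and the filter look like this (np comes from the import in the previous sketch):

    s = pd.Series([1, 2, np.nan, 4, np.nan])

    s.notnull().sum()   # 3 valid values
    s.isnull().sum()    # 2 nulls
    s[s.notnull()]      # keep only the valid values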
Also, something interesting: both for data frames and for series, the notnull, isnull, isna and notna functions also work as methods. So in this case, instead of pd.isnull(s), we can say s.isnull(); it gets a little bit simpler. But if the final objective of this whole expression, selecting only the values that are not null, was to drop the null values, then there is a simpler form, which is dropna. In this case, we can say s.dropna(), and we're basically invoking the same thing that is happening here: we're just excluding all the missing values in the series, or the data frame, because this also works for data frames. One important thing to remember here is that all these methods are immutable: we are not actually changing or modifying the original series. The underlying series is not being modified; a new series is returned. So if I invoke s again, you see it was not modified: we created a new series, and that's the one that doesn't have the missing values. Everything we've said also works for data frames. So right here, with this data frame, the first thing usually is to start with the info method. We have info, and we see that there are in total four entries, four rows; we can also check the shape if we need more information about the structure of our data frame. So there are four rows, four entries in our index. Column A has only two non-null values, which means there are two values that are actually null; column B has three non-null values, so that means one value must be null; and so on for each column. Usually, info gets you very close to understanding the structure of your data frame and how many values are missing. The same thing happens with sum: we can just do df.isnull() (or isna) and then sum, and we get a quick reference of how many null values we have in that given data frame. dropna works in the same way, but there is a significant difference: the way dropna works in a data frame, by default, is by dropping any row that has at least one null value. So this row has a null value, dropped; this row has a null value, dropped; this row has two null values, dropped; this is the only one that is not being dropped. It's very harsh in that respect. You can change that to make it work on columns instead, only keeping the columns that have no null values, by switching to axis equals one. And there is also a way to select a subset, or thresholds; for example, only delete rows that have fewer than three valid values. In that case, you're going to use the how or thresh parameters of dropna: you can say drop the rows (or the columns, because it also works for columns) that have all their values null; or, the default behavior, drop all the rows that have any value as NA; or specify a threshold, by which you basically say: I need this amount of valid values in order to keep the row. That's the way it works: which ones to drop and which ones to keep, based on the threshold.
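A sketch of those dropna variants on a data frame:

    df.dropna()            # default: drop any row with at least one null
    df.dropna(axis=1)      # drop columns with nulls instead
    df.dropna(how='all')   # drop only rows where every value is null
    df.dropna(thresh=3)    # keep rows with at least 3 valid values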
So, once you have identified the null values, it's extremely simple to clean them, sorry, to fix them. The first method we're going to see is fillna with a particular value: we're going to say, for this series, I want you to fill the blanks, fill the missing values, fill the NAs, with the number zero in this case. So these two are now zero. Or, of course, you can use any statistical value you want; in this case, we can use the mean. Remember, this is not altering the series: the original series is still the same, we're not changing it; it's creating a new series, because all these methods are immutable.
The other way this method works is by passing a method argument, which is forward fill or backward fill; those are the possibilities. Basically, the way it works is that it flows the values top down, at least with forward fill: starting here, it's dragging this value down, dragging this value down, and dragging the 3 down here. As this spot is a nan, it gets replaced, so this thing is 3 now, which gets dragged down again, and now this one is 3 too. So that's what we have right there. And of course, backward fill works the other way: it starts with 4 and moves it up here, and then up here, etc. You have to be careful when using these, because if you have null values at the very beginning or the very end, then you're going to end up with null values anyway: there is nothing to fill forward from, if the first value you have is NA. And all we've seen also works for data frames, both backward fill and forward fill, in terms of either rows or columns. So we have this data set: if we do forward fill row-based, it's going to be 1, 2, and then 5; that's forward fill with axis equals one. If you use forward fill with axis equals zero, then it's a vertical filling: it's going to go down the column, 1, 1, 30, 30. So it's either forward-filling in this direction, or in that direction, depending on the axis you are passing. And let me put down the correct forms: with axis equals zero it's column-based, the vertical direction; with axis equals one it's row-based, the horizontal direction. So we had a null value here, and it got filled in this way.
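A sketch of the fill directions; note, as a side remark, that newer pandas versions prefer the ffill() and bfill() methods over the method= argument:

    s.fillna(method='ffill')   # drag values downward (newer pandas: s.ffill())
    s.fillna(method='bfill')   # drag values upward   (newer pandas: s.bfill())

    df.fillna(method='ffill', axis=0)   # fill down each column
    df.fillna(method='ffill', axis=1)   # fill across each row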
Okay, moving forward, what else do we have? We have here checking for values, and we've pretty much seen this already: you can use isnull plus the sum method to get how many null values you have. And there are also any and all, which give you a very quick answer; these are usually called boolean tests. You can ask if any value is null, or if all the values are valid; it's just for building more complicated queries. So far, so good. The process, we said, was: at the beginning, we were fixing missing data, missing values, where there is nothing in there. We have read a data frame (where's our data frame? right here), we have read our data frame from a CSV, from a database, and a value is missing; there is a hole in it. We quickly identified it with isna or isnull, we were able to drop the ones we didn't want to keep with dropna, or we were able to fill the values we wanted to fill with fillna. That was simple: isna, dropna, fillna. Now, what happens when you're cleaning data that actually has a value, so there is nothing missing, but those values are invalid?
For example, here the sex column is a categorical column that only accepts M and F; D and question mark are invalid. It's very simple to see an invalid value here, because it's completely out of scope. The same happens when we have, for example, a question mark in the age column: we have a string in the age column, and it's very simple to identify that. How are we going to clean those? Let's start with sex first, because it's simpler. In this case, the first check we can do is with either unique or with value_counts; I'm going to use value_counts. We've seen this method before: it's a quick summary of all the unique values you have, and in this case value_counts also gives you a total count for each of those values. How can you fix them? Well, there is a replace method, which is extremely intuitive. You can just replace: in this case, we're changing all of these D's to F's and the N's to M's, and it can work on multiple columns.
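A sketch of replace, assuming the column names used in this demo:

    df['Sex'].replace('D', 'F')                # one value
    df['Sex'].replace({'D': 'F', 'N': 'M'})    # several values at once
    df.replace({'Sex': {'D': 'F', 'N': 'M'},   # per-column mappings
                'Age': {290: 29}})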
For those values that, again, we said were more complicated to fix: in this case, we know the age, 290, is invalid, and we know it because we know the domain; 290 is an invalid age for a human. In those cases, we're usually going to need more complicated fixing, and it will involve more programming; that's the reality, you have to be better at coding. In this case, we know that this value is invalid because it probably has an extra zero: you're pulling a CSV with ages, and there are values like 180, 290, 320, for example, invalid values among the hundred or so records, and that's because there were typos when the ages were entered. So how are you going to fix that? Well, in this case it involves a little bit more programming: we're dividing everything by 10.
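That fix, sketched with boolean selection; the cutoff of 100 is just the rule of thumb assumed in this example:

    # every age above 100 is assumed to carry an extra digit
    df.loc[df['Age'] > 100, 'Age'] = df.loc[df['Age'] > 100, 'Age'] / 10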
Also, something that may be useful is dealing with duplicates. We need to first define what's going to count as a duplicate value; this is usually a little bit more political, if you want: you have to define what a duplicate is. In this case, we have a series that contains ambassadors: the ambassador is the index, and the country of the ambassador is going to be the value. This is usually the important part. The idea here is that we are organizing a party, and we want to invite one ambassador per country; we don't want to repeat countries. So in this case, what's going to happen is that, to our human eyes at least, we can clearly and quickly see that these two belong to the same country, and these three belong to the same country. But here again, we have to define which ones are the duplicates, if you want, and which ones are not. For example, maybe we can say the first one is not the duplicate and the following ones are, or we can say the last one is the one we invite, so it's not the duplicate. So we're going to have political rules, if you want, for each one of those. So let's see the duplicated method and the way it works by default.
By default, the duplicated method is going to return True for duplicates, but it's not going to treat the first instance it sees as a duplicate. The method is basically walking top down, saying: do I have France? No, I don't have France; I'm going to keep it, because it's the first time I see France. Do I have the UK? No, I don't have the UK; it keeps it. Then it sees the UK again, realizes the UK is already there, already present, so this one is going to be considered a duplicate. Italy is here, it's fine. The first occurrence of Germany is fine, right there, Germany; but then it sees Germany two more times, and it realizes that Germany was already there, so those are now duplicates. That's the way it works by default. We can change that with keep equals last: then the last element is not considered a duplicate, and the other two are. And the same thing here: Kim, here, is the one considered a duplicate. So it's either top down or bottom up, depending on the parameter you're passing, either keep first (the default) or keep last. Or you can be a little bit harsher and say everything that is duplicated needs to be considered a duplicate: then these two are duplicates, and these three are all duplicates, as you can see right there.
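A sketch with a tiny ambassadors series; the names and countries here are just placeholders:

    ambassadors = pd.Series(
        ['France', 'United Kingdom', 'United Kingdom',
         'Italy', 'Germany', 'Germany'],
        index=['Araud', 'Westmacott', 'Darroch', 'Terzi', 'Ammon', 'Wittig'])

    ambassadors.duplicated()              # first occurrence kept, later ones True
    ambassadors.duplicated(keep='last')   # last occurrence kept instead
    ambassadors.duplicated(keep=False)    # every repeated value marked True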
Similar to the duplicated method, which pretty much tells you which values are duplicated (it helps you identify them), you also have drop_duplicates. What this method is going to do is basically the same thing as before, but dropping all the values that are marked True: if the value is a duplicate, it just drops it. And the same rules apply: first, last, and False. As for subsets: in this case, we have a data frame with multiple players, and what happens is that this player, Kobe, is present three times; to our human eyes, we see Kobe three times. The way we're going to think about duplicates here is by choosing the correct subset of columns that we should check. In this case, Kobe playing as SG appears two times, but Kobe playing as SF could be considered a different record, if you want, because maybe it's a different season, or a different position he played. So in that case, we need to pass the subset we're going to consider for duplicates: only check the name column, or check the name and the position columns; not passing a subset, which is the default, is going to check the entire row. And when we check only the name, these two are considered duplicates, so this one is a duplicate under that rule; if we pass keep equals False, both are going to be considered duplicates. And the last one is a completely different row, because the value in position is different. That's the way it works here.
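A sketch of drop_duplicates with a subset; the players data here is illustrative:

    players = pd.DataFrame({
        'Name': ['Kobe Bryant', 'LeBron James', 'Kobe Bryant', 'Kobe Bryant'],
        'Pos':  ['SG', 'SF', 'SG', 'SF']})

    players.drop_duplicates()                 # full-row duplicates only
    players.drop_duplicates(subset=['Name'])  # one row per name
    players.drop_duplicates(subset=['Name'], keep='last')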
Moving forward with more cleaning of values, we're going to talk about string handling. This is a very neat feature of pandas: special types of columns have special attributes. Given the column type (in df.info, this one is an object, which is a string), in pandas all the string columns are going to have this special attribute, str. All the datetime columns (something we're not going to cover, but you need to know) have a .dt attribute, and all the categorical columns have a .cat attribute. And those attributes, str, dt and cat, have special methods associated with the domain of that column: the methods under str are, of course, for string handling, and the methods under dt are for date handling. So in this case, we're going to review not all, but a very good subset of the string methods we can apply. And something interesting is that all these methods have a lot of resemblance to the ones in pure Python. If you have a pure Python string, there's a split method; there is a contains equivalent (actually, I think in pure Python it's the in operator); there is a strip, and there is a replace. So most of the methods under the str attribute in pandas have an analogue in standard Python string handling.
So, starting at the beginning with this data we have (I'm going to delete this): what we are going to do is split the values by an underscore. So in this case, that's what we have: we have split all the values on that underscore. And we're going to use the special parameter expand, expand equals True, sorry, and what it's going to do is create a data frame out of that. So we create a data frame with the split parts as separate columns, and this is what we have now. We can keep applying methods: for example contains, or contains with regular expressions, just for you to see the power of it. We can also strip and replace, and we can even do regular expressions with replace; so we could fix something like this question mark inside a string with regular expressions, if you know how to handle them.
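A sketch of those str methods, assuming a string column called Data:

    df['Data'].str.split('_')                       # lists of parts
    df['Data'].str.split('_', expand=True)          # parts as data frame columns
    df['Data'].str.contains('US')                   # boolean series
    df['Data'].str.strip()                          # trim whitespace
    df['Data'].str.replace(' ', '')                 # remove spaces
    df['Data'].str.replace(r'\?', '', regex=True)   # regex: drop question marks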
And finally, something that is going to be very helpful when you're doing data cleaning is looking at the data from a visualization perspective. Data cleaning has a ton to do with the statistical understanding of your data: when a value is considered an outlier, for example, it might be invalid and you want to clean it. But that's a lot more about statistics. In this case, I want to show you, very quickly, the matplotlib library I've been promising for some time now. So far, we've accessed it indirectly from pandas, doing a data frame dot plot: this library, matplotlib, is the one backing all those methods, and now we're going to see how to use it directly. The matplotlib library has two important APIs, as we're going to call them. One is the one that I don't prefer, which is the global API, but it's the most common one, the one you're going to find around. And the second one is the object oriented API, which is around here. They are just two different ways of doing the same thing. The global API is an API that is in part inspired by MATLAB; it's been around for a long time, and sadly, most of the answers you find in Stack Overflow, tutorials and books will be using this global API. The one I prefer the most, and I'm going to explain why in a second, is going to be the object oriented API. But I want to show you both, so you have a reference: if you follow me in this preference for the object oriented API, you will always have to translate global to OO. Why is it considered a global API? Well, we have imported matplotlib.pyplot as plt; so we have imported the whole module, the whole Python module (depending on how much you know about Python programming, this is going to make sense or not). We have imported the whole module, and now what we're doing is invoking plt.figure, and then we're setting a title, and then finally we're plotting two things: x squared and minus x squared. And why is this global? Because we're invoking functions that live at module level, and there is an object, the final plot, that is being modified by these very generic, global calls: by making this call right here, I'm modifying the final result of the plot.
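That simple global-API example, sketched:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.arange(-10, 11)

    plt.figure(figsize=(12, 6))   # module-level calls mutate 'the current figure'
    plt.title('My Nice Plot')
    plt.plot(x, x ** 2)
    plt.plot(x, -x ** 2)
    plt.show()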
Let me show you a more complicated example, so you can see the problems with the global API. If you look at this line (let's actually delete everything): what is this line doing? Which plot is it affecting? You do not know. There is no object oriented way of saying, in this second plot, the plot on the right (or actually the subplot on the right), I want you to plot this thing. You're just saying it to the entire module, and depending on the order in which you say it, that's where it's going to land; that determines which subplot that particular figure lands in. Again, it's a global API. So we start by saying: I'm going to create a figure, trust me, so from now on I'm going to start drawing on it; this is going to be the title; and hey, by the way, it's going to have one row and two columns, and I'm going to start drawing in the first plot, this one right here on the left. So now I have kind of activated that plot, it's active, so I'm going to start drawing on it: every action that happens after this line is going to affect this plot. So then I plot x and x squared, I plot this vertical line, I put a legend, I set labels, etc. And at some point, I just stop and say: hey, now I want to switch plots, I want to start plotting here in this second one, and I do that with this line. The way it works is by saying: first row, two columns, second plot. So from now on, every successive line will affect that subplot. And again, you can see that understanding this code, given the order and the sequence of the lines, is very hard. If you have to debug a report that has a plot that takes 100 lines, then you have to keep in your brain what's happening, top down. A different approach is going to be the object oriented approach, in which we're creating a figure, and we're creating axes.
So in this case, we have right here one entire figure, in red, and here, in purple, we have two axes: this is axis one, and this is axis two, so we have two axes. We're going to create those using an object oriented approach, and we're going to keep references to them, so we can say later: on this plot, on this axis, sorry, I want to plot something. And that will be very explicit; it's going to be an object oriented way. So the first thing is creating the figure and the axes. In this case, we have just one axis, that's it, but you can have more, and then you say: in this axis I want to plot this thing, in that axis I want to plot that thing, etc. When you have multiple axes (I'm going to come back to that in a second), like in this case where we have four axes: we create one figure, and it has four axes. We do it with this subplots method, passing the number of rows and columns. Now we say: on axis number one, I want to plot this thing; on axis number two, I want to plot that thing. So it's 1, 2, 3, 4, and now it's a lot more explicit; it's not depending on the order. I could change this order and it doesn't matter, the results are going to be the same: axes number four has yellow, regardless of the position in the code.
map will live. And now that we have clear out the differences in both API's, maple leaf
has this very simple plot function, or method, depending on sugar enter global, that we'll
plot something you specify. In this case, we're passing all the values in x and all
the values in y. And in this case, we're passing a given line style, this can change with these
type of syntax, you're saying, I'm plotting this thing in X, I'm blowing this thing in
y second parameter and why. And I want you to use a straight line, it's a straight line,
yes, with this marker, the dot and in green. So this is if you are very familiar with it.
If you're very familiar with my bullet you can use to send links in other games, you
can just say line style market marker, sorry, color specific keyword arguments for each
one of those. So do we only have line plots in APA live? No, of course not. We have a
huge variety of plots. And by the way, there is another one here, if you want to see more
events are grids, you can create these grids and put different things in it. And again,
not only land plots, one good example is a nice scatterplot. So basically, we're plotting
X and Y correlation. And there is also our value, our color map, right. So given the
volume, there is going to be a change in color. So these kind of lets you plot three to four
dimensions of your data, the volume x, the volume, y, the size of the bubble, and the
color of the bubble. So where you're pretty much encoding four dimensions in just one
figure, right. So in this case, we're just using two different scatter plots, there's
more information here, we can also block histograms, that we've very quickly seen that with pandas
with pandas is, is very simple with just plot type histogram, current histogram hist, actually,
you can look it up in our previous lessons. So just go back into the index in the video.
And the histogram is extremely simple just takes the valleys you're plotting and how
many bends you want, or some more advanced arguments here, like the alpha level, etc.
But it's simple. And similar to the histogram, you can also create kernel density estimator
diagrams, which is very similar to distance to simulate if you want a continuous distribution.
You can combine these plots if you want, in this case, we are creating the plots were
plotting a histogram. And they were plotting the lines and they were plotting our changing
limits. But that's pretty much it. And you can also create bar plots, right? So in this
case, we have PLT dot bar, or here we have two bars are stacked, right? That's the different
way to look at it. And finally, check in outliers. You can always plot histograms or box plots,
right? So box plots are also a nice feature to have in here. So this was all with data
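A couple of these chart types, sketched with random data (np and plt come from the earlier imports):

    values = np.random.randn(1000)
    plt.hist(values, bins=30, alpha=0.8)   # histogram: values plus a bin count
    plt.show()

    x = np.random.randn(100)
    y = x + np.random.randn(100) * 0.5
    sizes = np.random.rand(100) * 200
    plt.scatter(x, y, s=sizes, c=y, cmap='viridis', alpha=0.6)   # x, y, size, color: four dimensions
    plt.show()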
So this was all for data cleaning; we're going to keep moving forward with this tutorial. I want to mention one more thing here: there are notes here for kind of a task that you can follow with data cleaning, where we are identifying missing values in given positions with isnull and isna. And right here, we're looking in more detail at some statistical properties of the data, in case we need to clean it. Okay, so this is a little bit more advanced, and it's related to the concept of cleaning data given the domain. The statistical analysis can tell you that a value is an outlier for a given distribution, while the value itself might be valid. For example, a human being that is 90 years old: that's valid, that's a valid age. But if you're analyzing data about high school students, a human that is 90 years old is going to be completely invalid, or it's going to be an outlier in that distribution, and you can treat it as such: call it invalid and clean it out, remove it, for example. So that deals a little bit more with the whole statistical analysis, which you can follow here; it's a little bit more advanced for this scenario. So let's move forward with the rest of the videos.
Now it's time to get into more advanced features of pandas to import external data. We've seen already, in our real life example, the way we can import data from CSV files and from SQL databases; we actually had those two lessons. The objective of this part of the tutorial is to show you how you can get into more advanced use cases of importing data. So we're going to start, for example, with CSVs and text files. And again, you've seen this already, but here we're going to give it an extra twist, and I'm going to show you more advanced features for special use cases. Text files, CSV files: conceptually speaking, a CSV file is a text file, just human-readable text that is encoding information. The idea of a CSV file is that it's tabular: it's a plain text file that contains tabular data, and it's separated. CSV stands for comma-separated values, but the separator can actually be anything; we'll see more examples later. Basically, the idea is that it's a text file holding data in a tabular format, and both CSV files and text files will be read with the same method. To get things started, I want to show you the basic way we read data from external sources using Python, without even starting yet with pandas. You don't strictly need to know this, but it's usually productive, if you want, for data scientists or data analysts to understand a little bit more about how file reading and writing works in computers, because there are multiple concepts aligned here: the operating system, processes, your language. It's not the same thing to read a file with R, or with Python, or with another language; there are multiple concepts involved. And even though pandas, in this case, can make it very simple to read and write data, you can get to a little bit more advanced use cases if you know the internals of, again, both the operating system, processes, and your language.
So this is the way we read data from a file using pure Python: we use the function open. And in this case, we're using a context manager, which is a safety feature, again related to the more advanced usage of reading and writing files. It creates a file pointer, right? And with that file pointer, you can then use the very simple API it exposes, which is something like readline, readlines, reading a number of bytes or characters, or you can even treat fp as an iterator and just do a for line in fp. Basically, we're going to do something like this: we'll start reading data from top to bottom, up to some limit we've chosen; in this case, we're doing it for just a couple of lines.
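A minimal sketch of that pattern, assuming a hypothetical data.csv file in the working directory:

```python
# The 'with' context manager guarantees the file is closed, even on errors
with open('data.csv') as fp:
    first = fp.readline()        # read a single line
    print(first)

    # fp is also an iterator: loop over the remaining lines, top to bottom
    for i, line in enumerate(fp):
        print(line.rstrip())
        if i >= 1:               # stop after a couple of lines
            break
```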
It gets very difficult to process text files by hand, because it's usually hard to parse the structure of the file. It's not the same thing to have a file that is separated by commas, by colons, by pipes, by spaces, etc. So you're going to see that once you want to get into more advanced usage, or slightly fancier calculations, the way you parse the data is going to get harder. That's why we're going to use pandas, as I'll show you in a second, or the csv module that is part of Python's standard library. So this is the file that we're going to be reading: it's the exam review file, and I'm going to open it. Even though it doesn't look like a CSV, conceptually it still is one; the difference is that here the separator is the greater-than sign, not the comma. That's what marks the delimitation between the different fields in our file. So we're going to use the csv module.
The way to parse the data using that module is by passing a special delimiter, right? That's the type of work you might need to do when you're parsing data yourself: it's not the same thing to have a delimiter that is a greater-than sign, or to have numbers, for example, that are enclosed in quotes. All those things will change the way you work, and all of that is going to be abstracted away by pandas.
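Here's a minimal sketch of that, assuming the exam review file uses '>' as its separator (the file name is illustrative):

```python
import csv

# The csv module handles the splitting and quoting rules for us;
# we just tell it which delimiter this particular file uses.
with open('exam_review.csv') as fp:
    reader = csv.reader(fp, delimiter='>')
    for row in reader:
        print(row)   # each row comes back as a list of strings
```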
To get things started again with pandas: pandas has multiple read_something methods that work for different sources, right? We already saw read_sql and read_csv; there's also read_html to directly parse information from a table (you literally just pass a website and it's going to read the information from a table), read_json, and more advanced formats like Parquet, Stata, etc. Again, each file format will usually have a corresponding method in pandas; I've never had to write my own parser, to be honest. The same thing is going to happen with something like Excel, which might need external modules; it's not directly provided by pandas, but by installing those modules, you can easily incorporate Excel files into your day-to-day work. The read_csv method alone has a ton of parameters. That's the main characteristic of all these read_something methods: given the number of possibilities you're going to have with these files, there exist a ton of different ways to customize the method invocation. So again, with CSV files, multiple things can happen: a CSV can have a header or not have a header, different delimiters, different enclosing of strings or numbers, blank lines, etc. And you're able to customize all of that with the read_csv method. This is the reference of all the attributes you can pass to it. Something that I do very often, and I use pandas a lot, is to pull up the documentation of read_csv right there in the notebook, to look into the parameters that I think I need for my particular use case. So always keep an eye on the docs, because it's impossible to remember all the parameters of read_csv.
In this case, what we're going to do is something very interesting: we're going to parse a CSV file that is not located on this computer, not locally available. The CSV file is this one right here, and if I get the raw version, it looks like this. What I could do here is download the file, right? Just do File, Save, get the CSV file on my computer, and upload it here, or just drag and drop it. But pandas actually has this nice characteristic that read_csv will accept either a local path, as we did with the BTC market price file, or a remote URL: it's automatically going to download the content of those files and keep it in memory for further usage. So that's a very neat feature. And again, this is the CSV file we are using; if it were a local file, it would work in the same way. So, a few features you've seen already:
in this case, we can do header=None if we don't want to treat the first row as a header. Or what about missing values? We can treat some values, like a question mark, an exclamation mark, a dash, etc., as not a number, not a value, right? A missing value. Any of the values we pass there will be transformed into NaN for an easier cleaning process. We can pass names, which are basically the column names, and we can also specify column types, as you can see right there; so now the types are going to be float and object. We've done this already in one of our lessons: we are parsing the timestamps, and there you go. So, putting it all together, we get to these advanced forms of reading CSVs where we're passing column names, passing types, asking it to parse dates, passing NA values, headers, etc. This is a pretty common thing to do, as sketched below.
So what about the exam review file? If we try parsing it naively, we get this very ugly format. In this case, there is a parameter to specify what we used to call the delimiter in the csv module; in read_csv it's called sep. So the separator is going to be the greater-than sign, and then it just works as it should. There are a few more examples you can check here; the most important part is following the documentation to find the particular use cases you are hitting. For example, something like skip_blank_lines; or, whenever there are empty rows at the beginning, you can also say skiprows, so those don't break the parsing. That is all part of read_csv.
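For instance, a sketch of reading that '>'-separated file, with skiprows thrown in for files that start with junk rows:

```python
import pandas as pd

exams = pd.read_csv(
    'exam_review.csv',
    sep='>',        # read_csv calls the delimiter 'sep'
    skiprows=2,     # skip, say, two junk rows at the top (if the file had them)
)
```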
And to finalize this part, at least for CSVs, I'm going to tell you something that applies to pretty much every other data format: just as you have a read_something method, there's going to be a to_something method, which is basically the writing process. So you can do read_csv, or you can do to_csv. This CSV that we imported from the remote source, we can just do to_csv and it's going to store it locally. And there are multiple options to pass: the separator, whether you want to include a header, whether you want to include the index, etc. They're pretty much the same as for reading.
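A quick sketch of the writing side, with a hypothetical output name:

```python
# Write the DataFrame back to disk; the options mirror read_csv's
df.to_csv('btc-local-copy.csv',
          sep=',',       # the output separator
          header=True,   # include the column names
          index=False)   # skip the index column
```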
The idea is that for every read_something method, there exists a to_something method for writing. So let's move forward with a few more data formats. And, interestingly, we're going to get to read HTML pages directly
in just a couple of minutes. And now it's time to read data from databases. We have already done that in the real-example part of the tutorial, but I want to show you a few more details so you understand how the data is being processed; importing data from databases is a common scenario for me. First thing, the libraries you will need: depending on what database engine you're using (Postgres, MySQL, Oracle, etc.), you will need to install different libraries. But the APIs, once you have installed those libraries, are going to be the same. There's actually a PEP from Python that defines the interface for database libraries, and pandas can work with pretty much any SQL database whose driver complies with that interface. In this example, we're going to use SQLite, because the database is right here: there's no server to connect to, etc., so it's extremely simple to get started. And the database example we're going to use is actually a different one from our previous video; in the previous one we were using Sakila, and in this case we're going to be using Chinook, which is smaller both in structure and in size. So it's going to be a little bit simpler.
To get things going here, the same as we did in the previous part about reading data from files, I'll show you how to read the data using plain Python first. So forget about pandas for a second; I told you, if we go back to the beginning of time, there was no pandas, and this was the way we were doing file access: open, fp, readlines, etc. I now want to show you what predates pandas, the default way to read data from a database, which is with the regular interface from Python. The way it works is: we're going to import sqlite3, and we're going to create a connection. Now, with this connection, we have this common interface that, again, is common for pretty much any other database you're used to. The default behavior is that we create a cursor, and we execute queries using that cursor. In this case, we're going to execute a regular SELECT * FROM employees LIMIT 5, to get five records out of the employees table. Once you have executed a query, it's sitting there waiting; you can do a fetchall to get all the results of that query. And here are the results. As you may notice, the result is a list of tuples, so it's not extremely useful. Now, if you combine it with pandas, you can just create a DataFrame out of that info, and we're close; it's not perfect, but we're close. Before moving on, we're going to close the cursor and the connection.
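Summarizing that raw flow as a sketch, assuming a chinook.db SQLite file next to the notebook:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('chinook.db')
cur = conn.cursor()

cur.execute('SELECT * FROM employees LIMIT 5;')
rows = cur.fetchall()          # a list of plain tuples, one per record

df = pd.DataFrame(rows)        # close, but no column names or types yet

cur.close()
conn.close()
```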
Let me show you now how we work with pandas. Just as we have a read_csv method, we also have a read_sql method. In this case, the first parameter this method receives is the query that we're passing, and the second parameter is the connection object, used by pandas to actually issue the query. So it gets as simple as writing the query, and now everything has been imported into a DataFrame, including column names and all that. If you want to get a little bit fancier, you can specify the index column, which is of course going to be used as the index, and also what types to parse for a specific column. So now we have pretty much all the work done. We're going from something very manual, processing things with a cursor, etc., which might also be slow, to using pandas to directly import data from the database.
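A sketch of the pandas version of the same idea; the index and date columns follow the standard Chinook schema, so verify them against your copy:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('chinook.db')

df = pd.read_sql(
    'SELECT * FROM employees;',
    conn,                          # the DB-API connection object
    index_col='EmployeeId',        # use this column as the index
    parse_dates=['BirthDate'],     # parse this column as datetimes
)
```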
There is actually a caveat here that I'm going to tell you about; it's kind of a deep detail of the way pandas works: the read_sql method is actually a shell for two other methods, read_sql_query and read_sql_table. When you're using read_sql, it's actually forwarding the work to either the query or the table variant. read_sql_query is the default behavior, what we've done so far: it's just going to issue a query over the connection and read it for you. In contrast, read_sql_table reads an entire table: you just pass a name, and it's going to automatically give you all the information for it, in this case all the column names, etc. So it's a lot simpler for reading an entire table. The only thing to keep in mind is that to use this method, you need to install SQLAlchemy and use connections generated from it. So in this case, we create an engine, we create a connection object from it, and now we can pass that connection object for pandas to do the work. Again, it's pretty much the same: if you find yourself doing SELECT * from this table, SELECT * from that table, it's a lot easier to just use read_sql_table, as sketched below.
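A minimal sketch, assuming SQLAlchemy is installed and the same chinook.db file:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///chinook.db')

with engine.connect() as conn:
    # No SQL to write: pandas fetches the whole table, names and all
    employees = pd.read_sql_table('employees', conn)
```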
Just as the read_csv method had a to_csv method, the same thing happens with read_sql: there is a read_sql, and there is a to_sql. What it's going to let you do is take a DataFrame and write it into a database table directly. So it's also going to receive the connection, right? to_sql receives what the table name for this DataFrame is going to be, and a connection object. Now, something to keep in mind is that to_sql has an important parameter, which is what happens if the table already exists; by default, it's going to fail, just throwing an error when you try to save data to an existing table. And this makes sense, because as data analysts we're usually reading data and processing it, not so much writing it, so we want to make sure nothing is overwritten by mistake. But if you do actually want to write the data, you can just change this if_exists parameter to something like replace or append. Usually we're writing to intermediary tables. Again, you can choose either to replace the whole content of the table (be careful here), or to append, which just writes at the end of the current table.
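A sketch of the writing side, reusing the SQLAlchemy engine from the previous sketch and a hypothetical table name:

```python
# if_exists='fail' is the default: an error is raised if the table exists,
# which is the safe choice for analysts who mostly read data.
employees.to_sql('employees_copy', engine)

# Opt in explicitly when you really mean to overwrite or extend:
# employees.to_sql('employees_copy', engine, if_exists='replace')  # drop and recreate
# employees.to_sql('employees_copy', engine, if_exists='append')   # add rows at the end
```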
So that's it for to_sql, and this was the way to read data from databases. Of course, we're not touching on SQL itself and all that, which is a lot more advanced; the point is that if you already know SQL, if you're already working with databases, you can pretty much copy and paste what we're doing here, and you're going to get your data imported into Python. So let's move forward to read some HTML files. And now, very quickly, I'm going to show you how
to read tables into DataFrames directly from HTML web pages. To be honest, this is a simple method, just read_html, but it depends a lot on the structure of the web page. If the page is not well structured, or the tables are not correctly created, you're going to have issues and will have to do a ton of data cleaning. In my experience, whenever I try to parse a table from a well-structured site like Wikipedia or some stats site, it usually works very well. And it's a very quick way of hacking: whenever you have a question, like, I don't know, I need the GDP of countries, instead of looking for a GDP dataset you can just go to the Wikipedia page; there is usually a table there, you can directly parse it, and you are done. So again, it's a relatively simple way to get some data for quick hacking and exploration. The way it's going to work is:
we have this HTML we created, just for testing purposes. To get started: usually, of course, you will try to read something from a live website, so you're going to pass the URL to the read_html method, and read_html will download the content of the page and parse it. But let's suppose we have the HTML content already, and this is what it looks like; this is exactly the same HTML we have on top, I'm just displaying it here in the notebook. What we're going to do is invoke the method, read_html. The read_html method is going to parse the entire HTML and look for multiple tables, not just one; a site will potentially have multiple tables even if you don't see them, since tables are a common way to structure things in HTML. That's why it's going to parse multiple tables. In this case, we stored them all in dfs, a list of multiple DataFrames, and we see that there is only one, so we're just going to take the first DataFrame. And it has correctly parsed what we had before, working in the same way. The same is going to happen with, for example, headings and all that: if the table has a proper header, it's going to pick it up automatically. So that's pretty much as we know it already.
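A minimal sketch of that flow; any page with tables works, Wikipedia is just an example:

```python
import pandas as pd

url = 'https://en.wikipedia.org/wiki/The_Simpsons'
dfs = pd.read_html(url)      # downloads the page and parses every <table>

print(len(dfs))              # read_html always returns a list of DataFrames
df = dfs[0]                  # pick the table you care about
```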
In this next case, what you're going to see is what I told you before about the data cleaning process: this table does not have a header like the previous one, which had a thead element; in this case, the header is just another row. That's why read_html is going to have issues, and you have to provide a little bit of extra information. So let's see another, more realistic example, and we're going to parse data directly from a website. Let me tell you here, just for educational purposes: you always need to check whether the data is public, so you can actually parse it. For Wikipedia, at least, the content is Creative Commons, so you can get your hands on it. What we want to show you here is a very complicated table that has multiple headers, etc.; that's why we're using this example. So we're going to get the URL and directly do nba_tables = read_html(url); the only table on this page is this one, the large one. So that works. And now nba is going to be that DataFrame, and we see that all the players in this case have been parsed. What about something else? Let's actually open this page right
here, to Wikipedia, for the Simpsons. And here we will probably find several tables; see, we have one right here, this one. So I'm going to import it. We get 27 tables; again, you don't see them all, but they are there. And the most important one, the one we care about, is this one right here. The problem you're going to have with this table is that it's using both colspans and rowspans. In this case, this column here spans one to three columns, and this row here spans at least three rows. Those spans result in this very ugly DataFrame, and you will need a little bit of extra cleaning. That's what you're probably going to find with HTML tables: usually there are things that are formatted for humans, not for machines. For example, in this case, we have this header repeated: when you parse the data, you're going to find that every 20 rows there is going to be a header row, and you will have to clean it, in this case by dropping those rows. You will do something like df.drop. Let's see; actually, I haven't tried this, but let's just do it with head, and you're going to find 25 records now. So here, at record 22, we find that header. What we're going to do is something like df.drop with a range starting at 22, going up to df.shape[0], and stepping every 20 rows or so. I don't know if this is going to work, let's just run it. Oops, it didn't even work, it didn't even compile; oh, this is the nba DataFrame actually. There you go. So maybe it works, you can check it.
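Something along these lines is what that live experiment was aiming at; a sketch, untested, assuming the DataFrame is called simpsons and the repeated header really does sit at index 22 and then reappear every 20 data rows:

```python
# Sketch only: drop the header rows Wikipedia repeats inside the table body.
# Assumes the first repeated header is at index 22 and then reappears every
# 21 rows (20 data rows + 1 header row); verify against the real page.
simpsons = simpsons.drop(range(22, simpsons.shape[0], 21))
```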
But what I'm going to say is, again, there is some cleaning to do, because HTML pages are optimized for humans, not for machines, so it's usually going to take a little bit more time. The good news is that there is usually an associated service you can consult; for example, there is a Wikipedia API that you can use instead of scraping a page. But again, sometimes it's just easier to pull the data directly from Wikipedia. So that's it. You can also write data to CSV or, of course, to HTML; that's pretty much the standard. As we've said, this is all we had for the reading-data portion, and we're going to move forward now with a few other methods, especially what we call data wrangling. We're going to do a little bit of grouping and keep
moving forward with our tutorial. We have decided, kind of last minute, to add one final source of external data, which is going to be an Excel file. It's just a common Excel file, you know it. Because we imagine that you might come from an Excel background, you can just export the data you have in your Excel spreadsheets, load them into a Jupyter Notebook, and start working with them with pandas, so you can try things out and kind of draw the parallels between Excel and what you do with pandas and Python. So the first thing is: an Excel file is not a text file. If you try getting the raw content of it, it's not so simple to parse. That's why it's going to require external tools; they are already installed in Notebooks AI, but it depends on your computer how you're going to install them. So just keep in mind that there might be issues when importing data from Excel if there is low compatibility between the library you're using and the spreadsheet version you're using. But without
getting into those details, there is the read_excel method, which pretty much takes care of everything for you. It has different parameters, like defining the sheet that you're reading from and, of course, the path, etc. We're going to start reading this file, which is a products file that has three sheets: products, descriptions, and merchants. It's actually something we use in our Data Analysis from Excel to Pandas course, to show how to merge data and all that. From this file, what we're going to do is just read_excel. And what you're going to see is that it reads the first sheet of the Excel file; I mean, a DataFrame just corresponds to one sheet only, right? And the first one is products, so that's what we are reading. There are different behaviors for it: you can change the way you parse headers, etc., and you can also define a specific index; that's pretty much everything we have seen so far. Selecting specific sheets is simple: just pass the sheet name, and you can read either products, merchants, or whatever is available in the current Excel file.
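A minimal sketch, with the file and sheet names from this example treated as placeholders:

```python
import pandas as pd

# By default read_excel loads the first sheet only
products = pd.read_excel('products.xlsx')

# Pick a different sheet, and tweak parsing while we're at it
merchants = pd.read_excel(
    'products.xlsx',
    sheet_name='merchants',
    index_col='merchant_id',   # hypothetical column used as the index
)
```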
There is also a more specific class that is a little bit more advanced: the ExcelFile class. So instead of read_excel directly reading the Excel file into a DataFrame, as we were doing here, you're going to instantiate this ExcelFile class with the parameter being the file name. And now this object has a reference to everything the file contains. In this case, we can do, for example, sheet_names, and it's going to tell you: products, descriptions, merchants. This allows for a little bit more exploratory analysis: let's say you can't use Excel itself to see the contents of the file, this is going to be helpful; you first parse the Excel file, get the sheet names, and gain a bit more understanding of it. And now, from this file we have previously instantiated, we can parse the products sheet, and that's going to get you that DataFrame. The same thing is going to happen with all the parameters we can pass; they are the same as read_excel's.
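A sketch of that more exploratory flow, with the same placeholder file name:

```python
excel_file = pd.ExcelFile('products.xlsx')

print(excel_file.sheet_names)        # e.g. ['products', 'descriptions', 'merchants']

# parse() accepts the same parameters as read_excel
products = excel_file.parse('products')
```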
Finally, there is a to_excel method, and it works pretty much the same way as to_csv: you decide if you write the index or not, and you can also define a sheet name, or it's just going to be the default one. So as you can see, getting your data from an Excel file into a DataFrame is extremely simple. There are more customizations available: let's say your data is shifted in the file, either by rows or by columns; you can handle that with startrow or startcol, and that's going to work too. So that's pretty much the only thing we need. If your writing process is a little bit more complicated, for example if you want to write specific sheets in a multi-sheet Excel file, you can use what we call an ExcelWriter, which is also part of pandas: you instantiate the writer, and then you can start the writing process saying which sheets you want to write, with each one of those DataFrames. So again, reading and writing data from and to Excel files is relatively simple.
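A minimal sketch of both the simple and the multi-sheet writing paths, with placeholder names:

```python
# Simple case: one DataFrame, one sheet
products.to_excel('out.xlsx', sheet_name='products', index=False)

# Multi-sheet case: one writer, several sheets in the same file
with pd.ExcelWriter('out_multi.xlsx') as writer:
    products.to_excel(writer, sheet_name='products')
    merchants.to_excel(writer, sheet_name='merchants')
```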
It all depends on the libraries you have installed in your current environment, whether it's Windows or Linux/Mac; the documentation of pd.read_excel might have more details for the given platform that you have. If it's not there, it's going to be in the pandas documentation, but there might be a requirement for each one of the platforms that pandas supports. So just check it out for your own platform, whether you're on Windows, Mac or Linux, and how to get those libraries installed. So, in case you're just getting started with
Python, and you might come from another language, the objective of this quick section is to show you Python, ideally in under 10 minutes; I think it's going to take a little bit more. It's a very, very quick reference of Python: just the high-level features of the language, how to use it, how to code functions, how to import modules, variables, data types, collections, etc. You can just scroll through this notebook if you want to take less time; I will be providing an explanation on top of all the topics, but it's a very good reference of the entire language. So, to get things started: Python is an old language, period. It has caught more attention in the past five to ten years, but it's a very old language, even older than Java; it appeared in the 1990s. It was created by Guido van Rossum, who is an important actor in our ecosystem; he used to be, and I think still is, the one deciding discussions when it comes to defining features of the language. Python is a high-level, interpreted, dynamic language. And this means a ton, actually, if we read this entire sentence: interpreted, high-level, general-purpose. It's basically a high-level programming language; it's object-oriented, and it also includes functional features, like functions as first-class objects, etc. And of course, it supports imperative programming. It has a wide variety of applications: you can do web development with Python, you can do scripting, it's used a lot for systems work, for configuring machines in general, and of course you can also do data science. It has a couple of interesting features, like indentation for defining blocks, that make it a very good language to get started with programming. So if Python is your first language, you should be comfortable with it; it's a very good choice. For me, it wasn't my first language, and I wish it had been. But I have taught people programming with Python as their first language, and seriously, it's always been very good for them, because Python doesn't have weird things like you might have in JavaScript or Java. It's a very concise and consistent language, to be honest. So let's get started very quickly. First of
all, you can install Python on your own computer, or you can use Notebooks AI or Google Colab. If you're installing it on your own computer, you might see that you can install either Python 2 or Python 3; or actually, if you're reading tutorials online, you might see both Python 2 and Python 3. The reality is that Python 2 was deprecated in 2020, so you should not use it anymore. There are still ways to install Python 2, but it was deprecated, so you should stick with Python 3, which is the evolution of the language: a ton of fixes to the ways things happened in the language that used to confuse beginners. That's no longer a problem. Python 3, again, is what you should use. You will read multiple tutorials, etc., where they are using Python 2; you should try using Python 3. Sometimes the code will break, but the changes to fix it are not very hard.
So to get things started here, I will be drawing parallels with more traditional syntaxes. For example, this is the way you would define a function in, say, JavaScript; it's also very similar to C or Java-based languages: the function keyword, curly braces, etc. So I will be drawing parallels with these sorts of languages. To get things out of the way: defining a function in Python looks like this, and the main characteristic of the language is that the way we define blocks is by using different indentation levels. So this is a valid function in Python: def is the keyword we use, then the name of the function and the parameters it receives, and the way to define the body of the function is by just indenting everything one level to the right, usually just four spaces.
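For example, a tiny sketch, with the JavaScript equivalent in a comment:

```python
# JavaScript:  function add(a, b) { return a + b; }
def add(a, b):
    return a + b     # the indented lines are the function's body
```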
Another example is an if-else statement: if this thing happens, do that, else do something else, right? That's the JavaScript version. In Python, again, it's defined by indentation: if this thing happens, we indent one level to the right and do this; else, do something else. If there were another if statement here, say, if the language ends with something like '3', then do something else, print 'py3' for example. So we indent everything to the right every time we start a new block, and the block finishes just when you dedent back again to the level of the first block. That's the way it's going to work: by indenting our blocks. This is very good because, first, we don't have debates about where we should place the curly braces, and also because it makes the code a lot more readable; there is obligatory indentation just to make the code work. So you can see, that's just how it works.
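A small sketch of that, using a hypothetical language variable:

```python
language = 'python3'

if language == 'python2':
    print('deprecated')
elif language.endswith('3'):     # Python's 'else if' is spelled elif
    print('py3')
else:
    print('something else')
# back at this indentation level, the if-block is over
```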
How are we going to write comments in Python? Just by using the hash symbol; there we go. And the way to define variables is just by specifying the name: Python is a language where you don't need to declare variables, you declare and define everything in just one pass; you define a variable as you go. Python is dynamically typed, but it's also strongly typed, and this might cause confusion; basically, you can assign any value you want to a variable, and you will see that collections, etc., are heterogeneous in terms of types. It is a very dynamic language. Talking about types, I'm going to show you the most important
types that we have in Python. First we have numbers, of course: integers. We don't have as many variants as you might find in other languages, like different precisions, etc.; we have integers. There was also the concept of longs, but that changed between Python 2 and Python 3; in Python 3, to be honest, we just use integers, that's the way we work. It's a smart enough type to save storage when needed, so that's good. We also have floats, which is the regular float type for floating-point arithmetic in other languages. And of course, it suffers, if you want, from the strange behavior of floating-point arithmetic, like in this case; you can prevent that by using the decimal module, which, as you can see, doesn't suffer from this issue. So for numbers we have integers, floats, and also decimals.
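The classic floating-point surprise and the decimal workaround, as a quick sketch:

```python
from decimal import Decimal

print(0.1 + 0.2)                         # 0.30000000000000004, a binary float artifact
print(Decimal('0.1') + Decimal('0.2'))   # 0.3, exact decimal arithmetic
```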
Strings are just the type str, and they are defined literally, as you can see right here: you can just type the string as it goes. There was a difference already in Python 2 between Unicode strings and regular strings, etc.; in Python 3, that has all been fixed, so in Python 3 this is all Unicode. There is still the conceptual difference between something being Unicode code points, as this string is, and the underlying encoding that will turn it into binary. So in Python 3 we still have ways to differentiate between a binary string and a text-based string, but you shouldn't worry about it; I just want you to know that if you're reading a Python tutorial, for example, you might find a distinction between Unicode strings and regular strings, which is no longer something we should be worrying about. If you have a string that is too long and spans multiple lines, you can always write it using three quotes, either double or single. So creating multi-line strings is extremely
simple. Booleans: the two Boolean objects are unique, right? Each is kind of a singleton, the True and False objects, and they are of type bool. There is also the concept of null in Python, which is None; we don't have null, we have None, but it serves pretty much the same purpose. In Python, everything is an object, so even strange objects like None have an associated class; everything in Python is an object. So, all these types you have seen: for example, we have this string, and its type is str. You can use the int, str, float and bool types, which is what the type function returns, also as functions. So in order to cast, in this case, a string into an integer, you will do it using the int function, which is the same type you get back from type(); this is the same as this, as you can see.
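A tiny sketch of types and casting:

```python
print(type('42'))      # <class 'str'>
n = int('42')          # the type itself works as a casting function
print(n, type(n))      # 42 <class 'int'>
print(float('1.5'))    # 1.5
```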
So, functions again: def is the keyword we use; we don't use function, we use def (you can use "define" as a mnemonic), then the name of the function; parameters are optional; and finally we have the return keyword. You should usually include a return: 99% of the time, a function should return something, because that's going to be the result assigned once we invoke the function. This is pretty regular. If your function doesn't return anything explicitly, meaning you haven't written a return statement anywhere in it, the function will still return something: the fact that you haven't included a return statement explicitly doesn't mean the function isn't returning anything implicitly. It is returning something: it's returning None, right? By default, if you don't include a return, Python does this for you. Just so you know: a function always returns something.
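A quick sketch of that implicit return:

```python
def greet(name):
    print('hello', name)   # no return statement anywhere

result = greet('Ada')      # prints: hello Ada
print(result)              # None -- the implicit return value
```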
Specifying parameters and passing parameters is pretty standard. Python has some advanced features with parameters, like, for example, variable-length arguments (we can pass as many arguments as we want, to make it very dynamic), keyword arguments, named arguments, etc. All the arithmetic operators you know already: addition, modulus; in this case, we're doing a power operation. All this is pretty standard, and the same thing happens with all our Boolean operators: greater than, greater or equal, etc. Then there is type checking.
This is where the strongly-typed feature shows up: even though Python is dynamically typed, the types are enforced. In this case, you cannot compare a 2 with a string; it doesn't make any sense, and Python is going to complain about it. So this is an example of an error in Python: the exception TypeError was raised. The same thing happens with booleans and the not, and, or
is defined with an indentation level. Python includes if else and also l F, which is very
convenient. And this is an example If this happens, Elif, Elif, etc. Python does not
have a switch statement. For example, loops, how are you going to loop through something
in Python loops on lists, or collections in general, are very interconnected.
Because in reality, when you're looping the Python, you're not doing a regular in Python,
we don't have something like in, in Java, you're gonna have something like int i equals
zero. What else I it's been decades. And I this is I haven't
coding in Java. So I, I don't know, minus 10, less than 10 less than 10. And here we
do I put last There you go. So we don't, we don't have these in Python. We have a way
to mimic it, but in Python we always iterate over a collection. So what we're going to do is create a range of elements, and we're going to iterate over it. The way it works is very close to what other languages call a for-each. So in this case, we have all these elements, and we're going to do for name in names, that's it; at any moment, name is going to be associated with an element of the list.
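A sketch of both flavors, with made-up names:

```python
names = ['Ada', 'Grace', 'Guido']
for name in names:        # a for-each: name is bound to each element in turn
    print(name)

for i in range(5):        # mimicking the classic counter loop
    print(i)              # 0, 1, 2, 3, 4
```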
While loops are part of the language, but they are usually discouraged in favor of for loops: if something can be coded with a for loop, it should be coded with a for loop and not a while loop, because, as you might know already, a while might result in an infinite loop if you're not checking the conditions correctly. So, the collections we
have in Python are the fundamental ones, the primitive ones. The most important one is, first, the list; in Python we make heavy usage of lists. It's just a heterogeneous data structure, so you can put anything in it; and actually, all these collections are heterogeneous, you can mix values as you want. In this case, we have added a few elements: one string, one integer, another string, and one Boolean. And let me say something here: even though Python supports mixed types in collections, it doesn't mean that you should do it. To be honest, you should usually avoid mixing types in collections, because that means we don't know what we're putting in them, right? We should be consistent; so it's advisable to revisit your code if you have too many different types in a collection. Checking the length is done with the len function. Accessing elements
is zero-indexed, and we use square brackets: so in this case, give me the first element, give me the second element. We can also index starting from behind, from the end: in this case, minus one, minus two, minus three again give you different elements. You can check the operations associated with these collections very quickly: on a list, with l.append we append a new element, so the list now has that element at the end. And we can check if an element is part of the list with in; in this case it's True, in this case it's False.
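A compact sketch of those list operations:

```python
l = ['a', 3, 'b', True]   # heterogeneous, though mixing types is discouraged

print(len(l))             # 4
print(l[0], l[1])         # 'a' 3        (zero-indexed)
print(l[-1], l[-2])       # True 'b'     (indexing from the end)

l.append('z')             # add an element at the end
print('z' in l)           # True  (membership check)
print('q' in l)           # False
```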
Tuples are similar to lists; they are also sequences, but the main difference is that they are immutable: there is no way to add new elements to a tuple, or remove elements from it, once it has been created. So in this case, we have created a tuple with three elements; we can access it and check if something is in it in the same way we did with a list. But again, you cannot modify a tuple: a tuple never changes, you can't add elements to it. Another important data structure is
a dictionary. In Python, a dictionary is a key-value mapping; it's similar to an object in JavaScript or a hash table in Java, a key-value mapping type. In this case, we are going to associate values to names. The way I like to explain it is: imagine we create a list out of all these elements; give me one second, we're going to create a list, there we go, we're going to copy these elements and assign them to a list. There you go. We could very well store the information about our customers in a list, right? That works, I mean, it can get the job done. The problem is that whenever I need to access information in this list, say, for example, give me the email of this customer, I have to remember the position where the email is located; in this case, it's going to be position number one. If this information grows, and instead of having four values or four pieces of information for our user we have 100, then it's going to be very hard to access those individual values. That's why we create dictionaries. Dictionaries are collections of values; the important part is the value, but instead of being indexed just by position, the values are given arbitrary, very explicit names: this is the name, this is the email, this is the age, and this is whether they are subscribed or not. Once we create this dictionary, we can access those values by name: give me the email of this user, or, is the age present for this user, is the last name present in the user dictionary? So again, it's a way to store information by associating explicit names to values, to make it simpler for us later.
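As a sketch, the same (hypothetical) user first as a position-indexed list, then as a dictionary:

```python
# As a list: you must remember that the email lives at position 1
user_list = ['Jane', 'jane@example.com', 34, True]
print(user_list[1])

# As a dictionary: values are accessed by explicit names
user = {
    'name': 'Jane',
    'email': 'jane@example.com',
    'age': 34,
    'subscribed': True,
}
print(user['email'])
print('last_name' in user)   # False -- key membership check
```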
Let me delete this and move forward to sets. Sets are a very common data structure when you're learning about collections and data structures in general; they're not so common in many other languages, but in Python we use them often, because they have a very interesting feature. First of all, and it's something that I forgot to tell you about dictionaries: both sets and dictionaries are what we call unordered data structures; you never know the order of the elements. In Python, with recent versions, there have been changes which make Python dictionaries ordered, but for now I'm going to say you shouldn't rely on it; you should think of your dictionaries as completely unordered data structures, and the same goes for sets. A set is a bag that contains elements, you know, a big bag: you keep throwing elements inside of the set, and there is no order in it. What you're going to do with it is add elements to the set, for example, or remove elements from the set. And there is one important thing that makes sets so useful, and it's the membership operation; I'm going to write it down here: the membership operation.
There you go; you can access these notebooks later. In the membership operation, the process of checking if something, say 9 in s, is extremely fast; it would be called O(1). And, as you might have seen here, when I created this set I included a couple of repeated elements, 3, 3, 3, right, 1, 1, 1, 7, 9, and the resulting set doesn't have those repeated elements. These are the two features of the set: the set will only contain unique values, and, by the way it's implemented behind the scenes, those unique values are extremely simple to check; the membership operation is extremely performant, very fast, different from, for example, a list. So keep it in mind: sets are very, very useful when you're checking for membership.
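A small sketch of the deduplication and the fast membership check:

```python
s = {3, 3, 3, 1, 1, 1, 7, 9}
print(s)          # {1, 3, 7, 9} -- duplicates are gone, order not guaranteed

s.add(5)          # add an element
s.remove(7)       # remove one

print(9 in s)     # True -- an O(1) membership check
```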
So again, as I told you before, we're going to iterate over collections with the for loop. In this case, if we have a list, it's going to be for element in list; there you go. If you have a dictionary, in this case user, the default iteration is by key: we're going to get name, email, age, subscribed, and we have to extract the value out of the dictionary ourselves. We could also do for value in user.values(); there you go. Or you can iterate over both key and value with items(): key and value, there you go. So iteration in Python is very readable, to put it that way. And again, remember, we're always using the for loop, which assumes you're iterating over a collection; we don't have the for (i = 0; i < 10; i++), but we can simulate it with for i in range(5), for example, with the range function generating pretty much those elements.
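A quick sketch of the three dictionary iteration styles, with a hypothetical user dict:

```python
user = {'name': 'Jane', 'email': 'jane@example.com', 'age': 34}

for key in user:                    # default: iterate over the keys
    print(key, user[key])

for value in user.values():         # just the values
    print(value)

for key, value in user.items():     # both at once
    print(key, '->', value)
```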
Something you might have heard about Python is that it has a huge ecosystem of libraries and modules, right, that you can just import, and it's going to work. There are so many things already coded in Python that it makes it very simple for you to create something on top. Do you want a library for, I don't know, security, cryptography, math, numeric processing (NumPy, right?), machine learning, web development, creating games (there is pygame), do you want to create a graphical user interface? Whatever you want to do, there is usually a library that has already been coded and will make your job easier. On top of that, there is the standard library, which is already included with Python; it's not third party, it's created by the Python core team. It's a huge library, so many modules. And the way it works is by importing a module; this is the way we work with packages and modules (there are differences between modules, packages and third-party libraries, but that's a little bit more advanced). Again, this gives us a random number generator, already built in, and you can check the docs right here.
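For instance, a sketch with the built-in random module:

```python
import random

print(random.randint(1, 6))              # a random integer between 1 and 6
print(random.choice(['a', 'b', 'c']))    # a random element from a list
```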
Exceptions arise whenever you do something that doesn't work. In this case, we ask if the age is greater than 21, but age is a string, not an integer, so this is going to fail. We can catch exceptions before they crash the program with a try and except block, right? In that case, if anything here fails, this block is going to kick in, and you can catch the exception without the program failing. And you can be more explicit about the error you expect.
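A sketch of that pattern:

```python
age = '30'   # a string, not an integer

try:
    if age > 21:                # comparing str to int raises TypeError
        print('adult')
except TypeError as e:          # be explicit about the error you expect
    print('bad data:', e)       # the program keeps running
```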
So again, this is just an introduction. It might be useful if you're coming from another language, especially to keep this notebook as a reference. We're going to be using Python a lot, of course, and it's a great language if you want to do scripting, web development, and of course data processing, data analysis, visualizations, machine learning, etc. Python is just great. So I hope this tiny review lesson helps you port your knowledge from other languages into Python. And that's it.