Data Analytics Full Course 2022 | Data Analytics For Beginners | Data Analytics Course | Simplilearn

Captions
Data generated by companies and individuals like us has grown massively over the past decade. As per statista.com, the global data generated and consumed would reach 181 zettabytes by 2025. Organizations need to gather and analyze such large volumes of data, look at their past business performance and results, and use that information to prepare for the future. That, essentially, is data analytics. So hey everyone, welcome to this interesting video tutorial of Data Analytics Full Course 2022 by Simplilearn. This video will help you gain all the necessary skills and knowledge within 12 hours to become a data analyst in 2022. Our experienced trainers will take you through the course. But before we begin, if you love watching tech videos, subscribe to our channel and hit the bell icon to never miss an update from us. So let's look at the topics we'll be learning in this video. We'll start by understanding the basics of data analytics and learn the top 10 data analyst skills for 2022. After that we learn the top 10 data analysis tools, followed by looking at a use case in Python. Next you'll get an idea of how to manipulate and visualize data using R packages, and then look at time series analysis in R. Then we'll shift our focus to understanding data analytics using Excel pivot tables and learn to create Excel dashboards. Moving further, you'll see some interesting data analytics projects related to coronavirus, Spotify, the World Happiness Report, and the Olympics. Finally, we will learn some of the important questions that are asked frequently in data analytics interviews. So let's get started. Hi guys, welcome to another video tutorial by Simplilearn. In today's session we will look at data analytics for beginners, but before we begin, make sure to subscribe to our channel and hit the bell icon to never miss an update. In this video we will discuss what data analytics is and the need for data analytics. Then we will look at the different ways in which data analytics can be used, followed by the various steps involved in the data analytics process. After that we will get an idea about the different tools used in data analytics and the companies using data analytics. Moving forward, we will see a case study on how Walmart uses data analytics for better customer service, and finally we will perform a regression analysis in R to predict sales based on the advertising expenditure from three mediums: TV ads, radio ads, and newspaper advertisements. So what is data analytics? Companies around the world are generating vast volumes of data every hour. This data could be in the form of log files, web server and transactional data, as well as various customer-related data. Also, data has been generated at a rapid rate from social media websites and applications such as Facebook, Instagram, Twitter, and WhatsApp. Companies want to use this data to derive value out of it and make business decisions; that's where data analytics comes into use. Data analytics is the process of exploring and analyzing large data sets to find hidden patterns and unseen trends, discover correlations, and derive valuable insights to make business predictions. Data analytics improves the speed and efficiency of your business. A few years ago, a business would have gathered information manually, performed statistical and complex analytics, and unearthed information that could be used for future decisions. But today, that business can identify insights on the fly for immediate decisions. Most organizations have big data, and many understand the need to harness that data and extract value out of it.
So they use a lot of modern tools and technologies to perform data analytics; some of these tools I will discuss in detail later in this tutorial. Now that we have looked at what data analytics really is, let us understand the ways in which you can use data analytics. First is improved decision making. Data analytics eliminates a lot of guesswork and manual tasks, from choosing the right content to planning marketing campaigns and developing products. Organizations can use the insights they gain from data analytics to make informed decisions, leading to better outcomes and customer satisfaction. It gives you a 360-degree view of your customers, which helps you understand their behavior completely, enabling you to better meet their needs. Second is better customer service. Data analytics provides you with more accurate insights into your customers, allowing you to tailor customer service to their needs, provide more personalization, and build stronger relationships with them. Your data can reveal information about your customers' communication preferences, their interests, their concerns, and more; it helps you give better recommendations for products and services. Next is efficient operations. Data analytics can help you streamline your processes, save money, and boost production. When you have an improved understanding of what your audience wants, you waste less time creating ads and content that don't match your audience's interests. This helps you optimize your campaigns, create better content strategies, and hence improve results. And finally we have effective marketing. When you understand your audience better, you can market to them more effectively. Data analytics also gives you useful insights into how your campaigns are performing, so that you can fine-tune them for optimal outcomes. You can also find the probable customers who are most likely to interact with a campaign and convert into leads. Now let's discuss the various steps involved in the data analytics process. As you can see on the screen, there are five process steps; let me walk you through them one by one. The first step is to understand the problem. Before starting with the analysis, you need to understand the business problem and define your goals. Asking questions at the outset is vital, because this would address issues such as: how can we reduce production costs without sacrificing quality? What are some of the ways to increase sales opportunities with our current resources? Do customers view our brand in a favorable way? Answers to these questions will help you build a clear road map with lucrative solutions. Also try to find out the key performance indicators, and consider the metrics to track along the way. The second step in the process is data collection. After you have finalized your goals, it's time to start looking for your data. Data collection is the process of gathering information on targeted variables identified as data requirements; the emphasis is on ensuring accurate and right data is collected. Data collection starts with primary sources, also known as internal sources. This is typically structured data gathered from CRM software, ERP systems, marketing automation tools, and others; these sources contain information about customers, finances, gaps in sales, etc. Under external sources you have both structured and unstructured data, so if you are looking to perform a sentiment analysis of your brand, you would gather data from various review websites or social media apps. The next step is to clean the data. The data which is collected from various sources is highly likely to contain incomplete, duplicate, and missing values, so you need to clean this unwanted, redundant data to make it ready for analysis. To generate accurate results, analytics professionals must identify duplicate or anomalous data and other inconsistencies that could skew the analysis. According to one report, sixty percent of data scientists say most of their time is spent cleaning the data, while 57 percent of data scientists say it's their least enjoyable task.
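To make the cleaning step concrete, here is a minimal pandas sketch (this is not from the video; the file name sales.csv and its columns are invented for illustration):

    import pandas as pd

    df = pd.read_csv("sales.csv")  # hypothetical raw export

    df = df.drop_duplicates()                                      # remove duplicate rows
    df["region"] = df["region"].str.strip().str.lower()            # normalize messy text
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())   # impute missing numbers
    df = df.dropna(subset=["customer_id"])                         # drop rows missing a key field

    print(df.isna().sum())  # confirm what is still missing

Real cleaning jobs involve many more rules than this, but drop_duplicates, fillna, and dropna cover a surprising share of the work.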
Now the fourth step in the process is data exploration and analysis. Once data is cleaned and ready, you can go ahead and explore the data using data visualization and business intelligence tools. You can also use various data mining and predictive modeling techniques to analyze the data and build models. You can use different supervised and unsupervised algorithms, such as linear regression, logistic regression, decision trees, k-NN, k-means clustering, and lots more, to build prediction models for making business decisions. And the final step is to interpret the results. This part is important because it's how a business will gain actual value from the previous four steps. Interpreting the results will help you find unseen trends and patterns in the data and gain insights, and you can run a validation check to see if the results are answering your questions. These results can be shown to your clients and stakeholders for better understanding and business collaboration. Now that we have looked at the various steps involved in data analytics, let's see the different tools that can be used to perform the above steps. As you can see, we have seven tools, including a few programming languages, that will help you perform analytics better. Now let's discuss them one by one. First we have Python. Python is an object-oriented, open source programming language that supports a range of libraries for data manipulation, data visualization, and data modeling. Python programmers have developed tons of free and open source libraries that you can use; you can find many of them via the Python Package Index (PyPI), the repository of Python software, and Python provides a default package installer called pip. Python has libraries such as NumPy for numerical computation of data, pandas to manipulate data in numerical tables and time series, and SciPy for technical and scientific computations. It also provides scikit-learn, which is a machine learning library for creating classification, regression, and clustering algorithms, and finally it has PyTorch and TensorFlow for deep learning.
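As a taste of that exploration-and-modeling step using the Python libraries just listed, here is a minimal sketch on synthetic data (nothing below comes from the video; the numbers are invented):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # synthetic data: advertising spend vs. sales, purely for illustration
    rng = np.random.default_rng(0)
    spend = rng.uniform(0, 100, 200)
    sales = 3.2 * spend + rng.normal(0, 10, 200)
    df = pd.DataFrame({"spend": spend, "sales": sales})

    X_train, X_test, y_train, y_test = train_test_split(
        df[["spend"]], df["sales"], test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print(model.coef_[0], model.score(X_test, y_test))  # slope and R^2 on held-out data

The same pattern (split, fit, score) carries over to logistic regression, decision trees, and the other algorithms mentioned above.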
Up next we have R. R is an open source programming language majorly used for numerical and statistical analysis. It provides a range of libraries for data analysis and visualization; some of these libraries are ggplot2, tidyverse, plotly, dplyr, and caret. Then we have Tableau. Tableau is a popular data visualization and analytics tool that helps you create a range of visualizations to interactively present the data, and build reports and dashboards to showcase insights and trends. It can connect with multiple data sources and surface hidden business insights and patterns. Then we have a competitor of Tableau, which is Power BI. Power BI is a business intelligence tool developed by Microsoft that has easy drag-and-drop functionality and supports multiple data sources, with features that make data visually appealing. Power BI supports features that help you ask questions of your data and get immediate insights, and you can also forecast your data to predict future trends. The next tool is QlikView. QlikView provides interactive analytics with in-memory storage technology to analyze vast volumes of data and use data discoveries to support decision making. It provides social media discovery and interactive guided analytics, and it can manipulate huge data sets instantly with accuracy. Up next we have Apache Spark. Apache Spark is an open source data analytics engine that processes data in real time and carries out complex analytics using SQL queries and machine learning algorithms. It supports Spark Streaming for real-time analytics and Spark SQL for writing SQL queries. It also has Spark MLlib, a library with a repository of machine learning algorithms, and GraphX for graph computation.
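To show what those Spark pieces look like in practice, here is a small sketch using PySpark and Spark SQL; it assumes a local Spark installation and a hypothetical transactions.csv, neither of which appears in the video:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

    # read a hypothetical CSV with a header row, letting Spark infer column types
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # Spark SQL: register the DataFrame as a view and query it like a table
    df.createOrReplaceTempView("transactions")
    spark.sql("""
        SELECT region, SUM(amount) AS total
        FROM transactions
        GROUP BY region
        ORDER BY total DESC
        LIMIT 5
    """).show()

The same query would run unchanged whether the data sits on one laptop or is spread across a cluster, which is the point of Spark.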
And finally we have SAS. SAS is statistical analysis software that can help you perform analytics, visualize your data, write SQL queries, perform statistical analysis, and build machine learning models to make future predictions. SAS empowers its customers to move the world forward by transforming data into intelligence, and it is investing a lot to drive software innovation for analytics; Gartner has positioned SAS as a Magic Quadrant leader for data science and machine learning. Moving on to the applications of data analytics: data analytics is used in almost every sector of business these days, so let's discuss a few of them. First we have retail. Customers expect retailers to understand exactly what they need and when they need it, and data analytics helps retailers meet those demands. Retailers not only have an in-depth understanding of their customers, but they can also predict trends, recommend new products, and boost profitability. Retailers create assortments based on customer preferences, invoke the most relevant engagement strategy for each customer, and optimize supply chain and retail operations at every step of the customer journey. The second application is in healthcare. Healthcare industries analyze patient data to provide life-saving diagnosis and treatment options; they also deal with healthcare plans and insurance information to drive key insights using analytics. They can discover new drugs and come up with new drug development methods. Advanced analytics allows healthcare companies to improve patient outcomes and experience, and cancer cells and diabetic retinopathy can be discovered using medical imaging. At number three we have manufacturing. For manufacturers, solving problems is nothing new: they fight with difficult problems and situations on a daily basis, from complex supply chains to motion applications to labor constraints and equipment breakdowns. Using data analytics, manufacturing sectors can discover new cost-saving and revenue opportunities. The fourth application is related to the banking sector. Banking and financial institutions collect vast volumes of structured and unstructured data to derive analytical insights and make sound financial decisions. Using analytics, they can find probable loan defaulters and the customer churn rate, and detect fraudulent transactions immediately. The final application is in logistics. Logistics companies use data analytics to develop new business models that can ease their business and improve productivity. They can optimize routes to ensure deliveries arrive on time in a cost-efficient manner, and they also focus on improving order-processing capabilities as well as performance management. With that, now let's look at the companies using data analytics on a daily basis. We have the e-commerce giant Amazon, then we have Accenture, followed by the American healthcare service organization Cigna. Then we have the American supplier of health information technology solutions, services, devices, and hardware, Cerner, followed by Target and the antivirus company McAfee. Next we have Rapido, an Indian bike rental company based in Bangalore, and after that we have Flipkart and the world's largest retail company, Walmart. So the sky's the limit on what you can use it for. Let's take a look at the types of data analytics. This can be broken up in many ways, but we're going to start by looking at the most basic questions you're going to be asking in data analytics. The first one is descriptive analytics: what has happened? Hindsight. What's the sales-per-call ratio coming out of the call center? If we have 500 tourists in a forest and a certain temperature, how many fires were started? How many times did the police have to show up at certain houses? All of that is descriptive. The next one is predictive. Predictive analytics is what will happen next; we want to predict. This is great if you have an ice cream store and you want to predict how many people should work at the store on a certain day, based on the temperature coming up and the time of the year. And then one of the fastest-growing and most important parts of the industry is now prescriptive analytics, and you can think of that as combining the first two: you have descriptive and you have predictive, then you get prescriptive analytics. How can we make it happen? Foresight. What can we change to make this work better? In all the industries we looked at before, we can start asking these questions. In city development there's a good one: if we want our city to generate more income, and we want that income to be commercial based, what kind of commercial buildings do we need to build in that area that are going to bring people over? Do we need huge warehouse-sale, Costco-style buildings, or do we need little mom-and-pop joints that are going to bring in people from the country to come shop there? Or do you want an industrial setup? What do you need to bring that industry in? Is the car industry present in that area? If it's not the car industry, what other industries are in that area? All those things are prescriptive. We're making educated guesses: what can we do to fix it? What can we do to fix crime in an area with education? What kind of education are we going to use to help people understand what's going on, so that we lower the rate of crime and help our communities grow better? That's all prescriptive. We want foresight into how we can make it happen, how we can make this better. And we really can't go into enough detail on these three, because a lot of people stumble on this when they come in and start doing analytics. Whether you're the manager, a shareholder, or the data scientist coming in, you really need to understand descriptive analytics, where you're studying, say, the total units of furniture sold and the profit that was made in the past. Then we go into predictive analytics: predicting the total units that will sell and the profit we can expect in the future, gearing up for how many employees we need and how much money we're going to make. And prescriptive analytics: finding ways to improve the sales and the profit, so maybe we sell a different kind of furniture; we make an educated guess at what the area is looking for and how that marketing is going to change. Hello everyone, we welcome you all to this video by Simplilearn. In today's session we will learn a really interesting topic: the top 10 skills to become a data analyst in 2022.
In today's digital world, data is being generated by companies and individuals every second, so the role of a data analyst holds supreme importance. If you're looking for a career in data analytics, this video will help you learn what a data analyst does and the various skills you need to possess to become a data analyst in 2022. Before we get started, make sure you subscribe to the Simplilearn channel and hit the bell icon to never miss an update from us. Let's look at the agenda for this video. First we will understand who a data analyst is, then we will go through the top 10 data analyst skills for 2022. Moving on, we will look at the salary of a data analyst, and finally we will look at the companies hiring data analysts. So now let's understand who a data analyst is. A data analyst is a professional who collects business data from various sources, interprets it, and uses various statistical tools and techniques to extract insights and useful information from it. They acquire data from primary or secondary data sources and maintain databases. They also recognize and understand the organization's goals and collaborate with different team members, such as programmers, business analysts, and data scientists, to build an effective solution to a business problem. Now, with this basic understanding of who a data analyst is, let's learn the top 10 data analyst skills for 2022. At number one we have Structured Query Language, or SQL. SQL is a top skill that every data analyst should have. Data analysts use SQL commands and functions to store, process, analyze, and manipulate structured data using relational and NoSQL databases. They also build data models and write complex SQL queries and scripts to gather and extract information from several databases and data warehouses. Some of the popular databases a data analyst should be familiar with are Microsoft SQL Server, MySQL, PostgreSQL, and IBM Db2.
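The video doesn't show an actual query, but the kind of SQL a data analyst writes daily looks like the sketch below; it uses Python's built-in sqlite3 module with a made-up orders table so it runs anywhere:

    import sqlite3

    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    con.executescript("""
        CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'north', 120.0), (2, 'south', 80.0),
                                  (3, 'north', 200.0), (4, 'west', 50.0);
    """)

    # aggregate revenue by region, keeping only the busier regions
    for row in con.execute("""
            SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
            FROM orders
            GROUP BY region
            HAVING SUM(amount) > 60
            ORDER BY revenue DESC"""):
        print(row)

The same SELECT / GROUP BY / HAVING pattern carries over to SQL Server, MySQL, PostgreSQL, and Db2 with only minor dialect differences.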
The second important skill for a data analyst is Microsoft Excel. Microsoft Excel is one of the most popular and oldest spreadsheet applications for creating reports, performing calculations, and analyzing data. Data analysts need to know how to handle tabular data in Excel, so they should be aware of features like sorting, filtering, conditional formatting, pivot tables, and what-if analysis, and functions such as SUMIFS and COUNTIFS. The third crucial data analyst skill for 2022 is data cleaning and wrangling. Usually the data collected by analysts from various heterogeneous sources is messy and contains a lot of missing values, so it is always crucial to clean the data and remove noisy, missing, or erroneous elements. It is also important to format data using tools and methods before using it for analysis. Data analysts are responsible for data mining as well: the data mined from various sources is organized in order to obtain new information from it. Some of the tools you need to know for data cleaning and wrangling are Excel Power Query and OpenRefine. The fourth skill on our list is mathematics and statistics. Data analysts often work with high-dimensional data, with more than three dimensions; in order to interpret such data, they need to be good at linear algebra and calculus. They also build predictive and statistical models, such as linear regression, logistic regression, naive Bayes, and k-means clustering, and in order to understand the workings of these algorithms they must have knowledge of statistics and probability. Coming to the fifth important skill for a data analyst in 2022, we have programming. Data analysts need to master at least one programming language, preferably Python or R. In order to work with complex business problems, analysts need to write scripts and user-defined functions to automate tedious tasks. Python and the R language provide a collection of different libraries and packages, such as NumPy, pandas, dplyr, Matplotlib, and ggplot2, which data analysts can use to discover trends and patterns in complex data sets.
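To connect the Excel skill and the programming skill, here is a small pandas sketch of what SUMIFS, COUNTIFS, and a pivot table look like in code (the tiny table is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "region":  ["north", "south", "north", "west"],
        "units":   [10, 4, 7, 2],
        "revenue": [120.0, 80.0, 200.0, 50.0],
    })

    print(df.loc[df["region"] == "north", "revenue"].sum())  # like SUMIFS
    print((df["units"] > 5).sum())                           # like COUNTIFS

    # pivot-table-style summary, like Excel's PivotTable feature
    print(df.pivot_table(index="region", values=["units", "revenue"], aggfunc="sum"))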
After this we have data visualization as our sixth skill. Another part of the data analyst job role is to visualize large volumes of data and prepare summary reports and dashboards for the leadership team and clients, so that they can make timely business decisions. To do this, data analysts use various data visualization tools such as Power BI, Tableau, and QlikView. Using these tools, data analysts can integrate various data sets, apply join conditions, sort and filter data, as well as create different visualizations using charts and graphs. The seventh skill for a data analyst is industry knowledge. Data analysts should have a good knowledge and understanding of the industry or domain they are working in. For example, if you're working in the healthcare domain, you need to know how healthcare analytics can be applied to improve patient care. You should have knowledge about the challenges faced in healthcare and how you can leverage data and analytics to solve those issues; only if you have strong industry knowledge can you try to improve the business. The eighth skill that is important for a data analyst in 2022 is problem solving. A business deals with several problems on a daily basis, and data analysts should be ready to face those challenges. Data analysts are expected to use their problem-solving skills to work with the team, troubleshoot what went wrong, and provide an effective solution via data analysis. A data analyst with good problem-solving skills can help a business identify current and potential issues and determine a viable solution based on the data it collects. The ninth skill on our list is analytical thinking. Data analysts need analytical thinking ability to break down a complex problem into simple components and resolve those components one by one; it is a must-have skill for data analysts. Analytical thinking includes deciding the parameters that need to be considered for defining data sets, analyzing them from different perspectives, and determining variable dependencies. Coming to the 10th skill among the top 10 skills for a data analyst in 2022, we have communication. Data analysts don't just interact with computers and programs; they also interact with team members, stakeholders, and data suppliers, so good communication skills are essential. Data analysts also present their findings in front of an audience who might not be familiar with the analytical methods and processes, so they need to clearly translate their findings and insights into non-technical terms. So those were the top 10 skills a data analyst needs to possess in 2022. Do you think we missed out on any skills? Then please put your answers in the comments section below. Now let's look at the salary of a data analyst. According to Glassdoor, the average annual salary for a data analyst in the United States is $69,517, while in India you can earn nearly 7 lakh rupees per annum. Finally, let's look at the top companies that are hiring data analysts in 2022. Here we have the consultancy and Big Four giant Deloitte and the healthcare IT company Cerner Corporation; then we have the tech giant IBM, the retail company Walmart, and the e-commerce leader Amazon. To achieve the goals of data analysis, we use a number of data analysis tools. Companies rely on these tools to gather and transform their data into meaningful insights. So which tool should you choose to analyze your data? Which tool should you learn if you want to make a career in this field? We will answer that in this session. After extensive research, we have come up with these top 10 data analysis tools; here we will look at the features of each of these tools and the companies using them. So let's start off. At number 10 we have Microsoft Excel. All of us would have used Microsoft Excel at some point, right? It is easy to use and one of the best tools for data analysis. Developed by Microsoft, Excel is basically a spreadsheet program; using Excel you can create grids of numbers, text, and formulae, and it is one of the most widely used tools, be it in a small or large setup. The interface of Microsoft Excel looks like this. Let's now move on to the features of Excel. Firstly, the Windows version of Excel supports programming through Microsoft's Visual Basic for Applications (VBA). Programming with VBA allows spreadsheet manipulation that is difficult with standard spreadsheet techniques; in addition, the user can automate tasks such as formatting or data organization with VBA. One of the biggest benefits of Excel is its ability to organize large amounts of data into orderly, logical spreadsheets and charts; by doing so, it's a lot easier to analyze data, especially while creating graphs and other visual data representations, and the visualization can be generated from a specified group of cells. Those were a few of the features of Microsoft Excel. Let's now have a look at the companies using it. Most organizations today use Excel; a few that use it for analysis are the UK-based company Ernst & Young, then we have UrbanPro, Wipro, and Amazon. Moving on to our next data analysis tool: at number nine we have RapidMiner, a data science software platform. RapidMiner provides an integrated environment for data preparation, analysis, machine learning, and deep learning. It is used in almost every business and commercial sector, and it supports all the steps of the machine learning process. Seen on your screen is the interface of RapidMiner. Moving on to the features of RapidMiner: firstly, it offers the ability to drag and drop; it is very convenient to just drag and drop some columns as you are exploring a data set and working on an analysis. RapidMiner allows the usage of any data, and it also gives you the opportunity to create models which are used as a basis for decision making and formulation of strategies. It has data exploration features, such as graphs, descriptive statistics, and visualization, which allow users to get valuable insights, and it has more than 1,500 operators for every data transformation and analysis task. Let's now have a look at the companies using RapidMiner: we have the Caribbean airline Leeward Islands Air Transport, next we have the UnitedHealth Group, the American online payment company PayPal, and the Austrian telecom company Mobilkom. So that was all about RapidMiner. Now let's see which tool we have at number eight: it's Talend. Talend is an open source software platform which offers data integration and management, and it specializes in big data integration. Talend is available in both open source and premium versions.
It is one of the best tools for cloud computing and big data integration. The interface of Talend is as seen on your screen. Moving on to the features of Talend: firstly, automation is one of the great boons Talend offers; it even maintains the tasks for the users, which helps with quick deployment and development. It also offers open source tools, which Talend lets you download for free, and development costs reduce significantly as processes gradually speed up. Talend provides a unified platform: it allows you to integrate with many databases, SaaS offerings, and other technologies, and with the help of the data integration platform you can work with flat files, relational databases, and cloud apps ten times faster. Those were the features of Talend. The companies using Talend are Air France, L'Oreal, Capgemini, and the American multinational pizza restaurant chain Domino's. Next on the list, at number seven, we have KNIME, the Konstanz Information Miner. KNIME is a free and open source data analytics, reporting, and integration platform. It can integrate various components for machine learning and data mining through its modular data pipelining concept. KNIME has been used in pharmaceutical research and other areas like CRM customer data analysis, business intelligence, text mining, and financial data analysis. Here is how the interface of the KNIME application looks. Now coming to the KNIME features: KNIME provides an interactive graphical user interface to create visual workflows using the drag-and-drop feature. The use of JDBC allows assembly of nodes blending different data sources, including preprocessing such as ETL (extraction, transformation, loading), for modeling, data analysis, and visualization with minimal programming, and it supports multi-threaded, in-memory data processing. KNIME allows users to visually create data flows, selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME Server automates workflow execution and supports team-based collaboration. KNIME integrates various other open source projects, such as machine learning algorithms from Weka, H2O, Keras, Spark, and the R project, and it allows analysis of 300 million customer addresses, 20 million cell images, and 10 million molecular structures. Some of the companies hiring for KNIME are UnitedHealth Group, ASML, Fractal Analytics, Atos, and the Lego Group. Let's now move on to the next tool: we have SAS at number six. SAS facilitates analysis, reporting, and predictive modeling with the help of powerful visualizations and dashboards. In SAS, data is extracted and categorized, which helps in identifying and analyzing data patterns. As you can see on your screen, this is how the interface looks. Moving on to the features of SAS: with SAS, better analysis of data is achieved by using automatic code generation and SQL. SAS allows you access through Microsoft Office, by letting you create reports using it and by distributing them through it, and SAS helps with an easy understanding of complex data and allows you to create interactive dashboards and reports. Let's now have a look at the companies using SAS: we have companies like Genpact, IKEA, Accenture, and IBM, to name a few. That was all about SAS. So, for all those who joined in late, let me just quickly repeat our list: at number 10 we have Microsoft Excel, then at number nine we have RapidMiner, at number eight we have Talend, at number seven we have KNIME, and at number six we have SAS. So far, do you all agree with this list? Let us know in the comments section below. Let's now move on to the next five tools in our list.
So at number five we have both R and Python; yes, we have two of them in the fifth position. R is a programming language which is used for analysis as well; it has traditionally been used in academics and research. Python is a high-level programming language which has a Python data analysis library, and it is used for everything, starting from importing data from Excel spreadsheets to processing them for analysis. This is the interface of R, and next up is the interface of the Python Jupyter Notebook. Let's now move on to the features of both R and Python. When it comes to availability, it is very easy: both R and Python are completely free, hence they can be used without any license. R used to compute everything in memory, and hence the computations were limited, but now that has changed; both R and Python have options for parallel computation and good data-handling capabilities. As mentioned earlier, since both R and Python are open in nature, all the latest features are available without any delay. Moving on to the companies using R, we have Uber, Google, and Facebook, to name a few. Python is used by many companies; again, to name a few, we have Amazon, Google, and the American photo and video sharing social networking service Instagram. That was all about R and Python. At number four we have Apache Spark. Apache Spark is an open source engine developed specifically for handling large-scale data processing and analytics. Spark offers the ability to access data in a variety of sources, including the Hadoop Distributed File System (HDFS), OpenStack Swift, Amazon S3, and Cassandra. It allows you to store and process data in real time across various clusters of computers using simple programming constructs. Apache Spark is designed to accelerate analytics on Hadoop while providing a complete suite of complementary tools that include a fully featured machine learning library, a graph processing engine, and stream processing. So this is how the interface of Apache Spark looks. Now let's look at the important features of Apache Spark. Spark stores data in RAM, hence it can access the data quickly and accelerate the speed of analytics; Spark helps run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. It supports multiple languages and allows developers to write applications in Java, Scala, R, or Python. Spark comes with 80 high-level operators for interactive querying, and it supports batch processing, joining streams against historical data, and running ad hoc queries on stream state. Analytics can be performed better, as Spark has a rich set of SQL queries, machine learning algorithms, complex analytics, and so on. Apache Spark provides fault tolerance through Spark RDDs: Resilient Distributed Datasets are designed to handle the failure of any worker node in the cluster, thus ensuring that the loss of data reduces to zero. Conviva, Netflix, IKEA, Lockheed Martin, and eBay are some of the companies that use Apache Spark on a daily basis. At number three we have another important, growing data analysis tool: QlikView. QlikView software is a product of Qlik for business intelligence and data visualization. QlikView is a business discovery platform that provides self-service BI for all business users and organizations. With QlikView, you can analyze data and use your data discoveries to support decision making, and QlikView is a leading business intelligence and analytics platform in the Gartner Magic Quadrant. On the screen you can see how the interface of QlikView looks.
Now, talking about its features: QlikView provides interactive guided analytics with in-memory storage technology. During the process of data discovery and interpretation of collected data, the QlikView software helps the user by suggesting possible interpretations. QlikView uses a patented in-memory architecture for data storage: all the data from the different sources is loaded into the RAM of the system, where it is ready to be retrieved. It has the capability of efficient social and mobile data discovery. Social data discovery offers to share individual data insights within groups or outside them, and a user can add annotations on top of someone else's insights on a particular data report. QlikView supports mobile data discovery with an HTML5-enabled touch feature, which lets the user search the data, conduct data discovery interactively, and explore other server-based applications. QlikView performs OLAP and ETL features to perform analytical operations, extract data from multiple sources, transform it for usage, and load it into a data warehouse. The companies that can help you start your career in QlikView are Mercedes-Benz, Capgemini, Citibank, Cognizant, and Accenture, to name a few. At number two we have Power BI. Power BI is a business analytics solution that lets you visualize your data and share insights across your organization, or embed them in your app or website. It can connect to hundreds of data sources and bring your data to life with live dashboards and reports. Power BI is the collective name for a combination of cloud-based apps and services that help organizations collate, manage, and analyze data from a variety of sources through a user-friendly interface. Power BI is built on the foundation of Microsoft Excel and has several components, such as the Windows desktop application called Power BI Desktop, an online software-as-a-service called Power BI Service, and mobile Power BI apps available on Windows phones and tablets as well as for iOS and Android devices. Here is how the Power BI interface looks; as you can see, there is a visually interactive sales report with different charts and graphs. Moving on to the features of Power BI: it has easy drag-and-drop functionality, with features that make data visually appealing, and you can create reports without knowing any programming language. Power BI helps users see not only what has happened in the past and what's happening in the present, but also what might happen in the future. It offers a wide range of detailed and attractive visualizations to create reports and dashboards, and you can select several charts and graphs from the visualization pane. Power BI has machine learning capabilities, with which it can spot patterns in data and use those patterns to make informed predictions and run what-if scenarios. Power BI supports multiple data sources, such as Excel, text, CSV, Oracle, SQL Server, PDF, and XML files. The platform integrates with other popular business management tools like SharePoint, Office 365, and Dynamics 365, as well as other non-Microsoft products like Spark, Hadoop, Google Analytics, SAP, Salesforce, and Mailchimp. Some of the companies using Power BI are Adobe, AXA, Carlsberg, Capgemini, and Nestle. Moving on to the next tool: any guesses as to what we have at number one? You can comment in the chat section below. Finally, at the top of the pyramid, we have Tableau. Gartner's Magic Quadrant of 2020 classified Tableau as a leader in business intelligence and data analysis. Tableau, the interactive data visualization software company, was founded in January 2003 in Mountain View, California.
Tableau is data visualization software that is used for data science and business intelligence. It can create a wide range of different visualizations to interactively present data and showcase insights. The important products of Tableau are Tableau Desktop, Tableau Public, Tableau Server, Tableau Online, and Tableau Reader. This is how the interface of Tableau Desktop looks. Now coming to the features of Tableau: data analysis is very fast with Tableau, and the visualizations created are in the form of dashboards and worksheets. Tableau delivers interactive dashboards that support insights on the fly, and it can translate queries to visualizations and import all ranges and sizes of data. Writing simple SQL queries can help join multiple data sets and then build reports out of them. You can create transparent filters, parameters, and highlighters, and Tableau allows you to ask questions, spot trends, and identify opportunities. With the help of Tableau Online, you can connect with cloud databases, Amazon Redshift, and Google BigQuery. The companies using Tableau are Deloitte, Adobe, Cisco, LinkedIn, and the American e-commerce giant Amazon, to name a few. And there you go: those are the top 10 data analysis tools. Since this is data analysis with Python, we've got to ask the question: why Python for data analytics? I mean, there's C++, there's Java, there's .NET from Microsoft; why do people go to Python for it? There are a number of reasons. One, it's easy to learn, with simple syntax. You don't have the strict typing you do in Java and other languages, so it allows you to be a little lazy in your programming. That doesn't mean it can't be set up strictly, or that you don't have to be careful; it just means you can spin up code much quicker in Python. The same task that takes one, two, three, or four lines in Python would often take me 10, 12, 13, even 20 lines when I did it in, say, Java, depending on what it was. It's very scalable and flexible. There's your flexibility, because you can do a lot with it, and you can easily scale it up: you can go from something on your machine to using PySpark under the Spark environment and spread that across hundreds, if not thousands, of servers, across terabytes or petabytes of data. So it's very scalable. There's a huge collection of libraries. This one's always interesting, because Java has a huge collection of libraries, C does, .NET does, and they're always in competition to get those libraries out; Scala for your Spark, all of those have huge library collections. This is always changing, but because Python is open source, you almost always have easy-to-access libraries that anybody can use; you don't have to go check your licensing and have special licensing like you do with some packages. Graphics and visualization: Python has really powerful packages for that, so it makes it easy to create nice displays for people to read. And community support: because Python is open source, it has a huge community that supports it; you can do a quick Google search and probably find a solution for almost anything you're working on. Python libraries: let's bring it together. We have data analytics and we have Python, so when we're talking data analytics, we're talking Python libraries for data analytics, and the big five players are NumPy, pandas, Matplotlib, SciPy (which is going to be in the background, so we're not going to talk too much about the scientific formulas inside SciPy), and scikit-learn.
So, NumPy supports n-dimensional arrays and provides numerical computing tools useful for linear algebra and Fourier transforms. You can think of this as just a grid of numbers, and you can even have a grid inside a grid. It's not even restricted to numbers, because you can also put words and characters and just about anything into that array; but think of a grid, then a grid inside a grid, and you end up with a nice three-dimensional array. If you want to picture a three-dimensional array, think of images: you have your three channels of color (four if you have an alpha), and then you have your x-y coordinates for the image, so you can go x, y, and then the three channels that generate that color. And NumPy isn't restricted to three dimensions. You could imagine watching a movie: now you have your movie clips, each with some number of frames, each frame with its x-y coordinates for the picture, and then your three dimensions for the colors. So NumPy is just a great way to work with n-dimensional arrays. Working closely with NumPy is pandas: useful for handling missing data, performing mathematical operations, and providing functions to manipulate data. pandas is becoming huge, because it is basically a data frame, and if you're working with big data in Spark or any of the other major packages out there, you realize that the data frame is very central to a lot of that. You can look at it as an Excel spreadsheet: you have your columns, you have your rows or indexes, and you can do all kinds of different manipulations of the data within, including filling in missing data, which is a big thing when you're dealing with large pools or lakes of data that might be collected differently from different locations. And Matplotlib (we did skip over SciPy, which covers a lot of mathematical computations and usually runs in the background behind NumPy and pandas, although you do use it directly for plenty of other things) is the final part: that's what you want to show people, and it is your plotting library in Python. Several toolkits extend Matplotlib functionality; there are something like a hundred different toolkits, ranging from one built specifically for properly displaying star constellations in astronomy, all the way to some very generic ones. We'll actually add Seaborn in when we do the labs in a minute. These toolkits extend Matplotlib's functionality and let you create interactive visualizations, so there are all kinds of cool things you can do as far as displaying graphs; there are even some with which you can create interactive graphs. We won't do the interactive graphs, but you'll get a pretty good grasp of some of the different things you can do in Matplotlib.
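Before we open the notebook, here is a minimal sketch (not from the demo) of the three libraries working together, with Seaborn standing in as one of those Matplotlib-extending toolkits:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_theme()  # Seaborn restyles Matplotlib's output

    # NumPy generates the numbers, pandas wraps them with labels
    x = np.linspace(0, 10, 50)
    df = pd.DataFrame({"x": x, "y": np.sin(x) + np.random.normal(0, 0.1, 50)})

    plt.plot(df["x"], df["y"], label="noisy sine")  # Matplotlib draws the chart
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()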
Let's jump over to the demo, which is my favorite part: roll up our sleeves and get our hands in on what we're doing. Now, there are a lot of options when we're dealing with Python. You can use PyCharm, which is a really popular one; you'll see it all over the place, and it's one of the main IDEs out there. There are a lot of other ones; I used to use NetBeans, which has kind of lost favor (I don't even have it installed on my new computer). But while PyCharm is really popular for general Python development, for data science we usually go to the Jupyter Notebook or Anaconda, and we're going to jump into Anaconda, because that's my favorite one to go to: it has a lot of external tools for us. We're not going to dig into those, but we will pop in there so you can see what it looks like. So with Anaconda we have JupyterLab and we have the Notebook; these are near-identical. JupyterLab is an upgrade to the notebooks with multiple tabs, that's all it is, and we'll be using the Notebook. And you can see that PyCharm is so popular with Python that we even have it highlighted here in Anaconda as part of the setup; the Jupyter Notebook can also run standalone. So we're actually going to be running the Jupyter Notebook, and then you have your different environments. We're going to be under my main py36 environment; there's a root one, and I usually label mine py36. The reason is that, as of recording this, TensorFlow only works in Python 3.6, and not in 3.7 or 3.8, for doing neural networks. But you can have multiple environments, which is nice: they separate the kernels, so it helps protect your computer when you're doing development. And this is just a great way to do a display or a demo: especially if you're looking for that job, pull up your laptop and open it up, or if you're doing a meeting, get it broadcast up on the big screen so that the CEO can see what you're looking at. When we launch the Notebook, it actually opens up a file browser in whatever web browser you have; this happens to be Chrome. Then you can just go under New (there are a lot of different options depending on what you have installed), choose Python 3, and this creates an untitled notebook. You can see here I'm actually in a Simplilearn folder, for other work I've done for Simplilearn; that's where I save all my stuff, and I can browse through other folders, making it really easy to jump from one project to another. Under here we'll go ahead and change the name of this and rename it "data analytics", just so I can remember what I was doing, which is probably the name of about 50 percent of the files in here right now. So let's go ahead and jump in and take a look at some of these different tools we were looking at. As we go through the demo, let's start with NumPy, the least visually exciting, and I'm going to zoom in here so you can see what we're doing. The first thing we want to do is import NumPy, and we'll import it as np; that is the most common NumPy convention. Let's also change the view so we have the line numbers; we probably won't need them, but they make for easy reference. Then we'll create a one-dimensional array. We'll just call this arr1, and it equals np.array, and you put your array information in here. In this case we'll spell it out (you can actually use a range, and there are lots of other ways to generate these arrays), but we'll just do one, two, three: three integers. And if we print our arr1 and run this, you can see it prints 1 2 3. You can see why this is a really nice interface for showing other people what you're doing with the Jupyter Notebook. So this is the basics: we've created a one-dimensional array, and the array is one, two, three. One of the nice things about the Jupyter Notebook is that whatever ran in the first cell is still running in the kernel, so it still has NumPy imported as np and it still has our variable arr1 equal to np.array of one, two, three.
So when we go to the next cell, we can check the type of the array. We're just going to print: hey, what is this thing set up in here? We want type, and then the type of arr1. Let's go ahead and run that, and it says class numpy.ndarray: so it's its own class, and all we're doing is checking what that class is. And on the array class, probably the biggest thing you'll use (I don't know how many times I find myself doing this, because I forget whether I'm working with a three-dimensional or four-dimensional array and have to reformat somehow so it works with whatever else I have) is the array shape. The shape here is just three, because it has three members and it's a one-dimensional array; that's all that is. And with the NumPy array we can easily access elements. We'll stick with the print statement, though note that if you put a variable on the last line of a cell in the Jupyter Notebook, it acts the same as a print statement: arr1 of 2 on its own is the same as doing print of arr1 of 2; those are identical statements in our Jupyter Notebook. We'll stick with the print on this one, and it's 3: there's our element at position 2, since we count zero, one, two, and position two holds 3. We can easily change that: we set arr1 at place 2 equal to 5, and then if we print our arr1, you can see right down here that it comes out as one, two, and five. There I left the print statement off; because it's the last variable in the cell, it'll always print the variable if you just put it in like that. That's a Jupyter Notebook thing; don't do that in PyCharm (I've forgotten before, while doing a demo). And we talked about multiple dimensions, so we'll create arr2, a two-dimensional array. This is again a NumPy array, and in it we need our first row, one two three, and our second row, three four five. We'll just put arr2 on the last line and run that, and there's our arr2: one two three, three four five. We can also index it: arr2 of 1, and then (it doesn't matter which one, let's actually do 2) if I run this, it prints out 5, because here we are: one two three is row zero, three four five is row one, and we start counting columns at zero, so zero, one, two lands on the five. Then maybe we forgot what we were working with, so we do arr2.shape; if we run that, we see we have two rows and each row has three elements: a two-dimensional array, two by three. If you looked up earlier when we did it before, it just had three comma nothing: when you have a single dimension, the shape is saved as a tuple with a blank second slot. But you can see right here we have two comma three. And if you remember from up here, we just did arr2 of one comma two; we run that, we get the 5. You can also count backwards, which is kind of fun, and you'll notice I just switched something on you, because one comma two gets you to the same spot. Now 2 is the last column (zero, one, two: it's the last one in there), and we can count backwards and do minus one. If we run this, we get the same answer whether we count forwards as 0, 1, 2, or backwards as minus 1, minus 2, minus 3. And you can see that if I change this minus 1 to a minus 2 and run that, I get 4, which is counting backwards: minus 1, minus 2.
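Collected from the narration so far, this whole first part of the demo fits in one cell (reconstructed, so treat it as a close approximation of what's on screen):

    import numpy as np

    arr1 = np.array([1, 2, 3])        # one-dimensional array
    print(arr1)                       # [1 2 3]
    print(type(arr1))                 # <class 'numpy.ndarray'>
    print(arr1.shape)                 # (3,) -- a one-entry tuple

    arr1[2] = 5                       # change the element at index 2
    print(arr1)                       # [1 2 5]

    arr2 = np.array([[1, 2, 3],
                     [3, 4, 5]])      # two-dimensional array
    print(arr2.shape)                 # (2, 3): two rows, three columns
    print(arr2[1, 2])                 # 5 -- row 1, column 2, zero-indexed
    print(arr2[1, -1], arr2[1, -2])   # 5 4 -- counting backwards from the end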
So there are a lot of different ways to reference what we're working on inside the NumPy array; it's a really cool tool with a lot you can do. And we talked about the fact that it can also hold things that are not numeric values, so we'll make arr_s, for strings, equals np.array, put our setup in there with brackets, and let's go china, india, usa, mexico; it doesn't matter, we can put whatever we want in here. If we print that out and run it, you can see that we get another NumPy array: china, india, usa, mexico. It even gives us a dtype of <U6 (unicode strings of up to six characters). And a lot of times when you're messing with data you'll want a range, so we'll make arr_r, for range, just to keep it uniform, equals np.arange. This is a command inside NumPy to create a range of numbers; if you're testing data, maybe you have equal time increments that are spaced a certain distance apart, but in this case we're just going to do integers, from 0 to 20, skipping every other one. We'll print it out and see what that looks like, and you can see we have 0, 2, 4, 6, 8, 10, 12, 14, 16, 18; like you expected, it skips every other one. And just a quick note: there's no 20 on here. Why? Well, this starts at 0 and counts up to, but not including, 20. So if you're used to another language where you explicitly say less than 20, like for (x = 0; x < 20; x++), that's what this is: arange just assumes x is less than 20 here. And if we want to create a very uniform set, you know, 0, 2, 4, 6, what happens if I want numbers from 0 to 10 but I need 20 increments in there? We can do that with linspace. So we create arr_l (I don't think we'll actually use any of these again, so I don't know why I'm creating unique identifiers for them) equals np.linspace, from 0 to 10; note that, unlike arange, linspace does include the endpoint, so it really does go up to 10. Then we say we want 20 increments. So maybe we have a data set over a certain time period, we need to divide that period into 20 points, and it happens to span 10 units. And here we go: you can see it has 20 pieces in it, but it's over 10 years, divided evenly, so it goes 0, 0.52, and so on; and remember the endpoint is included, so it goes all the way up to 10.
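Again reconstructed from the narration, this stretch of the demo in one cell:

    import numpy as np

    arr_s = np.array(["china", "india", "usa", "mexico"])
    print(arr_s, arr_s.dtype)        # dtype '<U6': unicode strings up to 6 characters

    arr_r = np.arange(0, 20, 2)      # start at 0, stop before 20, step by 2
    print(arr_r)                     # [ 0  2  4  6  8 10 12 14 16 18]

    arr_l = np.linspace(0, 10, 20)   # 20 evenly spaced points, endpoint included
    print(arr_l)                     # 0.0, 0.526..., up to 10.0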
Then we can also do random. There's np.random; if you're doing neural networks, usually you start by seeding them with random numbers. We'll just do np.random.random and call the result arr (we'll stop giving things unique names), print it out, and run it, and you can see we have random numbers. They are zero to one, so all these numbers are under one, and you can easily alter that by multiplying them out if you want, say, zero to a hundred; you can also round them to integers from zero to a hundred. There are all kinds of things you can do, but it generates random floats between zero and one. And you have a couple of options: you could reshape the result, or you can just generate the numbers in whatever shape you want. You can see here we did three and four, three rows by four values, which is the same thing as reshaping 12 values into three by four. And sometimes you might need an empty data set. I've had this come up many times, where I need to start off with zeros because I'm going to be adding stuff in, or I need zeros and ones: if you're removing the background of an image, you might want the background to be zero, then you figure out where the subject is and set all those boxes to one, and you've created a mask. Creating masks over images is really big, and you do that with a NumPy array of zeros. We can also give it a shape; we'll do this all in one shot this time, the same thing as before but with zeros, and in this case we'll do 2 comma 3. So when we run this (I forgot the extra parentheses around the shape; I knew I was forgetting something; there we go), you can see we have our ten zeros in a row, and then maybe this is a mask for an image, so it has two rows of three digits in it: a very small image, a tiny patch of pixels. And maybe you want to go the opposite way: instead of creating a mask of zeros and filling in with ones, maybe you want a mask of ones to fill in with zeros. We do just like we did before with the three comma four, and when we run this you'll see it's all ones. We could even do, say, a 10 by 10 icon, and then you have your three colors, which creates quite a large array for doing pictures and such when you add that third dimension in. If we take that off, it's a little easier to see: we'll do 10 by 10 again, and you can easily see how we have 10 rows of 10 ones. You can also do something like create an array of 0, 1, 2, and then, printing it right out, do a repeat of the array; maybe you need this array repeated three times. So there's our repeat of an array, repeated three times, and if we run this you'll see we have zero zero zero, one one one, two two two. And whenever I think of a repeat, I don't really think of repeating the first digit three times, then the second digit; I always think of it as zero one two, zero one two, zero one two.
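Here is that whole random/zeros/ones/repeat sequence as one reconstructed cell:

    import numpy as np

    arr = np.random.random((3, 4))   # uniform floats in [0, 1), shaped 3 x 4
    print(arr)

    print(np.zeros(10))              # ten zeros in a row
    print(np.zeros((2, 3)))          # a tiny all-zero mask
    print(np.ones((10, 10)))         # 10 x 10 grid of ones
    # np.ones((10, 10, 3)) would add a third dimension, e.g. three color channels

    print(np.repeat(np.arange(3), 3))  # [0 0 0 1 1 1 2 2 2]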
It catches me every time, but the actual command for that behavior is tile: if we tile a range of 3 three times and run it, you can see it generates 0 1 2 0 1 2 0 1 2. And if you're big on matrices, we can create an identity matrix too; we'll spell it out today. The command we're looking for is eye, spelled e-y-e, so np.eye(3), and we'll print it out. There we go, there's our identity matrix: it comes out as a three by three array with the ones down the middle, ready for your matrix math. We can manipulate that a little as well; when we're talking matrices, we might not want ones down the middle, in which case we supply our own diagonal. We can do np.diag and pass in the diagonal 1 2 3 4 5; when we run this, note that just putting the expression on the last line of a Jupyter cell is the same as wrapping it in print, or assigning it to a variable and printing that. You can see it generates a matrix with 1 2 3 4 5 down the diagonal, the beginning of a matrix setup for your matrix work. We can also go in reverse. Let's create an array with np.random.random, five by five; oops, it helps if I don't mistype the numbers, in this case I just need to take out the brackets, and there you go, your five by five array. Because we're working with matrices, we might want to extract the diagonal, which would be the 0.79, the 0.678, and so on; we simply type np.diagonal with our array in there, and this prints out, because it returns a value, and you can see the diagonal running across our matrix. We did talk about shape earlier; if you remember, you can print the shape out, and you can also do the dimensions: ndim is very similar to shape and comes out as two dimensions here. We can also look at the size, and if we run that, you can see it has a size of 25, two dimensions, and of course the 5 by 5 shape from earlier. And if you remember random from before, I talked a little about manipulating the zero-to-one values to get different ranges; you can also go straight for integers, with minus 10 to 10 and a count of 4, so we're going to generate four random integers between minus 10 and 10. Once we run that, we have 7, minus 3, minus 6, minus 3.
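Here's how those pieces might look together; a sketch with the same sizes as above:

```python
import numpy as np

print(np.tile(np.arange(3), 3))    # [0 1 2 0 1 2 0 1 2]

print(np.eye(3))                   # 3x3 identity matrix
print(np.diag([1, 2, 3, 4, 5]))    # 5x5 matrix with 1..5 on the diagonal

array = np.random.random((5, 5))
print(np.diagonal(array))          # pull the diagonal back out

print(array.ndim, array.size, array.shape)   # 2, 25, (5, 5)

# Four random integers drawn from [-10, 10) -- the high end is excluded
print(np.random.randint(-10, 10, 4))
```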
They're all between minus 10 and 10, and there are four of them. Now we jump into some of the functionality of arrays, which is really great, because this is where they shine. Here's your array, and you can add 10 to it: if I run this, it takes my original array from up here, oh, that's right, these are the random decimals I had stored, the random numbers from 0 to 1, and adds 10 to all of those values. We can just as easily do minus 10, or times two, or divide by two, which takes the random numbers we generated and cuts them in half, so now all the values are under 0.5; another way to shift the numbers into whatever range you need. As you dig deeper into numpy, we can also do exponentials: np.exp is the exponential function, which generates some interesting numbers off of the random values; this is where you get e to the x. And just like e to the x, you can also do the log: if you're doing logarithmic functions, say in reinforcement learning, you might have some kind of log setup, and you can see the natural log of these different array values. If you're working with log base 2, you can change it to np.log2; you have to look it up in the documentation, because that's not np.log of the number 2, it's an actual command, log2, and there are a number of these in there. You can also do np.log10, so here's log base 10.
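These operations broadcast element-wise across the whole array; a minimal sketch:

```python
import numpy as np

array = np.random.random(5)   # floats in [0, 1)

# Arithmetic applies to every element at once
print(array + 10)
print(array - 10)
print(array * 2)
print(array / 2)

# Element-wise math functions
print(np.exp(array))      # e**x
print(np.log(array))      # natural log
print(np.log2(array))     # log base 2 -- its own command, not np.log(2)
print(np.log10(array))    # log base 10
```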
Some other really cool functions you can use: there's sine, so we can take the sine of all the values in there, and if you have sine you of course have cosine; we can run that, and there's the cosine of those. There's a tangent too, and the related tanh activation comes up in neural networks, because it forms a nice curve between minus one and one; that's jumping a little into neural networks. Then, let me put the array back out so we can see it while we're doing this: you can also sum the values. We have np.sum, a summation of all the values in this array, and you can see that if you added all of these together they'd equal 12.519 and so on. One of the things you can also do is go by axis: if we run the summation with axis equals 0, you can think of that in numpy as going down each column, so we're summing these columns; change it to 1 and now we're summing the rows, so that's the summation of this row, and so forth going down. Maybe you don't need the summation; maybe what you're looking for is the minimum. This comes up a lot, like when you have errors and want to find the minimal error inside the array, and just like before we can do axis equals 0, and you can see 0.0645 is the smallest number in this first column, and so on. If you have a minimum, you might also want the max, maybe you're looking for the maximum profit, and here we go: 0.79 is the maximum in this first column. Just like before, you can change the axis to 1, or take the axis out entirely and find the max for the whole array, which here was 0.8344. And since we're talking data analytics, we want to look at the mean, pretty much the same as the average; this is the mean across the whole thing, and just like before we can do axis equals 0 to see the mean of each column. Along with the mean we might want the median, the middle value, and if we have the median we might want the standard deviation; a lot of times you report the mean together with the standard deviation. We can run that along the axes, or across the whole array. There's also variance, which is np.var, and there's our variance across the different levels. So we've looked at variance, standard deviation, median, and mean; there are more, but those are the most common ones used in data analytics when you're going through your data and figuring out what you're going to present to the shareholders.
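Here's that run of aggregations as code; a sketch over the same kind of random 5x5 array:

```python
import numpy as np

array = np.random.random((5, 5))

# Element-wise trig
print(np.sin(array))
print(np.cos(array))
print(np.tan(array))

print(np.sum(array))            # one grand total
print(np.sum(array, axis=0))    # one sum per column
print(np.sum(array, axis=1))    # one sum per row

print(np.min(array, axis=0))    # smallest value in each column
print(np.max(array))            # largest value in the whole array

print(np.mean(array))           # average
print(np.median(array))         # middle value
print(np.std(array, axis=0))    # standard deviation per column
print(np.var(array))            # variance
```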
Some other things we can do: we can take slices, you'll hear that terminology. We have a five by five array, but maybe we don't want the whole thing; maybe we want from row one on, leaving row zero out, so we get rows one up to four, and on the second axis we just want column two up to three. The notation [1:, 2:3] says row one to the end, and if we run this, you can see it generates those rows with a single column; remember, it doesn't include three, which is why we only get the one column, so if you wanted columns two and three, you'd go two to four, since it goes up to, but not including, four. We could also do this in reverse, just like we learned earlier: we can go minus one, oops, and [2:-1] is the same thing, because with indices zero one two three four, minus one is the last one, so it's the same as two to four. Also very common with arrays: you're going to want to sort them. We still have our randomly generated array up here, and we'll throw an axis back in, axis equals 1; if we run this, you can see it sorts each row, from the lowest value, the 0.2, up to the highest. We can change it to axis 0 if your values are organized by column, and of course you can sort the whole flattened array; you don't usually do that, but I guess it might come up, and you can see we get a nice sorted array. Now let's reprint our array so we can look at it again; it's starting to get to be too many output boxes up there. Something else you can do with an array is transpose it, and this comes up more than you would think: when you transpose, the rows and columns are swapped, so where 0.79, 0.57, 0.064 was a column, it's now a row. You can see this more dramatically if we take a slice of the first couple of columns with the full rows, and run the same slice up top so you can see the two next to each other: there's our slice, and where it had five rows and three columns, the transpose has three rows and five columns. The original version of this, when they first put numpy together, was the transpose function, and that still works; it generates the same value as just the capital T attribute. Many times we flip data like this because we have x-y values, or an image, being read one way by one process while the next process needs it the opposite way; this actually happens a lot, so you need to know how to transpose data really quickly. And instead of transposing, we might need to do something called flattening. Why would you flatten your data? If the array is going into a neural network, you might want to send it in as one long set of values instead of multiple rows, and you can see it flattens everything down into a single array. So we've covered transpose, flatten, and our scientific functions, means, medians, and some different variations.
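A quick sketch of slicing, sorting, transposing, and flattening on that array:

```python
import numpy as np

array = np.random.random((5, 5))

print(array[1:, 2:3])    # rows 1..4, column 2 only (the stop index is excluded)
print(array[1:, 2:4])    # columns 2 and 3
print(array[1:, 2:-1])   # same as 2:4 on a 5-column array, counting from the end

print(np.sort(array, axis=1))     # sort within each row
print(np.sort(array, axis=0))     # sort within each column
print(np.sort(array, axis=None))  # flatten, then sort everything

print(array.T)               # transpose as an attribute
print(np.transpose(array))   # the older function form, same result
print(array.flatten())       # collapse to a single one-dimensional array
```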
Some of the other things we want to do: what happens if we want to append to our array? Let's create a new array, since you're getting tired of looking at the same set of random numbers we generated earlier, something a little simpler so it's easier to see what we're doing: four five six seven eight, that's good enough. If we print this array, there it is, four five six seven eight, and we might want to append something to it, to extend it. You've got to be very careful about appending to arrays, for a number of reasons. One is run time: because of the way the numpy array is laid out, a lot of times you build your data first and then push it into the numpy array, instead of continually adding onto it. It also automatically generates a copy, which protects your data but adds overhead. So there are reasons to be careful about appending this way, but you can certainly do it: we take our array and create a new array, array1, by appending 8, and if we print array1, oops, array1, let's try that again, there we go, you'll see four five six seven eight, and then our extra 8 appended onto the end. If you're going to append something, you might instead want to insert, because it might be that you need to keep a certain order. We do the same thing with our array and insert one two three at position zero; print array2, run it, and you can see one two three inserted at the beginning. Insert is a lot more powerful in that you can put the values anywhere in the array: we can move them to position one, and there we go; we can do a minus one just for fun, which counts backward from the end; and if we do a minus zero and run it, it turns out minus zero puts it back at the beginning, because it just registers as zero, the minus sign drops off. And just like we add numbers, we might want to delete them, so let's do np.delete; we'll keep it easy to watch and create array3 as np.delete on array2, deleting index zero. If you look, array2 starts with 1, and when we delete index zero and print it out, we've deleted the one right out of there. We can also pass a list of positions, say 1 comma 3, and if we run that, you'll see we've deleted positions one and three, which removed our two and four. Keep in mind, when you're adding and deleting entries you have to be really careful: there's a timing element as far as where the data is coming from, and it's really easy to delete the wrong data and corrupt what you're working on, or insert things where you don't want them, so there's always a warning when we talk about manipulating numpy arrays. And just like anything else we're doing, we can create array_c, which equals a copy of the array we just created, array3. You can make a copy to protect your original data, or maybe you're making a mask: you copy the array, make all your alterations on the new one, changing the values to zeros and ones, and lay it as a mask over the first one.
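Those append/insert/delete/copy calls, sketched out; note that each returns a new array rather than changing the original in place:

```python
import numpy as np

array = np.array([4, 5, 6, 7, 8])

array1 = np.append(array, 8)             # [4 5 6 7 8 8]
array2 = np.insert(array1, 0, [1, 2, 3]) # 1,2,3 slotted in at the front
print(array2)                            # [1 2 3 4 5 6 7 8 8]

print(np.delete(array2, 0))          # drop index 0 (the leading 1)
array3 = np.delete(array2, [1, 3])   # drop indices 1 and 3 (the 2 and 4)
print(array3)

array_c = array3.copy()   # an independent copy, safe to alter as a mask
```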
And of course, if we print array_c, since it equals a copy of array3, it's the same thing: one three five six seven eight. Now we're getting into combining and splitting arrays. I end up doing a lot of this, and I don't know how many times I've ended up fiddling with it and making a mess, but you do it a lot: you combine your arrays, you split them, you might need one set of data for one thing and another set for something else. So let's create two arrays, array one and array two, and the terminology to look for is concatenate. We'll call the result arraycat, I like arraycat, our concatenated array, taking array one and array two. It's very important to really pay attention to your axes and your counts: I can't merge two arrays whose shapes don't line up; if their axes are mismatched and I'm merging on axis 0, it's going to give me an error and I'll have to reshape them, so you've got to make sure that whatever you're concatenating together actually fits. You can see here we have one two three four, one two three four, and then five six seven eight, five six seven eight, joined along axis 0; each row holds four values, so it comes out as rows of four. If we switch the axis to one, that flips it a little, so now we have one two three four five six seven eight together in each row. And it's interesting: if I change the shapes and concatenate, it runs and gives me an answer on axis 1, because the row counts still match, but if I switch to axis 0, where the lengths are now three and five, it gives me an error. So you've got to be really careful that whatever axis you're joining on, the shapes match. Like I said, on axis 1, going by row, it merges one array right onto the end of the other; you could imagine one array being your x values going in and the other the predicted y values coming out, and then you have another prediction you want to combine; this works really easily for that. Let me put this back to where we had it, oops, I forgot how many changes I made, and I messed up my concatenation order there; okay, so you can see we went through the different concatenations, and the axis is really important when you're doing them. We'll switch this back to one, just because I like the looks of that better; there we go, two rows. There are other commands in here too: we can do cat_v equals np.vstack, which is nothing more than concatenation, except you don't have to pass the axis, because the v stands for vertical. If we print cat_v and run it, we get the one two three four, one two three four rows stacked, the same as making the axis zero for a vertical stack. And if you have a vertical stack, you also have an hstack: change vstack to hstack, oops, here we go, and change cat_v to cat_h, and when I run it, it's the same as doing axis 1, a horizontal stack. The process is identical in the background; these are like a legacy setup, and most people just use concatenate with the axis spelled out, because it has a lot more clarity and is more commonly used nowadays.
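A sketch of those joins; I'm assuming two-row arrays here to match the values read out above, and the shapes have to agree on the axis you're not joining along:

```python
import numpy as np

a1 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])
a2 = np.array([[5, 6, 7, 8], [5, 6, 7, 8]])

print(np.concatenate((a1, a2), axis=0))  # stacked as rows: shape (4, 4)
print(np.concatenate((a1, a2), axis=1))  # joined side by side: shape (2, 8)

# Legacy shortcuts for the same two operations
print(np.vstack((a1, a2)))   # vertical   = axis 0
print(np.hstack((a1, a2)))   # horizontal = axis 1
```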
The last section in numpy we're going to cover is data exploration, and that will make a little more sense in just a moment; sometimes they call them set operations. Let's say we have an array, one two three four five six three, whatever it is, a nice little array, and what I want to do is find the unique values in it. Maybe I'm generating what they call a one-hot encoder, where each word is represented by a number, and I need to know how long my bit array is going to be; or maybe I want to know how many of each word is in there, if we're doing a word count, a very popular thing to do. You can see here, when we do np.unique, we have one two three four five six; those are our unique values. We can do more than just unique: we can get the unique values plus the counts of each one. This is very similar to what we just did, we're doing np.unique, but adding a little more into the arguments: return_counts equals True, so instead of just returning the unique values, we also want to know how many of each there are. We print our uniques and print our counts, and when we run that, you can see our unique values one two three four five six, just like before, and then two ones, two twos, two threes, two fours, one five, two sixes, and so on. You could go through and count them yourself if you wanted, but this is a quick way to find the distribution of your different values; you might want to know how often the word 'the' is used versus the word 'and', if each word is represented as a unique number. Along the same set lines, let me put a note up here: we're going to start looking at intersection, and we might also want to know the difference and the symmetric difference. What we're looking at now is where two arrays intersect: we have one two three four five and three four five six seven, and we might want to know what is common between the two arrays. For that we have np.intersect1d, a one-dimensional intersection, and we pass in array one and array two.
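In code, the distribution check and the set operations are each one-liners; a sketch with made-up word codes and the two arrays from above:

```python
import numpy as np

# Imagine each number is a word id in a word-count setup
words = np.array([1, 2, 3, 4, 5, 6, 3, 2, 1, 4, 6])
uniques, counts = np.unique(words, return_counts=True)
print(uniques, counts)    # distinct values, and how often each appears

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([3, 4, 5, 6, 7])

print(np.intersect1d(array1, array2))  # in both:        [3 4 5]
print(np.union1d(array1, array2))      # in either:      [1 2 3 4 5 6 7]
print(np.setdiff1d(array1, array2))    # in array1 only: [1 2]
print(np.setxor1d(array1, array2))     # in exactly one: [1 2 6 7]
```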
If we run this, we can see they intersect at 3 4 5; that's what they have in common. Because we're going to go through a couple of different options, let's change intersect1d and do the same thing with the related commands, printing each one. So we have the intersection, where they share values. Another one is union1d, where instead of the intersection we want all the values that appear in either array; when we run that, we get 1 2 3 4 5 6 7, all the different values in there. Two more to go: we want to know the set difference, and remember, set is what they call these things, so setdiff1d gives us what's in array one but not in array two, and when we run it you can see that one and two are only in the first array. And finally there's setxor1d, the symmetric difference: the values that appear in one array or the other, but not in both. So we have four different options: intersection, what do they both have in common; union, all the unique values across both arrays; set difference, what's in array one but not array two, with setdiff1d; and setxor1d, what's in exactly one of the two arrays. We dug a lot into numpy because there are a lot of different little mathematical things going on in it; a lot of this can also be done in pandas, although usually the heavy lifting is left to numpy, because that's what it's designed for. So let's open up another Python 3 notebook, because now we want to explore what happens when you want to display this data; this is where, in my opinion, it starts getting a little fun, because you're actually playing with it and you have something to show people. We'll rename this one pandas and pyplot, just so we can remember it next time, and we want to import the necessary libraries. We're going to import pandas as pd; remember, this gives us a data frame, so we're talking rows and columns, and you'll see how nicely pandas works when you're actually showing data to people. Then we have numpy in the background, since numpy works with pandas, so a lot of times you just import them together by default. Seaborn sits on top of the matplotlib library; it's one of the many packages that extend matplotlib, probably the most commonly used, because it has a lot of built-in functionality, so almost by default I just put seaborn in there in case I need it. And of course we have matplotlib's pyplot as plt. Note that we have pd, np, sns, and plt: those aliases are pretty much standard nowadays, so when you're doing your imports I'd keep them, just so other people can read your code and it makes sense to them. Then we have the strange line that says percent matplotlib inline; that is for Jupyter Notebook only. If you're running this in a different environment, you'll get a pop-up window when it goes to display the matplotlib output, and with the most current versions of Jupyter you can usually leave it out and it will still display right on the page as we go.
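The setup cell looks roughly like this; note the percent line is notebook-only magic, not plain Python:

```python
# Standard aliases for the python analytics stack
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter-only magic so plots render inline; omit it outside a notebook
%matplotlib inline

sns.set(color_codes=True)   # stick with seaborn's default color handling
```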
That last line is the seaborn sns.set, setting color_codes equal to true; we'll just keep the defaults so we don't have to think about it too much. And of course we have to run this cell: the reason is that these imports only take effect once they're executed, and if we don't run it and I access one of them afterward, it'll crash. The cool thing about Jupyter notebooks is that if you forgot to import one of these, or forgot to install it, because you do have to install these under your Anaconda setup or whatever setup you're in, you can flip over to Anaconda, run your install, and then just come back and run the cell; you don't have to close anything out. We'll paste this next one in real quick: car equals pd.read_csv, and then the actual path, which of course will vary depending on what you're working with; it's wherever you saved the file, and you can see mine is something like my OneDrive documents, Simplilearn, Python, data analytics using Python, car.csv, quite a long path. When we open that file up, what we get is a CSV file with the make, the model, the year, the engine fuel type, engine horsepower, cylinders, and so on. It's just a comma-separated file: each row is a row of data, think of it as a spreadsheet, each field is a column, and as you can see, it has make, model, and so on as a header row. Pandas does an excellent job of automatically pulling all of this in; when you start seeing the pandas output, you realize you're already halfway done with getting your data in, and I just love pandas for that reason. Numpy can load a CSV directly too, but we're working with pandas, and this is where it really gets cool: I can come down here, and remember our print statement? We can actually get rid of it and just do car.head, because Jupyter will print it out. head shows the top rows of the data file we just read in, and you can see it does a nice printout, all nicely inline because we're in Jupyter Notebook; I can scroll back and forth and look at the different data, and just like we expected, we have our columns and it brought the header right in. One thing to note is the index: it automatically created an index 0 1 2 3 4 and so on, and since we're just looking at the head, we get 0 through 4.
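Loading and peeking at the file is two lines; the path here is a placeholder for wherever you saved car.csv:

```python
# Point the path at your own copy of the file
car = pd.read_csv('car.csv')
car.head()   # first five rows, with the auto-created 0..4 index
```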
You can change that: you might want to just look at the top two, and we can run that, there are our top two BMWs. Another thing we can do is tail instead of head and look at the last three values in the data file, and you can see it numbered them all the way up to eleven thousand nine hundred thirteen; oh my goodness, they put a lot of data in this file, I didn't even look to see how big it was. So you can really easily view different parts of the data in here. When you're talking about big data, you almost never just print out the whole thing; in fact, let's see what happens when we do. If we just run car on its own, it's huge, so big that pandas automatically truncates it and just shows the head plus the tail. So we really don't want to look at the whole thing; I'll go back to the head for displaying our data. There we go, a quick look to see what's actually in there; I could zoom out so you get a better view, although we'll keep it zoomed in so you can see the code I'm working on. From the data standpoint, we of course want to look at the data types: what are we working with? Those nice, easy-to-read charts that look like a spreadsheet are what you show your shareholders, and when we talk about the data types, we're getting into the data science side of it. We have make and model, an int64 for the year, and engine fuel type as an object; if you go up here, you can see that most of those categorical columns, like a manual transmission or rear wheel drive, have a very limited number of values. Everything is going to be read as either a float64, an int64, or an object. The next thing you'll want to know is your columns, and since it loaded the header automatically, we have the make, the model, the year, the engine size, all the way up to the MSRP. And something you'll see come up a lot: whenever you're in pandas and you type .values, it converts from a pandas structure to a numpy array, and that's true of any of these, so you'll see a little switch in how the data is actually stored. In this case we want car.columns, the total list of your columns. And like any good data scientist, we want to start looking at an analytical summary of the data set, what's going on with our data, so we can start piecing it together: we do car.describe, with include equals 'all'. describe is a nice pandas command, and if you've worked with R, this should start looking familiar. We come down here and see the count for the make, the model, the year; how many unique values each one has; the top, most common, value of each and its frequency; the mean, though on the object columns it obviously can't tell you an average, while for the year it's the average year; and then your standard deviation, your minimum value, your maximum value, and the quartiles: where the 25 percent, 50 percent, and 75 percent lines fall on the way up to the max.
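Those inspection calls, collected; each is a single line in the notebook:

```python
car.head(2)     # just the top two rows
car.tail(3)     # the last three rows

car.dtypes      # float64 / int64 / object for each column
car.columns     # the full list of column names
car.columns.values   # .values hands back a numpy array instead

car.describe(include='all')   # count, unique, top, freq, mean, std,
                              # min, 25%/50%/75% quartiles, max
```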
Now this next part is just cool; this is what we always wanted computers to be back in the 90s. Instead of 5,000 lines of code to do this, well, maybe not 5,000; I built my own plot library back in '95, and the amount of code for a simple plot was probably about 100 lines. This is being done in one line of code: we have car, the pandas data frame we generated, and we call .hist for histogram, passing a figure size just so it fits nicely on the page. We do something that simple, and you can see it comes up using matplotlib underneath, subplots and everything, and we're looking at a histogram of every numeric column in our data. Engine cylinders is always a good one: you can see some had a null that came out as zero, maybe one of them had a two-cylinder engine from way back when, four is common, six a little less common, and then you see the eight-cylinder and twelve-cylinder engines, which have to be speedsters or something. It just breaks it down, so now you have how many cars have however many cylinders, horsepower, and so on, and it does a nice job displaying it; if you're going into a demo, it's really nice to be able to type that in and boom, there it is, all the way across. We might want to zero in with a box plot, and this time we'll call seaborn: sns.boxplot, with vehicle size versus engine horsepower as the x and y, and the data coming from car. If we run this, we end up with a nice box plot: you see midsize, compact, and large, you can see the variation, and there's an outlier showing up on the compact, which must be a high-end sports car; a large car might have a couple of engine outliers too, and then your deviation on each. It's a very powerful and quick way to zero in on one small piece of data and display it for people who need it reduced to something they can see, look at, and understand; that's our seaborn box plot, sns.boxplot. Then, if we back out and want a quick look at everything, there's what they call pair plotting. We can run that, and with seaborn it just does all the work for you; it takes a moment to pull the data in and compile it, but once it does, it creates a nice grid. If you look at this one cell, the small label says engine horsepower, plotted against the year it was built, and everything to the right of the middle diagonal is just the mirror of what's on the left. As you'd expect, engine horsepower gets bigger and bigger as time goes on: the later the year, the more likely you are to have a high-horsepower engine. So you can quickly look at trends with our pair plot, and look how fast that was; it took a moment to process, but right away I get a nice view of all this information, where I can visually see how things group. Now, if I was doing a meeting, I probably wouldn't show all the data; one of the things I've learned over the years is that people, myself included, love to show all our work.
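A sketch of those three plots; the column names here ('Vehicle Size', 'Engine HP') are as they appear in this particular csv, so adjust them to your own headers:

```python
# One line: a histogram for every numeric column
car.hist(figsize=(10, 10))
plt.show()

# Zero in on a single relationship as a box plot
sns.boxplot(x='Vehicle Size', y='Engine HP', data=car)
plt.show()

# Pairwise scatter grid of every numeric column against every other
sns.pairplot(car)
plt.show()
```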
We're taught in school to show all your work, prove what you know, but the CEO doesn't want to see a huge grid of graphs, I guarantee it. So what we want to do is drop the stuff we might not be interested in. I'm not really a car person, though the guy in the back obviously is, so we're going to drop engine fuel type, market category, vehicle style, popularity, number of doors, and vehicle size, and we have the axis in here: if you remember from numpy, we have to include the axis to make it clear what we're working on, and that's also true with pandas. Then we'll look at the head, and you can see we dropped those categories; now we have the make, model, year, and so forth, without the engine fuel type, market category, etc. This should look familiar by now; when you start working with pandas, I just love it for this reason: look how easy it is, it just displays everything as a nice spreadsheet you can look at and view very easily. It's also the same kind of view you'll get if you're working in Spark or PySpark, which is Python for Spark across big data; this is the kind of thing they come up with, and it's why pandas is so powerful. We might look at this and decide we don't like the column names, and we can rename those too. It's a simple command: car equals car.rename, columns equals a mapping like engine horsepower to horsepower; that's just your standard Python dictionary, mapping old names to new. Instead of the lengthy 'engine horsepower' we just want 'horsepower', we don't need to be told it's the engine's horsepower; and engine cylinders becomes just cylinders, since if we're talking about cars, there's only one thing with cylinders. We run this, show car.head again, and you can see how it changed: model, year, and horsepower and cylinders, versus model, year, engine horsepower, engine cylinders. Again, we want to keep reducing this so it's more and more readable; the more readable you get it, the better. We can also adjust the zoom a little so that it prints on a single line instead of splitting across two; that's just the control-plus zoom you use in Chrome, a Chrome command. And if you remember shape from numpy, pandas works the same way: we can look at the shape of the data, and we now have 11,914 rows and 10 columns, so you'll see some similarities, because pandas is built on numpy. Questions come up just like they did in numpy: we might want to know about duplicate rows, so we do car, and look at this switch here, we're doing a pandas selection with the brackets, but selecting based on car.duplicated, so how many duplicates are in there. We're starting to access the data a little differently; the selection can be a logical statement, and we get the duplicate rows: 989 rows by 10 columns. This is one of those troubleshooting things we end up doing a lot more than we really feel like we should. We might do a car.count just to see how many rows we're dealing with, and right after that we'll want to say, hey, let's drop those duplicates.
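Here's that tidy-up pass as code; again, the long column names are how they appear in this csv:

```python
# Trim columns the audience doesn't need (axis=1 means columns, as in numpy)
car = car.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style',
                'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)

# Shorter, more readable names via a plain python dict
car = car.rename(columns={'Engine HP': 'HP',
                          'Engine Cylinders': 'Cylinders'})

print(car.shape)         # (rows, columns), e.g. (11914, 10)
car[car.duplicated()]    # boolean selection: just the duplicated rows
```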
So car equals car.drop_duplicates, and then we can print the head again, just car.head, and the data looks the same as before. Note that we did car equals car.drop_duplicates: there are commands where you can change the value in place, and that works on some and not others depending on what you're doing, but by default pandas returns a copy, so when we do this, we reassign it to car. It's the same header, but let's run count and see how the count changes: instead of 11,914 we have 10,925, so we've removed the 989 duplicated rows, just under a thousand cars. Then, as we're prepping our data, we might want to know where car is null, counting the null values and summing them up: when we run car.isnull().sum(), we find HP, the horsepower, has 69 missing values and cylinders has 30. If you don't put the sum on the end, it just returns a mask of true and false, is it null or not, as ones and zeros, so the sum adds up the ones under each column. Then you have to decide what to do with the null values, and there are a lot of options: it might be that you need to put in the mean, or maybe the median; there are a lot of ways to fill them. Usually, when you're first starting out with the data, you just drop your null values, so we do car.dropna and count again, and you can see we've dropped about another hundred rows, from 10,925 to 10,827. So we've cleaned that, and this is really a big part of cleaning data: you need to know how to find your null values, or at least count them, and decide what to do with them. And if we go back to counting our null values, there we go, zero null values now. I don't know how many times I've been running a model that doesn't take null values and it crashes, and I just sit there looking at it, wondering why it crashed when it should have worked; it's because I forgot to remove the null values. We've been jumping around a lot, so let's go back to finding outliers and bring seaborn back in. If you remember, we did a box plot earlier; this time we'll do a box plot on just the price. You can see our price values, the deviation with the two thinner bars on each side of the main box, and then as we get up here, all these outliers; in fact, there's one way out there that's probably a really expensive, high-end car. If you were doing fraud analysis, you'd be jumping all over these outliers: why do these deviate from the standard, what are these people doing? Here, like I said, it's probably just a really high-end, expensive car. We can also look at the box plot for the horsepower, put that in down here and run it, and again, there's our horsepower, and there are these really odd, huge muscle cars out there as outliers.
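The cleaning and outlier steps, sketched; the dropna call here uses the pandas default of dropping any row with a missing value, and 'MSRP'/'HP' are assumed to be the price and horsepower columns in this csv:

```python
car = car.drop_duplicates()   # returns a de-duplicated copy, so reassign it
print(car.count())            # row counts shrink once duplicates are gone

print(car.isnull().sum())     # nulls per column (e.g. HP: 69, Cylinders: 30)
car = car.dropna()            # simplest cleanup: drop rows with missing values
print(car.isnull().sum())     # should be all zeros now

# Box plots make the outliers jump out
sns.boxplot(x=car['MSRP'])
plt.show()
sns.boxplot(x=car['HP'])
plt.show()
```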
Now we're going to make this a little more presentable, since you're starting to display your data, your information, to your shareholders: we're going to plot a histogram of the number of cars per brand. The first thing we do with our car data frame, going back over here, is take the make, get its value counts, take the n largest, and plot it with kind equals bar and a fig size of 10 by 5. Right off the bat, we see Chevrolet up top; what we're graphing is the value counts, the number of entries per make, largest first. Chevrolet puts out a lot of different kinds of cars; I didn't realize they made that many types. Then, for readability, let's add a title, number of cars by make; if you'd looked at this the first time, you would have been like, what the heck am I looking at? Well, we're looking at the number of cars by make, the different types each maker puts out; Lotus, I guess, only had a few different kinds of very high-end cars over there. Then, doing data analytics, and as a data scientist, one of the things I'm most interested in is the relationship between the variables, so this is always a place to start: we want to know what's going on with our variables and how they connect with each other. The first thing we do is set a figure size, plt.figure with figsize 20 by 10, to make sure our graph fits. If you've never used the matplotlib library, which sits behind seaborn, whatever is in plt is like a canvas you're painting on: the second you load pyplot as plt, anything you do to it affects everything on it. Then we create a variable c, for correlations, and car.corr; that's the correlation method in pandas, again one line and you get the whole correlation matrix. And because we're working with seaborn, let's put it into a nice heat map; if you're not familiar with heat maps, that means we're using color as part of the display, so we have a nice visual. You can see that seaborn, connected to pandas, prints out a nice chart, and this is the chart I look at as a data scientist; these are the numbers I want to look at. We'll highlight one of them: here's cylinders versus horsepower. The closer to one, the higher the correlation, so 0.788 is a pretty high correlation between the number of cylinders and how heavy the horsepower is. If you look at year versus horsepower, it's 0.314, not as much, but I'm betting that if you combined them, not by literally adding, you'd start to see the increase in horsepower per year and cylinders and could probably get a correlation there. And just as 0.78 is a positive correlation, look at horsepower, or better, cylinders, against the mileage: cylinders to miles per gallon is minus 0.6, a negative correlation, and the closer to minus 1, the stronger the negative correlation. Then the chart you'd actually show people is the nice heat map: it's just those numbers put into color, and the darker or stronger the color, the higher the correlation. Straight down the middle diagonal everything is a one: the year obviously correlates exactly with the year, horsepower with horsepower, and so on, and the closer to one, the higher the correlation between the two pieces of data.
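The bar chart and heat map, sketched; the cutoff passed to nlargest and the annot flag are choices of mine rather than anything fixed, and newer pandas versions may want numeric_only=True on corr:

```python
# Bar chart of how many models each make has in the data
car['Make'].value_counts().nlargest(40).plot(kind='bar', figsize=(10, 5))
plt.title('Number of cars by make')
plt.show()

# Correlation matrix of the numeric columns, rendered as a heat map
plt.figure(figsize=(20, 10))
c = car.corr()                # or car.corr(numeric_only=True) on newer pandas
sns.heatmap(c, annot=True)    # annot writes values like 0.788 into each cell
plt.show()
```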
Now, this was a good introduction, but pandas goes way beyond it: most of the functionality in numpy is also in pandas, since pandas sits on it, and it even has additional features of its own. And we used seaborn pretty extensively, sitting on top of our pyplot; keep in mind that pyplot has a ton of other features we didn't even touch on here, and couldn't, even in a whole course on it; there are just so many things hidden in there depending on the domain you're working in. But you can see here our seaborn and our matplotlib, that's all the graphics we did, and the seaborn works really nicely with the pandas; we really like that. So that wraps up our demo part for today. Now let's learn about data manipulation in R, and here we will learn about the dplyr package. When we talk about the dplyr package, it is much faster and much easier to read than base R. dplyr is used to transform and summarize tabular data with rows and columns: you might be working on a data frame, or you might take an inbuilt R data set, which can then be converted into a data frame. We get the package by calling it in with the library function, and it can be used for grouping data, summarizing it, adding new variables, selecting particular columns, filtering data sets, sorting, arranging, and mutating, that is, creating new columns by applying functions to existing variables. So let's see how we work with dplyr. First I can get the package: I just say install.packages, and we already see dplyr showing up, so I'll select it, do a control-enter, and that sets up the package; package dplyr successfully unpacked, so that is done. Now you can start using the package by just doing library(dplyr), and it shows the version of R it was built under. Let's also use an inbuilt data set, nycflights13, so we do install.packages for that, which searches out and gets the relevant data set, and we again call it in with the library function. Once that's done, we can look at some sample data by doing View(flights), which shows the data in a neat, tabular format: year, month, day, departure time, scheduled departure time, and so on. We can also do a head to look at some initial data, which helps us understand the data better: what is this data about, how many columns do we have, what are the data types or object types; it shows how many variables we have. So this is fine; now we can start using dplyr, and in it we can use, say, the filter function if we want to look for a specific value. Here we have the column month, so I'll do a filter: I create a variable f1, use the filter function on flights, which we already have, and look at rows where the month value is 7. And with this one you can do a View on f1, which shows you the data where everything has been filtered down to the month being 7.
So that's a simple usage of filter. We can take another example: we may want to include multiple columns, so we can say f2 equals filter on flights, where month equals 7 and day equals 3, and then look at the value of f2 if you're interested in seeing it; that tells you the month is 7 and the day is 3, and you can also look at it in a more readable format by using View on f2, which gives the selected result. So we're just extracting specific values, and we can keep extending this. Instead of creating a variable and then doing a View, I can pass the filter straight into the View: within the View I'm saying filter the flights where month is 9, day is 2, and origin is LGA, and that shows the result; you can obviously scroll and look at all the columns, and if you check the origin column, it shows the selected value. So now we have filtered our data based on values in three different columns. We can also use the and and or operators, so I could have done this a little differently: I could say head, which shows the initial result, and within my head function pass in flights, where I pick the month column using the dollar symbol, flights$month, give it a value, then say and, flights$day being 2, and remember, when you talk about and, it checks that all the conditions are true, then flights$origin being LGA, and look at the value. In this way I can filter on multiple values by specifying the columns explicitly. We could also have done it by assigning this to a variable and then doing a View on that, selecting on month, day, and origin; being more specific in naming all the columns makes the code more readable. So let's look at the values: the head shows the rows matched on month and day, and you can look across for the other variables, that is, origin being LGA. Now, we can also do some slicing here, to select rows by a particular position: I can say slice, and I want rows one to five, and you can always assign it or look at the View. When I did a slice of 1 to 5, it shows the entries for one to five; similarly we can slice 5 to 10, and now we're looking at rows five to ten. You can always look at the complete data and then slice out the particular part you need. Next, mutate is the function usually used when you want to apply some computation to a particular data set and add the result to your existing data frame as a new column; that's where you use mutate, which is mainly used to add new variables. So let's see how we work with mutate; it's pretty simple. I create a variable, overall_delay, and do a mutate so that it adds a new column: I select my data, which is flights, call the new column overall delay, and compute it as arrival delay minus departure delay. Let's create this and look at the View of it, which shows me, or should show me, my new column, overall delay,
which was not in my original data set. You can do a head on this one at any time to compare the values: it shows arrival delay and many other variables, and you can also do a View on plain flights to compare; flights won't have any overall delay column, it shows only the 19 columns we see here, whereas if you do a View on overall_delay, it shows 20 columns, so we know the new column has been added. If you want to work with the 20 columns, you use overall_delay; if you want your original data set, you use flights. You can also use the transmute function, which is used to show only the new column: we take overall_delay again, but this time we say transmute, flights, overall delay, and the computation stays the same, but now if I View overall_delay, it shows only the new column. Sometimes we may want to compute a result based on two variables, or two columns, just look at the new value, and then decide whether to add it to our existing structure. You can also use summarize, which basically helps us get a summary based on certain criteria: we take our data and say on what basis we want to summarize it. So we do a summarize on flights, call it average air time, and calculate an average; for that I'm using the inbuilt function called mean on the air time column. Let's look at flights once again: here we see there is arrival time, not air time, sorry; let me check, and yes, we do have an air time column, so we actually want to summarize on the air time, not the arrival time. Air time is how much time this particular flight spends in the air. And we want the summarize function here, not transmute: summarize flights, average air time, calculated as the mean of air time, and I'll also add the NA removal, na.rm equals TRUE. Let's run this, and it shows the average air time is 151. I can also do a total air time, where I do a summation of the values, or get the standard deviation, or get multiple values at once: the mean, a total air time doing a summation, and, if you want to put the standard deviation in there, that too. Let's look at the result of this summarize; it allows me to get useful information summarized by a particular function, such as mean, sum, standard deviation, or all three of them. Now let's look at grouping: sometimes we're interested in summarizing the data by groups, and that's where we use the group_by function. Here we're taking a different data set, so let's look at, for example, head of mtcars, the built-in cars data set. That shows me the model of each car along with the mileage, cylinders, displacement, horsepower, and
various other characteristics, or variables, in this particular data set. Here we can say, let's do a grouping by gear; there's a column called gear, so I'll call the result by_gear. I take my data set, and what you see here with the percentage and greater-than symbols, %>%, is called piping: it feeds your previous data frame into the next function, which is often useful, and you can type it by just hitting control-shift-M. So we're going to use piping: I'm saying mtcars, my original data set, where I did a head, or I could have done a View on it to see it in a more readable format. We're using this different data set, and I want to group it by the gear column, so I call it by_gear, and it takes my data, mtcars, pipes it in, and groups the data based on the gear column; that's done. Now let's look at the value of by_gear, or you can always do a View; remember, whenever you do a group by, it gives you an internal object where your data is grouped based on a particular column, and the View shows your data grouped that way. Now I can use the summarize function on the grouped data, where I want to work on the new object that was grouped by gear: I do a summarize and say gear1, which will hold the summation of the gear column, and gear2, which is the mean; well, you could give these more meaningful names. Let's look at the value, where we're now seeing the sum and mean values per gear group. Similarly, we can look at a different example: I say by_gear, again using piping, but earlier we took mtcars, grouped it, and called it by_gear; now, within this grouped data, I take the data set, use the piping, and summarize it, saying I want the sum or the mean, and then look at the values. So you're either working from your original data set, or from the data that was already grouped. Now we can also group by cylinder: maybe you're interested in looking at the data summarized based on the cylinder column, and you can do that; then for by_cylinder I do a piping into the summarize function, and the summarizing is done based on the mean values of the gear column, or the horsepower. Let's do this, and then you can look at the value; at any point you may want to look at the data set again, so just go ahead and check what by_cylinder or by_gear contains, do a head, and it gives you the values. So you can do your summarizing and grouping in these ways. Now we're going to use the sample_n function, and sample_frac, for creating samples. For this, let's take the flights data set again, and we want 15 random values: that's done, and it shows me 15 rows with random values from the data. What you can also do is take a portion of the data using sample_frac: here I'll say flights and 0.4, which returns 40 percent of the total data, and this
This can be useful when you are building machine learning models and want to split your data into training and test sets, or when you are simply interested in some portion of the data, so it's a very useful function, and you can then look at the resulting value. Next, just as we were grouping or pulling out particular columns, we can use the arrange function, which is a more convenient way of sorting than base R's sorting. For arrange, let's do a View: we'll work on the flights data set and arrange it by year and departure time, and the View shows me the data sorted on those two columns; a head() gives me a highlight of that data. The piping operator we've been using can also be combined with nesting. First I'll say df and simply assign the mtcars data set to it — df now holds the different car models, and you can do a head() or View() to inspect it. Now let's try the nesting option, which can be useful; we create a variable called result. The outermost call is arrange, because I want to sort the data — but what data am I sorting? The data returned by sample_n, which gives me some random sample — and what is that sample drawn from? Instead of passing the raw data as we did earlier, I nest a filter inside: the filter works on df and keeps rows where the mileage (mpg) is greater than 20, I say size is 5, and I want the result arranged in descending order, so I use desc() on the mpg column (by default arrange sorts ascending). Let's get the result, which shows the mileage details in descending order. Do a View or a head on result: you see mpg with the highest value on top, and only five rows, because we asked for a random sample of five. So in one expression we used an inbuilt function, filtered rows with a condition (mpg greater than 20), took five random samples, and sorted them in descending order on a particular column. Now let's look at the same thing as a multi-step assignment: a is a filter on df keeping mpg greater than 20; b takes a sample of 5 random rows from a; and then I create a result variable which arranges b, the sample data, in descending order. Looking at the result, it shows exactly what we were seeing earlier. So you can do a multi-assignment, where you create a variable, take a sample out of it, and then arrange — that is, sort — the outcome in descending (or the default ascending) order.
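To make the two styles concrete, here is a rough sketch of the nested call and the step-by-step multi-assignment, both filtering mtcars on mpg > 20, sampling five rows, and sorting descending (the variable names a, b, and result follow the walkthrough):

library(dplyr)

# nested form: filter, then sample, then arrange, in one expression
result <- arrange(sample_n(filter(mtcars, mpg > 20), size = 5), desc(mpg))
head(result)

# multi-assignment form: the same steps, one variable at a time
a <- filter(mtcars, mpg > 20)    # keep rows with mileage above 20
b <- sample_n(a, 5)              # five random rows from a
result <- arrange(b, desc(mpg))  # sort descending on mpg
head(result)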
Whatever that result is, you can arrange it — sort it — in descending or, by default, ascending order. We can do the same thing using the pipe operator: here I say result, pass in my df, and pipe, which tells R what to do on this data set — I filter the data on mileage greater than 20, forward that to get the random sample of five, and whatever that random sample is gets pushed on to be arranged in descending order. That's one more way of doing it, and then you can look at the result. So these are some simple examples where you use dplyr with multiple assignments, or with nesting, to filter out data. You can also arrange, that is, sort, the data; draw random samples; summarize the data, including based on one, two, or multiple columns; and use inbuilt functions to summarize based on functions applied to the variables or columns. You can transmute when you are only interested in the computed column, mutate when you want to add a new column, slice the data, and give conditions with "and"/"or" to filter it. Now, on this data set df, if I just print df it shows my full data, and if you are interested only in particular columns, dplyr also lets you do a select rather than a filter. For selecting we choose our columns — say I'm interested in mileage (mpg), horsepower (hp), and cylinders (cyl) — so I create a new name along those lines, press Ctrl+Shift+M for the pipe, and do a select choosing those columns, which gives me a new data frame. (At first I forgot to pass in df — that's where you pass in your data — but once that's fixed it runs.) Now I can print or do a head() on the new data frame and look at the selected result, so you can be looking at selective columns. I could have used filter, but filter always looks for a condition — say mpg greater than 20, or more than 4 cylinders — whereas select picks specific columns. View always gives you all the columns and head gives you a highlight, but select is useful when you are interested in only specific data. So this is how you use dplyr for manipulation, for data transformation, and for filtering or selecting particular data and then working on it.
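Here is the same pipeline written with the pipe operator, plus a select() example along the lines described (the data-frame name df_mpg_hp_cyl is just an illustrative label):

library(dplyr)

# pipe form of the filter -> sample -> arrange chain
result <- mtcars %>%
  filter(mpg > 20) %>%
  sample_n(5) %>%
  arrange(desc(mpg))

# select() picks specific columns, with no condition involved
df_mpg_hp_cyl <- mtcars %>% select(mpg, hp, cyl)
head(df_mpg_hp_cyl)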
Similarly, there is one more package called tidyr, so let's see how data manipulation is done using the tidyr package. tidyr makes it easy to tidy your data — it helps you create cleaner data that is easier to visualize and model — and it comes with four main functions. You have gather, which makes wide data longer; it is basically used to stack up multiple columns. You have spread, which makes long data wider: if you want to unstack data that shares the same attributes, spread can spread it across multiple columns. You have separate, a function which splits a single column into multiple columns, and to complement that you have unite, which combines multiple columns into a single column. Those are the four main functions in the tidyr package, so let's see how we work with them. Let me bring up my RStudio, and first let me clean up my screen with Ctrl+L. I will install the package — it is already installed, but we can just do Ctrl+Enter; it asks "Do you want to restart R prior to install?", I say OK, and it fetches the package. It says package 'tidyr' has been successfully unpacked, so let's load it with the library function; it was built under R version 3.6. Now I can start using these functions. First we create a data frame: let's say n is 10, and we'll call the variable wide. I use the data.frame function, saying id = 1:n, which takes the values 1 to 10, and then three vectors of ten entries each — face.1, face.2, and face.3. That's done, and we can look at the data frame with View(wide): it shows the id column plus face.1, face.2, and face.3. Now we can use the gather function, reshaping the data from wide format to long format — stacking up multiple columns. I'll call the result long, work on wide, use the piping functionality, and call gather: I give the key column name, face, then response_time as the value column, and then the columns I want to stack, from face.1 to face.3. Once this is done, look at the variable long: it has an id column, the response_time column, and the face column we mentioned, with all the values stacked in — face.1, face.2, and face.3 one below another — so all the columns have been stacked and I now have 30 entries in total. That is the gather function. Sometimes we may want the separate function, which splits a single column into multiple columns; you would use it when multiple variables are captured in a single column. Let's look at an example: I'll call it long_separate, work on long, which has all the data stacked in, then say separate, name the face column, and give the new column names for the split. I could also pass a separator by adding sep = and mentioning the separator, if that is required. Let's run this.
Once that is done, let's have a look at long_separate. The face column, which we wanted to split into target and number, has indeed been separated — you see face split into target and number, alongside the response time. That is how you use the separate function. There is also the unite function, which is basically the complement of separate: it takes multiple columns and combines their elements into a single column. For example, we'll call it long_unite and take long_separate, whose data we had just separated, and unite the target and number columns back into face, with a separator between them. Let's run this and look at the result of the unite: you see face and target merged together — face.1, with the dot separator as we specified — so we have united multiple columns. That's one more tidyr function that helps you tidy up your data or put it in a particular shape. Then you have the spread function, which is basically for unstacking: if you want to unstack data with the same attributes, spread can spread it across multiple columns. It takes two columns, a key and a value, and spreads them into multiple columns, making long data wider. We take long_unite, pipe it, and use the spread function on the face column and response_time. Run it, do a View, and it tells me our data is back in the shape it had at the beginning. So these are four functions which are very helpful when we work with the tidyr package.
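Putting the four verbs together, here is a compact sketch of the round trip described above; the exact cell values in the video differ, so the face.1..face.3 columns are filled with illustrative random numbers:

library(tidyr)

n <- 10
wide <- data.frame(id = 1:n,
                   face.1 = rnorm(n),   # illustrative values
                   face.2 = rnorm(n),
                   face.3 = rnorm(n))

long         <- wide %>% gather(face, response_time, face.1:face.3)  # wide -> long
long_sep     <- long %>% separate(face, c("target", "number"))       # split one column
long_unite   <- long_sep %>% unite(face, target, number, sep = ".")  # combine back
back_to_wide <- long_unite %>% spread(face, response_time)           # long -> wide
head(back_to_wide)   # same shape as the original wide data frame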
Now let's learn about visualization, and here we'll see how R can be used for it. One thing to understand is that because our ability to see patterns is highly developed, we understand data better when we can visualize it; the effective way to understand what is in our data, or to share what we have understood, is through graphical displays — data visualization. There are actually two types of data visualization: exploratory, which helps us understand the data, and explanatory, which helps us share our understanding with others. R provides various tools and packages to create data visualizations, and they can be used for both kinds of analysis. For exploratory visualization the key is to keep all the potentially relevant details together; the objective is to help you see what is in your data, and the main question is how much detail we can interpret. Among the functions we'll see: plot, for generic plotting; barplot, to plot data using rectangular bars, that is, bar charts; hist, to create histograms, where you look at the frequency and central tendency of the data; boxplot, to represent data in the form of quartiles; ggplot2, a package that lets the user create sophisticated visualizations with very little code using the grammar of graphics; and plotly, which creates interactive web-based graphs via the open-source JavaScript graphing library. Before we see some examples, let's also understand what kinds of plots and techniques we have. Let me open up my RStudio; I can look at all the panes, which show me the information, then load the inbuilt datasets package and simply do a plot on the ChickWeight data set. What does that show? It summarizes the relationship between the four variables in the ChickWeight data frame, which lives in R's built-in datasets package. From these plots we can see, for example, that weight varies systematically over time, and that chicks were assigned to four different diets. Explanatory data visualization, on the other hand, shows others what we found in the data, which means we need to make some editorial decisions: which features do we want to highlight for emphasis, and which features are distracting or confusing and should be eliminated? There are different ways of doing this. For graphics systems in R you have, I would say, three or four different types. There is base graphics, which is the easiest to learn: as an example, I can load a data set using library and then use the plot function to generate a simple scatter plot of calories against sugars from the UScereal data frame in the MASS package, and give it a title. You also have grid graphics, a powerful set of modules for building other tools; lattice graphics, a general-purpose system built on grid; and ggplot2, which implements the grammar of graphics and is also based on grid. Since I already have the library loaded and the data set available, I can assign the sugar values to x and the calorie values to y, then call library(grid). Now I can use functions such as pushViewport if I want to recreate with grid graphics the kind of plot we made with base graphics — grid gives you much more power than base graphics, with a steeper learning curve, but it is often useful. So I say pushViewport, then ask for a dataViewport, and use the different functions of the grid package — a rectangle, an x-axis, a y-axis, the points — and then add details to the graph by giving names to the axes, creating a simple grid-graphics-based plot.
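For reference, here is a hedged sketch of both systems on the UScereal data from MASS — the base-graphics one-liner, and a grid-graphics version built from viewports and primitives, roughly as outlined above:

library(MASS)   # provides the UScereal data frame
library(grid)

# base graphics: scatter plot with a title
plot(UScereal$sugars, UScereal$calories,
     xlab = "Sugars", ylab = "Calories",
     main = "Calories vs Sugars")

# grid graphics: the same data via viewports and low-level primitives
x <- UScereal$sugars
y <- UScereal$calories
pushViewport(plotViewport())       # a viewport with standard plot margins
pushViewport(dataViewport(x, y))   # a viewport scaled to the data
grid.rect()                        # the plot frame
grid.xaxis()
grid.yaxis()
grid.points(x, y)
popViewport(2)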
There are various other options we can use to create plots, but before we go into how, let me give you a brief on the different kinds of plots and how they are used. For example, we have the bar chart, a graph which shows comparisons across discrete categories: the x-axis shows the categories being compared, the y-axis represents a measured value, and the heights of the bars are proportional to the measured values. To create different kinds of charts you can use ggplot2, a package for creating graphs in R. It is basically a method of thinking about and decomposing complex graphs into logical subunits, and it is part of the tidyverse ecosystem. It takes each component of a graph — the axes, the scales, the colors, the objects — and lets you build graphs on particular data and modify each of those components in a way that's more flexible and user-friendly; if you don't provide details for a component, ggplot uses sensible defaults, and this makes it a powerful and flexible tool. With ggplot you use geoms, or geometric objects, which form the basis of the different types of graphs: geom_bar for bar charts, geom_line for line graphs, geom_point for scatter plots, geom_boxplot for box plots, geom_quantile for continuous x, geom_violin for a richer display of distributions, and geom_jitter for small data. Here is a simple example — I won't go into too many details — where we use the library function to get the ggplot2 package, look at the mpg data and its structure, load the tidyverse package, and finally create a bar chart using geom_bar, mentioning what goes on the x-axis. You can give different colors to add more meaning to your data, and you can also go for stacked bar charts: here we are telling ggplot to map the data in the drv column to the fill aesthetic, so I give the aesthetic x as class, say what data we need, and use geom_bar. You can also have dodged bars in ggplot — bars next to each other rather than stacked — by setting the position to position_dodge. You can of course use the different inbuilt data sets and create your bar charts, and there are other kinds of graphs too. The line graph is a type of graph that displays information as a series of data points connected by straight line segments, created with geom_line. The scatter plot is a two-dimensional data visualization that uses points to graph the values of two different variables, one on the x-axis and one on the y-axis — like what we saw in the base-graphics example — and it is mainly used to assess the relationship, or lack of relationship, between two variables. And the histogram, as I mentioned, is mainly for looking at the distribution and central tendency of the data: with a large amount of data for a single variable, you want to know where more data is found in terms of frequency and where less, and how close the data sits to its mid point — the mean, median, and mode — so you use a histogram, categorizing the data into what we call bins. Those are some basics on the different kinds of graphs; as sketched below, now let's look at some examples and see how they work.
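The following ggplot2 sketches correspond to the chart types just listed, using the built-in mpg data (class on the x-axis and drv as the fill, as in the description):

library(ggplot2)

ggplot(mpg, aes(x = class)) + geom_bar()              # simple bar chart
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()  # stacked bars
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge())               # dodged (side-by-side) bars
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()   # scatter plot
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 15) # histogram with 15 bins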
What we just saw were quick examples of base and grid graphics; now let's do an example of a pie chart for different products and their units sold. To create a graph for this, first let's create a vector and pass in the values. I can also create labels that I want to assign to these values, and then I can plot the chart by calling pie — that's the kind of chart I want to create — passing the data x and the labels. Let's run this, and it shows me a simple pie chart. I can also give it a main title: instead of just pie(x, labels) I can say what the main title is and what coloring it should follow. That's how you create a simple plot. I can also compute the percentages and then plot the pie chart with x and with labels set to the percentages, which we calculate with the round function; then you can give more details to your graph — what colors it follows, where the legend should sit in your chart, what the values are — and fill in the colors. Running this one shows the calculated percentages and the details, and we can always have a look at our plot. If you want a 3D pie chart, get the plotrix package, load it with the library function, pass some data into x, give some labels that make the data more meaningful, and plot the 3D graph: I say pie3D, using x and labels, then pass explode, which controls how the chart looks, and give the values; it also takes a title via main — "Pie chart of countries". Now let's create data for another graph: again we create a vector with the c function, and then we build a histogram where I give xlab, the label around the x-axis, the color, and the border. A histogram, as discussed earlier, always shows your values on the x-axis with frequency on the y-axis, so you can see each set of values and its frequency, and we can use this histogram for exploratory data analysis — looking at the data and trying to understand the central tendency of the values. We can also give limits using xlim and ylim and specify them explicitly: here we have said the x limit is 0 to 40 and the y limit is 0 to 5.
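As a sketch of these steps — the exact products, counts, and colors in the video differ, so these values are illustrative:

library(plotrix)   # for pie3D

x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

pct <- round(100 * x / sum(x))                      # percentage per slice
pie(x, labels = paste0(pct, "%"),
    main = "City pie chart", col = rainbow(length(x)))
legend("topright", labels, fill = rainbow(length(x)))

pie3D(x, labels = labels, explode = 0.1,            # explode pushes slices apart
      main = "Pie chart of countries")

v <- c(9, 13, 21, 8, 36, 22, 12, 35, 31, 33, 19)
hist(v, xlab = "Weight", col = "yellow", border = "blue",
     xlim = c(0, 40), ylim = c(0, 5))               # explicit axis limits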
If you compare this with the previous one, that histogram had taken its limits automatically from the frequencies, but we can assign limits explicitly and create a histogram that makes more sense. Now let's take another data set, airquality, and view it to see what it contains: you have Ozone, Solar.R, Wind, Temp, Month, and Day — that's the kind of information in airquality. Let's use the plot function to draw a scatter plot; as I mentioned, you would use one when analyzing variables to see the relationship between them. To plot a graph between the ozone and wind values, we say plot, pass the airquality data, and pick the Ozone field and the Wind field. I can also say what the color should be and what type of plot to create, and then look at the information — so you can create a histogram or a scatter plot to understand the data better and then infer information from it. Let's take the airquality data set without specifying any particular column: you get a plot that shows all the pairwise relationships between the values in the data, more like the example we did for ChickWeight with base graphics. You can assign labels to the plot: when creating it, you say airquality, pick Ozone, set the x label to the ozone concentration, set ylab to the number of instances, give the title "Ozone levels in New York City", and choose the color — these are the details we pass to the plot function — and the output shows the ozone concentration against the number of instances. We can also create a histogram from a particular column, such as Solar.R from airquality, which shows the frequency of the solar radiation values; we can then find the mid point, the mean, the standard deviation, and so on, and look at the histogram to judge whether it is left-skewed or right-skewed. Next, let's pull the temperature out of this data set and create a histogram on Temp, which shows the frequency of the temperature values and which values occur most often. You can create a histogram with labels — let's do that with limits, and also use text() to print the count above each bar, so that for each set of values it gives me the labels. You can also build a histogram with non-uniform bin widths: call hist, pass in the temperature, give main, the title, xlab, the limit around the x-axis, the color, the border, and the breaks you want for your bars. This takes the breaks we have given — 55 to 60, 60 to 70, 70 to 75, and so on — creating a histogram with non-uniform widths, and it depends purely on the kind of values you have.
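Here is roughly how those airquality plots look in code (titles and colors are illustrative; the breaks follow the ones read out above):

# scatter plot of ozone against wind
plot(airquality$Ozone, airquality$Wind)

# a labelled plot of the ozone column
plot(airquality$Ozone,
     xlab = "Ozone Concentration", ylab = "No of Instances",
     main = "Ozone levels in New York City", col = "green")

# histograms: solar radiation, then temperature with non-uniform breaks
hist(airquality$Solar.R)
Temperature <- airquality$Temp
hist(Temperature, main = "Temperature frequencies", xlab = "Temp (F)",
     xlim = c(50, 100), col = "magenta", border = "black",
     breaks = c(55, 60, 70, 75, 80, 100))   # bins of unequal width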
You can also create a box plot, which helps us understand the data quartiles and also spot the outliers. Let's create multiple box plots based on the data from airquality: we'll select all the data, do some slicing on it, and create a box plot that shows the values — and if you look at the single dots in the plot, those are your outliers; we'll learn more about them in later sections. You can also use the ggplot2 library to analyze a particular data set. For that we first run install.packages to get ggplot2 — it asks "do you want to restart R" and I say yes; I think the package was already there. Now let's use ggplot2: I call the library function and do an attach on the mtcars data set. Then I create a variable p1, use ggplot, pass in my data, give the aesthetics — the columns I'm interested in — and use geom_boxplot to create a box plot of the values, split by the cylinders column in the data. We can always look at what our data contains and what values or features are available. Next, let's create a box plot using the coordinate-flip function: in the previous plot we had mileage on the y-axis and cylinders on the x-axis; I did a coord_flip, which is like a transpose — the same box plot, just with the coordinates flipped. You can also create a box plot with fill set to the factor of cylinder, which fills in the boxes by cylinder. We can also create factors — we learned earlier that factors are usually used to work with categorical variables — so let's create factors on mtcars for gear, am, and cyl; looking at the factors we created, we passed the data, the field or column we're interested in, the levels of the values, and the labels for those values (you can always revisit the earlier section to learn more about factors). Now let's create a scatter plot using the ggplot function again: the data is mtcars, I use the mapping option and give my aesthetics — what x and y are — plus the geom to use, so let's go with geom_point, which creates the scatter plot. You can also create a scatter plot colored by factors: notice that in all these cases, depending on the data you have and the plot you want, you use ggplot plus an appropriate geom. Here I say data is mtcars, go for mapping, which takes the values for x and y, and set the color, with the coloring done based on factor values — remember, factors have levels, and those levels help you differentiate between your categorical variables — so I say as.factor on cyl and use geom_point to create this scatter plot. But when I run it I get an error that says it must request at least one colour from a hue palette.
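Before turning to that error, here is a sketch of the box plots and factor-colored scatter being attempted, using factor()/as.factor() directly on cyl so no NA labels get introduced:

library(ggplot2)

# box plot of mileage by cylinder count, and the flipped version
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
p1
p1 + coord_flip()   # transpose the axes

# box plot with boxes filled by cylinder
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot()

# scatter plot colored by cylinder as a factor
ggplot(mtcars, aes(x = wt, y = mpg, colour = as.factor(cyl))) +
  geom_point()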
Let's look at that error. The problem we hit when we gave the factor values as the color is that the factors we created with labels contain NA values: if you inspect that column it tells you there are NA values in it, and similarly for gear — or you can look at the complete data set, which shows cyl, am, and gear, where we created labels but ended up with NAs. So instead, we can create a scatter plot as we did earlier by giving the aesthetics — a simple scatter plot with geom_point, so the points use the defaults. You can also give a specific color, which is useful if you want different kinds of data in the same plot, or create scatter plots with different sizes by giving a size, or both a color and a size — that's another way to build your scatter plots. Now let's see how to visualize one more data set, mpg: I point ggplot2 at the data set and look at what it contains — do a View to see what the fields hold and whether any NA values are going to affect your plotting. Then we can create a bar chart: I say ggplot, the data is ggplot2's mpg as given in the previous lines, specify the aesthetics, and say what kind of chart to build — geom_bar — giving me a bar chart of class against count. You can create a stacked bar chart, where the information is stacked within the same bars: still the same data, aesthetics with class, and when we call geom_bar, which builds the stacked bars, we use fill = drv. We can always go back and look at our data — there is the drv column, and you're working on the complete data set — so let's create the stacked bar chart, and it shows the drive information stacked within each bar. You can also dodge by giving the position as dodge: still a grouped chart, but this time the bars sit next to each other, which is very useful. And with geom_point, where you map your aesthetics — we were creating a scatter plot — you can add more detail by saying the color should be based on the class, so the points get colored by their class. You can also use the plotly library: let's install it — I'll say yes and let it restart so all my packages are updated — then load it with the library function, and create a variable to which we assign a plot_ly plot. The data is mtcars, I give the x-axis, the y-axis, and details on the marker, passing a list with a size, a color combination, and the line — what color it has and what the width will be. That's where I use plotly; looking at the plot, it gives me the information. We do see some warnings being generated, but you don't need to worry about those.
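A minimal plot_ly sketch in the spirit of that example — the marker sizes and rgba colors here are illustrative, not the exact values from the video:

library(plotly)

p <- plot_ly(data = mtcars, x = ~hp, y = ~mpg,
             type = "scatter", mode = "markers",
             marker = list(size = 10,
                           color = "rgba(255, 182, 193, 0.9)",   # fill color
                           line = list(color = "rgba(152, 0, 0, 0.8)",
                                       width = 2)))              # outline
p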
You can explore the packages you have and the options you are using; similarly, we can create one more plot using plotly and look at its values — a plot with a trend that explains the data. So this was a small tutorial on using graphics and visualization to understand your data. There are of course many more examples, and many more arguments you can pass into the plot functions, into ggplot, and into the inbuilt packages available in R for visualization — whether for exploratory or explanatory data analysis — so try these graphs, change the options, and create new visualizations. Good morning and good evening, everyone, and welcome to this session where we will learn time series analysis using the R programming language. This is basically a mini project where we take time series data and analyze and visualize it to find important information or gather insights from the data. When we talk about time series analysis, a time series is basically any data set where the values are measured at different points in time. Time series data is usually uniformly spaced at a specific frequency — for example, hourly weather measurements, daily counts of website visits, monthly sales totals, and so on — but it can also be irregularly spaced and sporadic, for example timestamped data in a computer system's event log, or a history of 911 emergency calls. For this project I am taking an energy data set, and we'll see how techniques such as time-based indexing, resampling, and rolling windows can help us explore variations in electricity demand and renewable energy supply over time. Let me show you some aspects of the data set I'm considering: it is the Open Power Systems Data (OPSD) daily data set for Germany, and here is the file. It's in a simple format — it has a time column and values for consumption, and then data for wind, solar, and wind plus solar. In some rows you have only the date and the consumption, but if we scroll down we also find data for wind, solar, wind plus solar, and so on. This is the time series data set we want to work on. Sometimes the data you collect carries not just the date but a full timestamp — hours, minutes, and seconds — and that can be worked on too, but let's take this data set for our project and analyze this time series. We can build data structures out of it such as data frames, do some time-based indexing, visualize the data, look at the seasonality in the data and some frequencies, and also do some trend detection. The data set records electricity production and consumption reported as daily totals in gigawatt-hours (GWh), and the columns are as I was just showing you: date, consumption, wind, solar, and wind plus solar. We will explore how electricity consumption and production in Germany have varied over time, and some of the questions we can answer are: when is electricity consumption typically highest and lowest?
How do wind and solar power production vary with the seasons of the year? What are the long-term trends in electricity consumption, solar power, and wind power? How do wind and solar power production compare with electricity consumption, and how has that ratio changed over time? We can also do some wrangling — cleaning, or pre-processing — of this data, create a data frame, and then visualize it. Let's see how we do that. I'll open up my RStudio and look at the data set; here it is. I'm picking it up from my machine, but you can also pick it up from GitHub — this and similar data sets can be found in my GitHub repository. If you look in the datasets folder you'll find a lot of different data sets, including some time series ones; you can search for "power", and there is this opsd_germany_daily data set along with many others you can work on. For the documentation of this project, you can also look in my GitHub repository, search the repositories for data science and R, and there is a project folder where I have put the documentation, a sample data set, and the time series analysis document; the code is there too, which you can import directly into RStudio to practice or work on this project. So let's see how this works. First we create a data frame from this data set. Notice that I am using header = TRUE so it understands the heading of each column, and I am also giving row.names and specifying the date — there is a date column in the data set, as I showed you earlier. Look at it again: you have date, consumption, wind, solar, wind plus solar, so it makes sense for the date to become the index column, which can be useful. Let's run this and look at what the data frame contains: it shows the data that is now part of this data frame structure, starting with consumption, wind, solar, and wind plus solar, and you can see the date has become my index column. I can always use head or tail to look at part of the data frame. Looking at the first records with head, we see that wind, solar, and wind.solar hold NA values — there are missing values — but a tail shows that later in the data there are values available for wind, solar, and wind plus solar. We can also look at it in a tabular format using View: it confirms that although these columns show NA at the top, if you scroll down you can see values for wind and solar and wind.solar. Now I can look at the dimensions of this object, and it tells me there are 4,384 rows and four columns. You can also look at the structure — that is, check the data type of each column — which can be very useful: I don't see the date column there, because it was taken as the index, but the other columns are all of the num type; that's the data type of each attribute. Now suppose we're interested in that date column: if I try to check its data type, it shows me NULL, because date as a column does not exist — we created it as the index.
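A sketch of this first load, assuming the file is saved locally as opsd_germany_daily.csv with a column named Date, as shown in the walkthrough:

mydata <- read.csv("opsd_germany_daily.csv",
                   header = TRUE,       # first row holds column names
                   row.names = "Date")  # use the Date column as the row index
head(mydata)      # first rows: wind/solar are NA early on
tail(mydata)      # later rows have wind and solar values
dim(mydata)       # number of rows and columns
str(mydata)       # column types (Date is the index, so it is not listed)
summary(mydata)   # min/quartiles/median/mean/max per column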
If I look at row.names on my data, it shows me the index values — that's the date column we are seeing here. We can access a specific row just by giving the index (row name) value: looking that up shows the row for that index, and you can obviously search for a different date. You can also pass a vector and select several values — say 2006-01-02 through 2006-01-04 — and it shows me those rows; here I'm not actually giving a range, just selecting multiple values from row.names. We already know that R has a summary function, so doing a summary gives, for each column, the minimum, first quartile, median, mean, third quartile, and maximum — for consumption, wind, solar, and wind.solar. This is good, but if I really want to visualize, access, and analyze the data, it would be better to keep all the columns and then decide later whether to change the data type of the date column. So, where earlier I used the date as row.names — the names of the rows, or the index, as you'd call it in other programming languages — this time I'll just read the data set with header = TRUE and call it mydata2. Looking at the data, it shows me five columns, the first being date, then consumption, wind, solar, and so on. Looking at the structure to check the data types: if I inspect the date column of the mydata2 data frame, it tells me it is a factor with 4,384 levels, listing the values — so it is not in a date-time format, it's a factor. What we can do is convert it into a date format. How? Let's take a variable x, use the as.Date function, and pass in the date column; that's assigned to x. Looking at head(x), it shows me the values; we can also check what class it is and look at the structure of x — class already says it is of Date type, and the structure shows me the format. So we have converted that column's values into x. How do I now extract components out of it and make them part of the data frame? Once it has been converted to date format, I use as.numeric around format: I create a variable called year and do a format on x, which is of date type, with "%Y", which pulls the year component out — looking at the values shows the year component. Similarly we can get the month out of it and look at the month values, and we can get the day component. Now, mydata2, which we created earlier, has date, consumption, wind, solar, and wind.solar, so I can add these extracted columns — year, month, day — to my data frame using cbind, that is, column bind, and assign it back to mydata2. Once that's done, head shows date (that column is still not of date type — we'll deal with it later), consumption, wind, solar, plus the extracted year, month, and day, which can help us with group-bys, aggregations, plotting, and various other things.
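The conversion and column-binding steps, sketched under the same file-name assumption as before:

mydata2 <- read.csv("opsd_germany_daily.csv", header = TRUE)
str(mydata2)                     # Date comes in as a factor here

x <- as.Date(mydata2$Date)       # convert to Date type
head(x); class(x); str(x)

year  <- as.numeric(format(x, "%Y"))   # extract the year component
month <- as.numeric(format(x, "%m"))   # extract the month
day   <- as.numeric(format(x, "%d"))   # extract the day

mydata2 <- cbind(mydata2, year, month, day)   # add the new columns
head(mydata2)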
Now let's look at the first three rows: I say 1:3 on mydata2, and that shows some data — you can always do a head and look at a sample, which shows the month and day columns, your other columns, and the date. What we want now is to visualize this data and understand the consumption — say, consumption over the years, in gigawatt-hours as we mentioned. To understand the pattern of the data, we can create a line plot of the full time series of Germany's electricity consumption using the plot method. One option is to call plot directly, saying what goes on the x-axis, what goes on the y-axis, the type of graph, and the names of the axes: I take mydata2, extract the year column for x, and take the consumption for y. Creating the plot, we see some tick marks, with the data divided at every two years from 2006 through 2016, but honestly this view doesn't give me a very useful way of looking at the data or understanding it. I can try the same call but also give limits — xlim from 2006 to 2018 and ylim from 800 to 1700 — and looking at it again, it's a plot, but it still doesn't really help me visualize and understand the data. So what are the better options? One is multiple plots in a window — for now we're sticking to one plot per window, but if you want multiple plots you can change the values in par(mfrow) to two or three, which sets how many rows and columns of plots you get. If I want to plot, I can also refer to the column directly: I'm interested in the consumption, so I just do plot on mydata2 and choose the second column, which is consumption, as we saw in our data — a plot in a straightforward way without mentioning the x-axis, y-axis, limits, and so on. This one does show a pattern, but the x-axis and y-axis aren't really named and the graph has no title; it shows some kind of pattern, but maybe we can make it more meaningful. I do it this way: mydata2's second column, with the x-axis labelled year and the y-axis labelled consumption — that fixes the axis names. I can give some more details still: type should be line, I set the line width, I say the color is blue — and this looks more meaningful, showing a wavering pattern of consumption over the years. I can also give an x limit of 0 to 2018, which shows the whole range; then we can be more specific and say the x limit should be 2006 to 2018 — and once you have given a proper limit, the line graph shows what the consumption was from 2006 through 2018.
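The progressively refined plot calls read out above come down to something like this (column 2 of mydata2 is consumption):

# bare-bones: just the consumption column
plot(mydata2[, 2])

# labelled line plot with color, width, and explicit x limits
plot(mydata2$year, mydata2[, 2],
     type = "l", lwd = 2, col = "blue",
     xlab = "Year", ylab = "Consumption",
     xlim = c(2006, 2018))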
Any of these options is fine — it depends on what you are presenting, to whom, and what kind of analysis you are doing. So I can do a plot choosing the second column, with xlab for the x-axis, the y-axis label, type as line, the width, xlim and ylim, and a title, "Consumption graph", and we get the labelled line graph. Those are your options: be very specific, just give the column you want to plot, or make it more meaningful by giving all the details. Now, if we want to look at this data and understand it better than a simple line allows, I can take log values: here I say log of mydata2's second column — the log of consumption — and take the difference of the logs with diff, and you can scale this up or down by multiplying by some number. The rest stays the same; I change the color, and looking at this plot, the log values give a better, more interpretable pattern. That's with the simple plot function in R; you can also use ggplot. For that we can install the ggplot2 package — it's already on my machine, so I'll say no — and load it with library(ggplot2). Now I can use ggplot to plot: I specify mydata2 as the data frame, use x as year and y as consumption in the aesthetics, and plot — and we're back to the kind of view we had earlier, which gives us some data but really not enough information. I can add a grouping in the aesthetics and draw a line — again some information, but it doesn't really help. Another example: the same thing, but with the linetype dashed, using ggplot's other methods such as geom_line and geom_point to add more information. The plot does give me data — it shows the different values and some kind of pattern — but I would still prefer what we were doing with plot. We can change the color and add details, of course, but the point is this: the plot method we used earlier chose pretty good tick locations (every two years) and labelled the years on the x-axis, which was helpful, whereas these dense point-based views are quite crowded and hard to read — you can see the values, but they don't give you enough information. So we'll stay with the plot method, and now let's consider different data. Suppose I want to plot the solar and wind time series. The wind column is what I'm interested in first, and it is always good to start by finding the minimum and maximum values of each column we want to plot — we know consumption is the second column, wind the third, solar the fourth, and wind plus solar the fifth.
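The log-difference view and the ggplot2 alternative, sketched as described (the scaling factor 10 is just an illustrative multiplier):

# differenced log of consumption, scaled for readability
plot(10 * diff(log(mydata2[, 2])),
     type = "l", lwd = 2, col = "red",
     xlab = "Time", ylab = "Scaled log-difference of consumption")

library(ggplot2)
ggplot(mydata2, aes(x = year, y = Consumption, group = 1)) +
  geom_line(linetype = "dashed") +
  geom_point()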
So let's take the minimum of the data's third column — and here I also say remove the NA values, because we don't want to consider those. The minimum comes to 5.7757, and the maximum is 826, which helps me choose a limit: if I want to plot wind on the y-axis I can give a y limit from 5 to 850. For consumption, find the minimum and maximum from the second column; similarly find the minimum and maximum for solar, and for wind plus solar. This will be helpful when we want to plot multiple graphs or give some limits. Now, for multiple plots: instead of having one plot, let's plot consumption, wind, and solar together and try to see a pattern. I call the par function and say three rows and one column, so from now on, when I start plotting, there will be multiple plots in one single window. Let's look at plot one: this is consumption, as we did earlier, and it gives me some data — you can always zoom, and you can expand or shrink the graph to see what kind of pattern we have in consumption. We can also be more specific, choosing date as the x-axis and consumption as the y-axis — because with the plain index we had a range that really didn't tell us much — so I give the x-axis and y-axis, the name "Daily totals", and then consumption, a color, and a y limit based on my minimum and maximum. Looking at the result, this data makes a little more sense because we're looking at the dates; if I zoom, it shows all the dates, the data points, and how the data pattern changes for consumption. That's consumption; we can also extract specific data — you'll see I did some testing here where I pull out a particular date and extract a value. We're using the date column, but remember we did not change its data type; we only extracted year and month from it, so it would be good at some point to convert the column into a proper date format and put that in our data frame. Now let's look at plot two, which is for the solar column: I plot it and see that from 2006 onwards we have some pattern. I can be more specific by giving the date and the column for solar, the x-axis and y-axis labels, the type, the y limit, and the color — it is always good to specify your x and y axes and give names rather than letting them be picked up automatically — and this version makes more sense because it shows the dates. Similarly we do it for wind: either just by giving the column or by giving your x and y axes. So we have plot three, plot two, and plot one, and we can put all of that data into one window as multiple plots; you can always zoom and look at the data, which is really useful for comparing patterns — what kind of pattern we see, what data we have, and so on. Moving forward: we've seen how to create these plots all in one window, so let me reset this back to one plot per window.
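A sketch of the three-panel layout, with the wind panel's y limit taken from the min/max check (about 5.8 to 826, rounded out to 5-850); the column indices assume the Date, Consumption, Wind, Solar ordering shown earlier:

min(mydata2[, 3], na.rm = TRUE)   # wind minimum, ~5.78
max(mydata2[, 3], na.rm = TRUE)   # wind maximum, ~826

par(mfrow = c(3, 1))              # three rows, one column of plots
plot(mydata2[, 2], type = "l", col = "blue",
     ylab = "Consumption", main = "Daily totals")
plot(mydata2[, 4], type = "l", col = "orange", ylab = "Solar")
plot(mydata2[, 3], type = "l", col = "steelblue",
     ylab = "Wind", ylim = c(5, 850))
par(mfrow = c(1, 1))              # reset to one plot per window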
Now let's plot the time series within a single year. What we've seen is that the plain plot method was quite crowded; then we looked at solar and wind, and comparing them you can see your consumption pattern, your solar pattern, and your wind pattern, and from this data we can read some structure. Electricity consumption looks highest in winter — we can confirm whether it peaks in winter or in summer by breaking a year down further into months — but we already see a pattern that repeats every year, peaking at a particular time and then dropping. So electricity consumption is highest in winter, likely due to electric heating and increased lighting usage, and lowest in summer. Consumption also appears to split into two clusters: one oscillating roughly around 1,400 GWh — look at the values around that level — and another with fewer, more scattered data points centered roughly around 1,150 GWh; if you really expand the plot you'll see a lot of data points there. We might guess that these clusters correspond to weekdays and weekends, which we can check by breaking the data into yearly, monthly, and weekly views. Solar production is highest in summer, when sunlight is most abundant, and lowest in winter — and notice that when you gather insights from data like this, you are also using your domain knowledge, your business knowledge, to understand why it behaves this way. Wind power production is highest in winter, due to stronger winds and more frequent storms, and drops down in summer, and there is some kind of increasing trend in wind power production over the years, which we can see here. All the time series we are looking at show some kind of seasonality — a pattern repeating again and again at regular intervals. Consumption, solar, and wind all oscillate between high and low values on a yearly time scale, corresponding to the seasonal changes in weather over the year. But seasonality does not have to correspond to the meteorological seasons: retail sales data, for example, will show yearly seasonality with increased sales in particular months, and seasonality can occur on other time scales too — the plots we're seeing here may well also show weekly seasonality in consumption, corresponding to weekdays and weekends. So let's plot a single year. How do I do that? First I look at mydata2 and its structure: date is a factor and the other columns are all numeric. As we did earlier, I'll repeat the step where I convert the date column into date type, look at its head, class, and structure, and then add it to the data frame: I will create a variable called mod_data using as.Date.
formatting the value, which is a date string, into month, day, and year. Let's run that and look at mod_data: it is now in Date type, as you can see if you look carefully, and I can look at the head of it. Next, for mydata3, we did a cbind of mod_data with mydata2, which adds this column to the other columns, so my new data frame is mydata3. Looking at its structure you see the original date column is still there; I can delete it, remove it, or let it be, that depends on our choice; maybe once our analysis is done we remove mod_data, so for now we keep both. Now let's extract the data for a particular year, which is some wrangling. I'll create mydata4 using the subset function on mydata3: the condition is that the mod_data column should be greater than or equal to January 1, 2017 and less than or equal to December 31, 2017, so I'm getting data for one year and storing it as mydata4. Looking at the head, we are specifically seeing 2017-related data. Now let's plot this for just one year: from mydata4 I take the first column, which is mod_data, and the third column, which is consumption, so I am plotting one year of consumption values against dates, with the rest of the settings as we have done earlier. This plot makes more sense: it runs from Jan to Jan, the year is divided into months with tick labels every two months — Jan, Mar, May, Jul, and so on — and we still see a pattern, which gives me a good understanding once it is broken down into months. This is taking the time series within a single year to investigate further, and now we can clearly see some weekly oscillations. One more interesting feature at this level of granularity is that there is a drastic decrease in electricity consumption in early January and late December, during the holidays, or at least we can probably assume these are the holidays. Now I can zoom in further and look at just the Jan and Feb data. To zoom in further, we work with the same mydata4 idea: earlier I took mydata3, did a subset, and gave the date range; this time I make it narrower. I say mydata4 is a subset of mydata3 where the mod_data column, which we modified with the Date format, starts at January 1, 2017 and goes until the end of February. Let's create this and look at the head: it shows me January data, and you can explore further from there. Again, as earlier, let's find the minimum and maximum of the first column, which is mod_data.
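Putting the conversion, the cbind, and the subsets into one sketch (the column name Date and the format string are assumptions based on the narration):

# convert the factor date column into Date type (format string is a guess)
mod_data <- as.Date(mydata2$Date, format = "%m/%d/%Y")

# bind the converted column alongside the original columns
mydata3 <- cbind(mod_data, mydata2)
str(mydata3)

# subset one full year...
mydata4 <- subset(mydata3, mod_data >= "2017-01-01" & mod_data <= "2017-12-31")

# ...and then narrow it down to January and February 2017
mydata4 <- subset(mydata3, mod_data >= "2017-01-01" & mod_data <= "2017-02-28")
head(mydata4)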
Looking at the values, the minimum is January 1, 2017 and the maximum is February 28, 2017, so we are actually looking at two months of data here. For the y-limits I look at column three, which is consumption: the minimum and maximum values of consumption are the values we can give as our limits. Now let's plot this narrowed-down consumption data: I take the first column, mod_data, and the third column, consumption, give names for the x-axis and y-axis, a title, a color, and then use xlim and ylim to give the minimum and maximum limits. This plot is specifically for the two months, and again I can look at the pattern. What I can also do is add a grid, so I can make more meaning out of the data. Using abline I can add horizontal lines, choosing which lines to draw, which lets me dissect the data and look at it in a more meaningful way. I can also add vertical lines: I give a sequence from the minimum to the maximum date with an interval of seven, so a line is added every week, and now you can see the weekly cadence of the data. At the end of each week consumption drops, then it starts again, peaks somewhere in the middle of the week, and drops down again. So this is your consumption data. What we can also do is create some box plots. When we zoomed into the Jan and Feb data and added these reference lines, we saw that consumption is highest on the weekdays and lowest on the weekends. We now have vertical grid lines and nicely formatted tick labels — Jan 1st, Jan 15th, Feb 1st, and so on — so we can easily tell which days are weekdays and which are weekends. There are many other ways to visualize your time series data depending on what patterns you are trying to explore: you can use scatter plots, heat maps, histograms, and so on. Moving further, we want to explore the seasonality. To do that, we can use box plots to group the data by different time periods and display the distribution for each group. Let's see how the box plot works. I can do a simple box plot of the consumption column, which gives me just the consumption distribution, but on its own that does not tell me much; I can look at the solar data and the wind data the same way, where we can also see some outliers. So what is a box plot? It is a visual display of the five-number summary: you want to look at the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. We can use the quantile function on the consumption column with a vector of probabilities to get that five-number summary.
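A sketch of the two-month plot with abline grid lines; the horizontal levels are illustrative, while the vertical lines step every 7 days as narrated:

# two months of consumption with weekly reference lines
plot(mydata4$mod_data, mydata4$Consumption, type = "l", col = "blue",
     xlab = "Date", ylab = "Consumption", main = "Consumption, Jan-Feb 2017",
     xlim = range(mydata4$mod_data), ylim = range(mydata4$Consumption))

# horizontal grid lines (the exact levels are illustrative)
abline(h = seq(1100, 1700, by = 100), col = "grey", lty = 2)

# vertical lines every 7 days to mark week boundaries
abline(v = seq(min(mydata4$mod_data), max(mydata4$mod_data), by = 7),
       col = "grey", lty = 2)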
Now let's do a box plot. The quantile output tells me the minimum, the 25th percentile, the 50th, the 75th, and the 100th, all from the consumption column. Let's create a box plot for consumption, give it the title "Consumption", label the y-axis, and set a y-limit. That's my consumption box plot, but looking at yearly data will make more sense than one box for the whole series. How do we do it yearly? We write consumption grouped by the year column, so it is consumption but grouped based on year, and again I can give the x-axis, y-axis, and y-limit. This makes more sense: we could add a coloring scheme, but now I'm looking at 2006, 2007, 2008, 2009, and so on, and I can see the range of the data. It gives me the five-number summary of the data per year and lets me look at the seasonality. Similarly we can create the yearly grouped box plot again, giving the title, y-axis, x-axis, and y-limit, where I can also use las to control the orientation of the tick labels; this is one more feature, and it gives me better tick labels. If you compare this to the previous graph, where I had labels only at 2006 and 2008 and from 600 to 1800, the next one shows more useful information. Now let's look at monthly data: I want to group it based on months, so this gives me the monthly box plots, where I could select a particular year or just group across all months, and I can have multiple plots side by side to see the differences. So let's create a box plot for consumption grouped monthly with a color, then look at the wind data grouped monthly, and the solar data grouped monthly. If I zoom in, this shows me the seasonality of the data for wind, for consumption, and for solar. We are creating these box plots and they are giving us values; I could also look at it day-wise, but before that, how do I infer information from these box plots? Looking at the data grouped by month, the box plots show the yearly seasonality we were seeing in the earlier plots, but they give some additional insights. Electricity consumption is generally higher in winter and lower in summer, and we can see in the plot where it is lower and where it is higher. The median and lower two quartiles are lower in December and January compared to November and February: look at the quartiles in the plot and you will see that for Jan and December. That gives some idea of the seasonality, and it might be due to businesses being closed over the holidays; we were also seeing this when we looked at the time series for 2017 only, and the box plot confirms that this is a consistent pattern throughout the years. When you look at solar and wind power production, both show yearly seasonality as well.
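The quantile call and the grouped box plots might look like this, assuming Year and Month columns were added earlier as described:

# five-number summary of consumption
quantile(mydata3$Consumption, probs = c(0, 0.25, 0.5, 0.75, 1))

# one box per year; las = 2 turns the tick labels perpendicular to the axis
boxplot(Consumption ~ Year, data = mydata3, main = "Consumption by year",
        xlab = "Year", ylab = "Consumption", ylim = c(600, 1800),
        las = 2, col = "lightblue")

# grouping by month instead shows the yearly seasonality directly
boxplot(Consumption ~ Month, data = mydata3, main = "Consumption by month",
        xlab = "Month", ylab = "Consumption", col = "lightgreen")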
It depends on what parameters you are choosing, but wind, for example, will reflect the effect of occasional extreme wind speeds associated with storms and other transient weather, and since we are grouping by month we can see this pattern quite clearly every year. Now let's group the data day-wise. Let me again reset to one plot per window, and then I'll do a box plot of consumption grouped by the day column — we know there is a day column — with a y-limit. This groups the data day-wise, so you get 31 boxes, one per day of the month. You could also break it down to a particular week; here I have used the day column and all 31 days, but you can narrow it down to a week and look at the data. If we look at the data per week or per day, we can infer that electricity consumption, grouped by day, is higher on weekdays than on weekends. Time series with strong seasonality can often be represented with models that decompose the signal into seasonality and a long-term trend, and that is an easy way to work with them. Next, how do we look at the frequency of the data? That could be interesting to see. We have the modified date column, which gives me a frequency, and if we really look into the data it tells us the data is on a daily basis. For that, look at mydata3 again, and you can see the dates are in sequence: 22, 23, 24, 25, 26, and so on. I can load the dplyr package, which lets me work with this more easily. Looking at the summary, I see the five-number summary for all the numeric columns: the original date column does not show one because it is just a factor, not a Date, but the other things do, including wind plus solar, year, month, and day. Now we want to find, for each column, how many non-missing entries it has, so we count values while saying NA values should not be considered. Running that shows me, for each column, how many values you have, and these counts do not include the NA values. Similarly, I can check specifically for consumption whether there is any NA value, using is.na: it says zero missing, which is good. Looking at wind, it tells me there are 1463 entries which are NA; similarly I check solar and wind plus solar, so I get a count of NA values, that is missing values, and also of the values which are not missing. To understand the frequency, I can find the minimum of the date, the first column, with na.rm = TRUE, that is, get rid of the NA values; that is the minimum of my modified date, and from that minimum value I can build sequences using the seq function.
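A compact way to reproduce those counts (column names as assumed before):

# non-missing entries per column (the counts exclude NAs)
colSums(!is.na(mydata3))

# missing-value counts for individual columns
sum(is.na(mydata3$Consumption))   # 0
sum(is.na(mydata3$Wind))          # 1463 in this data set
sum(is.na(mydata3$Solar))

# earliest date, ignoring NAs - the starting point for frequency checks
x_min <- min(mydata3$mod_data, na.rm = TRUE)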
I check the frequency day-wise: a sequence from the minimum with a daily step, looking at just five entries to see if there is a day-by-day frequency. The result clearly shows a daily frequency, and I can look at the type, which is integer, and the class, which is Date. Similarly, starting from the minimum I can look at the frequency month-wise, again with five records, which shows me monthly dates, so I can generate the sequence at any frequency, and yearly as well, which is also very useful. Now we can select the rows which have NA values for wind. How do I do that? I want the wind column, and I want to find where its values are NA, so I create a variable, say selected_wind, and assign it the rows of mydata3 where is.na is true for that column. Once I've done this, selected_wind holds the rows of mydata3 where wind is NA, and I give it the column names I'm interested in from mydata3: mod_data, consumption, wind, and solar, so those are the four columns. Looking at the first 10 rows, these are the rows where wind has NA or missing values. I can always do a View, which gives me the complete data: it shows 1463 entries, all with NA in wind, all the way to the end; solar does have some value here in the last row. But also, if you look at the row numbers you see a difference: you have 1461 and then 2174, so there is a gap, meaning there is some data in between where wind does have values. So we have found the NA values; now we will select the data which does not have NA values. I'll call it selected_wind2, again using mydata3 with which, but now saying not NA for this column, and selecting the data for the same columns; looking at 10 records shows rows with no more missing values. If I really compare these two results, the rows with NA and the rows without, we can tell that in the year 2011 the wind column has some missing values, so let's focus on the year 2011.
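The frequency check with seq and the NA/non-NA row selections from this step could be sketched as:

# reference sequences to confirm the sampling frequency
seq(from = x_min, by = "day",   length.out = 5)
seq(from = x_min, by = "month", length.out = 5)
seq(from = x_min, by = "year",  length.out = 5)

cols <- c("mod_data", "Consumption", "Wind", "Solar")

# rows where wind is missing, and rows where it is present
selected_wind  <- mydata3[which( is.na(mydata3$Wind)), cols]
selected_wind2 <- mydata3[which(!is.na(mydata3$Wind)), cols]
head(selected_wind, 10)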
So how do I do that? Let's use a different variable, say wind3. From mydata3, where earlier we used which with is.na, here I say the year column should have a value of 2011, and I want the same columns. Looking at the data, this shows me 2011, but we are not seeing all the values: there are some values, but based on the analysis we have done there should also be some missing ones for 2011. The class of this is a data frame; doing a View will help me find where the NA values are. Scrolling down through all the data, let's search whether the wind column has an NA or missing value, and for which row. Most of the values exist; I could also select and search for one specific value, and I'll show you how we can do that. Scrolling all the way down — it's like you're exploring your data — we find a missing value for one particular row: 13th December 2011 has a wind value, 15th December has a wind value, but 14th December does not. So there was only one entry missing; that could be for some reason, maybe it was not calculated or not tabulated, but we have a missing value, and that can affect my plotting and my analysis. The number of rows tells me how many rows we have for 2011: it is 365, which is the number of days in a year. Now, we earlier checked the total number of NA values per column, in rows 265 to 269 of the script, and this time we want the number of NA values for a particular year. How do I do it? I can do a sum of is.na on the wind column of mydata3, with the condition that the year has to be 2011, and it tells me one; that's right, that's what we saw when we did the View. The number of non-NA values is 364, which satisfies my logic: 364 plus 1 missing makes 365. The structure tells me we have the modified date in Date format, plus consumption, wind, and solar. Now let's create a variable selected_wind4 from wind3, which had all the NA and non-NA rows for 2011, and find the row where the value is NA, with all the columns; that gives me the one specific row which has the NA value. We know the data follows a daily frequency, which we have clearly seen, so now let's select data which has both NA and non-NA values around that gap. Let's call it test1: I use wind3, but now I say I want the rows where the modified date is greater than December 12, 2011.
Remember, when we were doing the View we saw that one particular day, the 14th of December, has no wind value, so I will select a subset of data which includes both the NA and the non-NA rows around it; I can take the 13th and the 15th of December as well. So let's start from the 12th: the date should be greater than December 12, which means from the 13th, and less than December 16, which means up to the 15th, along with the columns. Now we have a small slice of data: I've selected a subset, which I could also have done using the subset function, and it contains both NA and non-NA values. Why are we doing this? Sometimes you have data for a particular column and you want to find out if there are any missing values, maybe to fill them up or replace them with something, which is useful when you are doing trend detection. Say, for example, you have data for every year collected monthly, and in some years a couple of months are missing: for 2016 I have data for all 12 months, for 2017 all 12 months, but for 2018 maybe I don't have data for March and June, and for 2019 the same months are missing. I can forward fill or backward fill them using the neighboring values, for instance the same month from the previous year. So here I have my test data, the extracted subset; its class is a data frame and its structure shows the columns. Now let's load the tidyr package and fill the data: I apply the fill function to test1 on the wind column, which has the missing value, and if you notice, it has done a forward fill — it has taken the previous value and filled the gap with it. You can fill data in different directions, such as down and up, so we can take care of missing values in our frequency data, which allows us to analyze it better. This is how you deal with filling a column at a given frequency: filling can be done in different directions, as I said, and if your data does not have a frequency you may first want to convert the time series to a specified frequency — weekly, daily, or monthly as I showed you — and then forward fill the values. For example, break the data down into weekly values and, if any are missing, use a forward fill; that takes care of missing values in frequency data.
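To recap the gap-filling steps in code — a sketch assuming the Year column and the cols vector from earlier; tidyr's fill defaults to filling downward, i.e. a forward fill:

library(tidyr)

# 2011 rows, then a three-day slice around the missing Dec 14 value
wind3 <- mydata3[which(mydata3$Year == 2011), cols]
test1 <- wind3[which(wind3$mod_data > "2011-12-12" &
                     wind3$mod_data < "2011-12-16"), ]

# forward fill: the missing Dec 14 wind value takes Dec 13's value
test1 <- fill(test1, Wind)              # default .direction = "down"
# fill(test1, Wind, .direction = "up")  # backward fill instead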
Now let's look at the trends of the data, which is the last part of this project. In time series data you usually have some kind of trend: the series exhibits slow, gradual variability in addition to higher-frequency variability such as seasonality and noise. To visualize these trends, we use what we call rolling means. We know how our data is spread over a year, month, or day, but how about looking at a rolling average and seeing the difference? A rolling mean tends to smooth a time series by averaging out variations at frequencies much higher than the window size — there is something called windowing, where you choose a window of time — and it also averages out any seasonality on a time scale equal to the window size, which allows you to look at the lower-frequency variation in the data. Looking at the electricity consumption time series, we already saw there is a weekly pattern and a yearly seasonality, which we saw using box plots, so we can look at rolling means on both of those time scales. How do we do that? You can use a package like zoo, which provides a rolling mean function where you specify the window over which to calculate the mean. Let's look at the code. I'm going to use mydata3, which we have been using so far, and call the result three_day_test; you can give it any name. I use the pipe operator, and with dplyr I first arrange the data in descending order of year, so 2017 or 2018 will be at the top; you can always break this down step by step and inspect each result. Then I group the data by year — it depends on how many years we have, which we will see. This data is then passed to mutate, which lets me apply the rolling mean: I create a column called test_03day, calculating a rolling mean with a window of three days on the consumption column, and then I ungroup. Let's run it and look at the result: you can see we now have the test_3day column, which holds the rolling average. What does that mean? The first value we see, 1367, is the average consumption in 2017 for the first date together with the data points on either side of it: take the values 1130, 1441, and 1530 and compute their mean — for example by running just that expression — and it gives 1367, which is what we see here. So you are getting a rolling average over three days; similarly, a five-day window takes five values and assigns the mean to the middle one, so you can always find the rolling mean for a particular window. Now let's do that for seven days, that is weekly, and for 365 days, that is yearly. Same logic: for mydata_test I take mydata3, arrange it in descending order, group by year — as we saw earlier, grouping shows how many rows fall in each group — and create test_07, a rolling average with a seven-day window, also taking care of the NA values, and similarly a rolling average over 365 days; you could likewise do quarterly or half-yearly. Let's create mydata_test and look at the result: I take mydata_test, arrange it based on the modified date column (we know there is a column called modified date), filter to just the 2017 data, and then choose the columns I'm interested in.
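A sketch of the rolling means with dplyr and zoo, mirroring the narration's per-year grouping (column names are assumptions; rollmean with fill = NA keeps the output the same length as the input):

library(dplyr)
library(zoo)

# centred rolling means over 3-, 7- and 365-day windows
mydata_test <- mydata3 %>%
  arrange(desc(Year)) %>%
  group_by(Year) %>%
  mutate(test_03day  = rollmean(Consumption, k = 3,   fill = NA),
         test_07day  = rollmean(Consumption, k = 7,   fill = NA),
         test_365day = rollmean(Consumption, k = 365, fill = NA)) %>%
  ungroup()

# inspect the 2017 rows
mydata_test %>% arrange(mod_data) %>% filter(Year == 2017) %>%
  select(mod_data, Year, Consumption, test_07day, test_365day) %>% head(7)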
Looking at the 7-day and 365-day columns for, say, the first seven records gives me the consumption value, the modified date, the year, and my 7-day rolling mean for the first seven days; for the 365-day column you will not see the data here, but if I do a View I can see the values. You can always select a particular column to inspect: these are the values of the 7-day rolling average, while for the 365-day one almost all values are missing, with a value appearing only once a full 365-day window is available. Now let's plot and visualize these rolling averages. Let me first reset to one plot per window, then plot the consumption data with x-axis and y-axis labels, a color, and a title. That's my consumption data spread over the whole period, which is fair enough, but now let's add more to this plot. I will add the 7-day rolling average: to add a second series to an existing plot in R you can use points, so I call points with the 7-day column, a line type, a line width, x-limit, y-limit, and a color. That draws the 7-day rolling average, which already gives me some kind of trend. Similarly I can add one more, this time choosing the 365-day column drawn as lines, and now you see those dots here; you could render it in a different way. I can also add a legend, placing it by x and y coordinates — say x at 2500 and y at 1800 — so the legend comes in somewhere here. The legend lists consumption and the rolling means, and I can give the names, the colors, what kind of line each color explains, and then a vector for each. Having added the legend, you can zoom in and look at the plot, and here I see that the x placement is fine but on the y-axis the legend is going a little out of my plotting area, so I can change that: instead of 1800, how about 1600? Then run it again — plot, points, lines, and add the legend — and you can place the legend anywhere in the plot. This gives me the trend I'm looking for from the rolling averages, and similarly you can look at the trend for wind and solar. This is one more way of looking at trend; you can always create the plots in different ways. The 7-day rolling mean has smoothed out all the weekly seasonality we were seeing in the earlier graph, where you looked at every seventh day, while preserving the yearly seasonality, so the 7-day mean tells us that electricity consumption is typically higher in winter and lower in summer. Better still, break it down yearly: for each year you can see when it is winter and when it is summer, what the seasonality and trend look like, and whether there is a decrease or increase for a few weeks every winter. As for the 365-day window, as I said, a rolling average reduces the variation, and looking at the 365-day rolling mean we can see the long-term trend in electricity consumption is pretty flat: there is not much variation over the years if you join these dots, just some highs and lows, and that gives me the trend.
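Pulling the overlay plotting together; here a keyword position such as "topright" is used for the legend instead of raw coordinates, which avoids the overflow problem mentioned above:

# daily series with the 7-day and 365-day rolling means overlaid
plot(mydata_test$mod_data, mydata_test$Consumption, type = "l", col = "grey",
     xlab = "Date", ylab = "Consumption", main = "Consumption with rolling means")
points(mydata_test$mod_data, mydata_test$test_07day,
       type = "l", lwd = 2, col = "blue")
points(mydata_test$mod_data, mydata_test$test_365day,
       type = "l", lwd = 3, col = "red")

legend("topright", legend = c("Daily", "7-day mean", "365-day mean"),
       col = c("grey", "blue", "red"), lty = 1)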
This is how you can do trend detection, and similarly we can do the plotting for wind and solar. So this was a small project which I demonstrated using R. All the code you have seen here, in the form of a project.R file, you can find on my GitHub page, along with a document which explains a few things; feel free to download it and add details to it. The sample data set is also in my repository, in the datasets folder. Continue learning and continue practicing R. Excel is a really powerful tool for data analytics and reporting, and pivot tables are one of the features that Excel offers for creating tabular reports to summarize our data. Let's begin by understanding what a pivot table is. A pivot table is a tool that summarizes and reorganizes selected columns and rows of data in a spreadsheet to obtain a desired report. It does not actually change the spreadsheet data; it simply pivots or turns the data to view it from different perspectives. Pivot tables are especially useful with large amounts of data that would be time-consuming to calculate manually. Now let's understand the different components of a pivot table; there are four main components. First we have rows: when a field is chosen for the row area, it populates as the first column in the pivot table, and, similar to the columns, all row labels are unique values with duplicates removed. Columns is the second component: when a field is chosen for the column area, only the unique values of the field are listed across the top. Then we have values: each value is kept in a pivot table cell and displays the summarized information, the most common summaries being sum, average, minimum, and maximum. Finally we have filters: filters apply a calculation or restriction to the entire table. So let's jump over to Microsoft Excel and let me show you the data set we will use in this demo. With India getting ready for its 16th census in 2021, that is next year, it is a good time for us to analyze India's last census data, from 2011, and see where different states and cities across India stood in terms of population, literacy, and other socio-economic factors. We will analyze this data by creating different pivot tables in Excel and exploring some of its features. So let's begin. First I'll show you one of the features Excel offers: suppose I click on any cell and hit Ctrl+Q; you can see our entire table is selected, and at the bottom right there's a Quick Analysis option. By default Excel prompts certain features, such as formatting, charts, totals, and one more called tables, where Excel has already suggested some pivot tables for us: the first one is sum of district code by state name, next sum of sex ratio by state name, then sum of child sex ratio, sum of male graduates, and sum of female graduates by state name, and there are others. Before creating our own pivot table, let's have a final look at our data set. The first column is the city column, with different cities from different parts of India; then we have the state code and the state name, the district code, the total population followed by the male and female population; next the total literates from each city, then the male and female literates; next the sex ratio, then the child sex ratio; next the total number of graduates; and finally the male and female graduates. Using this table we'll create several pivot tables. First of all, let's create a pivot table
to find the total population for each state and sort it in descending order. You can see the problem statement here: our first pivot table will have the total population for each of the states in descending order. To create a pivot table, click any cell in your data, go to the Insert tab, and on the left you can see the option to create a pivot table. Let me select PivotTable; my range, the entire table, is already selected, and here I'll choose Existing Worksheet because I want to place my pivot table in the same worksheet, pointing the location to cell Q5. Now let me click OK, and the PivotTable Fields pane appears on the right. Since we want the total population for each state, I'll drag state name onto Rows, so in our pivot table you can see the different state names listed. Next we want the total population for each of these states, so in the field list I'll search for total population and drag it under Values. You can see we have our sum of total population for each of the states; by default Excel will sum any numeric column, and you can always change it to average, minimum, maximum, anything you want. Now we want to sort this column in descending order, so I right-click, go to the Sort option, and choose Z to A, that is, largest to smallest. You can see that in 2011 Maharashtra had the highest total population, then Uttar Pradesh, then Andhra Pradesh, and coming down we have Nagaland and the Andaman and Nicobar Islands towards the end. So this is a simple pivot table that we created. The next problem is to find the total sum of literates in each city belonging to a certain state, so let's see how to do it. I'll click on any cell, go to Insert, and click on PivotTable; my range is selected, I'll choose Existing Worksheet and give my location, which is Q5, and click OK. Since we want the total sum of literates, first let me drag the total literates column to Values, which gives the total sum of literates across all the states. Next I want to see the sum of total literates based on states and cities, so let me first drag state name onto Rows and then drag city onto Rows. You can see our pivot table is ready: to the left we have the state names and the cities per state, and on the right the total number of literates from each city. Scrolling down we have Assam, then Bihar, and if I keep scrolling we have all the states: Haryana, Himachal Pradesh, Jammu and Kashmir (which has since become a union territory), Jharkhand, Karnataka, and others. Moving on, the next thing we want to see is the average sex ratio and the child sex ratio for each state, and with that we also want to find the states that had the highest and lowest sex ratio in 2011.
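As an aside for anyone reproducing these reports in R instead of Excel, the first two pivot tables map directly onto a dplyr group-and-summarise; the census data frame and its column names here are illustrative stand-ins for the spreadsheet fields:

library(dplyr)

# pivot table 1: total population per state, sorted descending
census %>%
  group_by(StateName) %>%
  summarise(TotalPopulation = sum(Population)) %>%
  arrange(desc(TotalPopulation))

# pivot table 2: total literates per city within each state
census %>%
  group_by(StateName, City) %>%
  summarise(TotalLiterates = sum(Literates), .groups = "drop")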
So let's create a pivot table for this: click on any cell, go to Insert, choose PivotTable, click on Existing Worksheet, select cell Q5, and click OK. Since we want the average sex ratio and the child sex ratio, first I'll drag those columns; you can either scroll and drag manually or use the search option, so if I look for "child" the matching column is listed and I can drag it from there. Let me also find the sex ratio and place it on top of child sex ratio. Next we want to see it based on different states, so I'll take state name and put it under Rows. Our pivot table is ready: on the left you can see the different state names listed, and on the right the values. Now we want the average, but by default Excel sums the numeric columns; you can see it says sum of sex ratio and sum of child sex ratio. So click on this drop-down, go to Value Field Settings, and under "Summarize values by" choose Average; you can see the custom name now says average of sex ratio. Click OK and the entire column now gives us the average sex ratio. Similarly, let me convert the other column to average: I'll again click the drop-down, go to Value Field Settings, click Average, and click OK, and you can see the average of child sex ratio for each of the states. Now, the next question asks which states had the highest and lowest sex ratio, so we'll sort this column; you can do it either ascending or descending, and let me sort it in descending order. With the column sorted, you can see that in 2011 Kerala had the highest sex ratio, and if I scroll down to the bottom, Himachal Pradesh had the lowest, which is around 818.
Up next, let's explore one more feature of pivot tables: suppose you want to see only the top or bottom few rows of a pivot table; you can do that as well. Here we have a question at hand: we want to find the top three cities with the highest number of female graduates. So let's see how to filter the top three cities out of the entire pivot table. I'll go to Insert, click on the PivotTable option, go to Existing Worksheet, click on Q5, and hit OK. Since we want the top three cities, I'll drag the city column onto Rows, and then we want the female graduates, so in the search bar I'll look for "female", choose the female graduates column, and drag it onto Values. So I have the sum of female graduates for each of the cities. Since we want the highest numbers, let me first sort this column in descending order. From this you can already tell that Delhi, Greater Mumbai, and Bangalore are the top three cities, but it's displaying all the cities, so let's filter only the top three. Right-click, go to Filter, and under Filter you have the option Top 10; I'll select this. Here I only want the top three, so either step down with the arrows or directly type three; the column is already selected, so let me just click OK. There you go, we have the required pivot table ready, and it only displays the top three cities with the highest number of female graduates. The next thing we want to see is how to use a slicer in a pivot table. The question here is: what's the total population for all the cities in Rajasthan and Karnataka? Let's create a pivot table for this and see how you can use a slicer to filter the table. Click on Existing Worksheet, pick a location, this time Q6, and click OK. Since I want the total population, I'll drag total population onto Values, then select city onto Rows, and place state name on top of city. So in the pivot table you have all the states and their cities, and on the right the total population for each of these cities. But our question asks only for Rajasthan and Karnataka. For that, go to Insert and create a slicer — either from that option or from the PivotTable Analyze tab, which also has the option to insert a slicer — and since we want to slice the table based on state, I will choose state name as my slicer field; you can see the slicer appear here. We only want the data for Rajasthan and Karnataka, so I'll search for these two: here is Karnataka, let me select it first, and I also want Rajasthan, so let me select that too. You can see our pivot table now has data only for Rajasthan and Karnataka: it shows the different cities of Karnataka with the total sum of population for each, and similarly for Rajasthan. Moving ahead, we will see another very interesting feature of pivot tables: how to show the percentage contribution in a table. For example, we have a question here: what's the percentage contribution of male and female literates from each state? We want to see it in terms of percentage, not as a sum or average. Let's do that: I'll create my pivot table, click on Existing Worksheet, and select an empty cell. Now, since we want the percentage contribution of male and female literates, first
I'll drag male literates onto Values, followed by female literates onto Values; by default it has summed up the male literates and female literates values. I also want to drag the state column to Rows, so here you can see the sum of male literates and female literates per state. I want to convert this into percentage contribution, so I'll select any cell, right-click, go to Show Values As, and here I have the option Percentage of Grand Total. Selecting this, you can see the percentage contribution of male literates to the total. Now if I sort this, you will get to know which state had the highest percentage contribution: we have Maharashtra for male literates, then Uttar Pradesh in 2011, and coming down we had Meghalaya, Nagaland, and the Andaman and Nicobar Islands as the states with little or minimal contribution to male literates. Similarly, let's do it for female literates: I'll go to Show Values As and select Percentage of Grand Total, and you can see that here too Maharashtra, Uttar Pradesh, and then Gujarat had the highest percentage contribution to female literates. So this is another good feature, converting your data to see it in terms of percentage. Moving ahead, let's say we want to find the bottom three cities from each state that had the lowest female graduates; we can do that as well. I'll go to Insert, click on PivotTable, go to Existing Worksheet, select an empty cell, and click OK. Since I want to see it based on states as well as cities, let me drag the state name first onto Rows and then drag the city column onto Rows. Next we want female graduates, so let me look for female graduates in the field list and drag it onto Values. Now we have the list of states and their respective cities, and to the right of the pivot table the sum of female graduates from each city. First I'll sort this column: right-click, go to Sort, and click Smallest to Largest, so the female graduates are now sorted from smallest to largest. Since I want the bottom three cities from each state, I'll come to this cell, right-click, go to Filter, and select Top 10; now I'll replace Top with Bottom, and I want the bottom three cities from each state, with my column, sum of female graduates, selected. If I click OK, you can see that some of the states don't have three cities: the Andaman and Nicobar Islands has only one city, Port Blair, while for the remaining states you can find the bottom three cities with the lowest number of female graduates. Andhra Pradesh had these three; in Assam we had Nagaon, then Dibrugarh and Silchar; similarly, if we come down, in Haryana we have Palwal, Kaithal, and Jind; further down, for Karnataka there is Gangavati, Ranebennur, and Kolar, and similarly you can see for Kerala as well. Now moving ahead, in the next example I'll show you how to create a calculated field, or calculated column, in Excel with the help of a pivot table. In a pivot table you can create and use custom formulas to create calculated fields or items. Calculated fields are formulas that can refer to other fields in the pivot table, and they appear alongside the other value fields in the pivot table; like other value fields, a calculated field's name may be preceded by "Sum of" followed by the field name. So here we have a sales table with columns like the items, which holds different fruits and vegetables that have been categorized as fruits and vegetables; we have
the price per kg, which is in rupees, and we have the quantity that was sold. Now suppose you want to find the sales for each item in the table: you can create a calculated field, where the sales column is going to be the product of price per kg and quantity. Let me show you how to do that with the help of a pivot table. I'll create a pivot table first, click on an empty cell, and hit OK. Now, at the top, under PivotTable Analyze and under Calculations, we have the option Fields, Items & Sets; if I click on this drop-down I get the option to insert a calculated field. I click on this, give my field name as sales, and build my formula: I first click on price per kg and hit Insert Field, give a space, hit Shift+8 for the multiplication symbol, and then double-click on quantity. So my formula for sales is price per kg multiplied by quantity; I click Add and then OK. You can see there is now a calculated field called sales present in the PivotTable Fields list, but it did not get added to our original table: the original table is the same, and the calculated field exists only in the pivot table field list. It has already been taken under Values, so let's say I want the sum of sales for each item under each category: you can see the category fruit and the category vegetable, and under them different items like apple, apricot, and banana, and in vegetables broccoli, carrots, corn, eggplant, and others. So this is how you create a calculated field in a pivot table. One more good feature Excel offers with pivot tables is the pivot chart: you can use your pivot table to create different charts, and I'll show you how to do that. If I go to Insert, I have the option Recommended Charts; clicking this, Excel gives me some default charts I can use. Let's say I select this one; let me drag it a bit to the right and close the pivot field list. This is a nice column chart that Excel has created, called a pivot chart. Here you can see the categories, fruits and vegetables, with the different items, and on the y-axis the total sales; from the graph, guava made the highest amount of sales. Now if I sort the pivot table first, you can see it clearly: the fruit guava made the highest amount of sales, and since I sorted and changed my pivot table, the pivot chart also automatically gets updated. Similarly, there are other charts you can create. Let's go to the Insert tab, click Recommended Charts again, and look for a pie chart; this is a pie chart you can create, so let me click OK. Here is our pie chart: each slice represents a certain item, and the slice with the largest area represents the item with the highest amount of sales, in this case guava; similarly we have the other items — banana, corn, spinach, and others. Let's explore a few other charts. First I'll click on my pivot table, go to Insert, and under Recommended Charts select a line chart this time. If I hit OK and move it to the right, this is a line chart: it starts from guava, which had the highest amount of sales, then it drops, and on the x-axis you can see the different items; when it gets to the vegetables, broccoli made the highest amount of sales at 2800 rupees, and the lowest was eggplant at 900 rupees.
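For comparison, the calculated field is just a derived column; a dplyr sketch with an assumed sales_data frame and illustrative column names:

library(dplyr)

# a derived "Sales" column, like the pivot table's calculated field
sales_data <- sales_data %>%
  mutate(Sales = PricePerKg * Quantity)

# the summary behind the pivot chart: total sales per item in each category
sales_data %>%
  group_by(Category, Item) %>%
  summarise(TotalSales = sum(Sales), .groups = "drop") %>%
  arrange(desc(TotalSales))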
For fruits, papaya sold the least at 700 rupees. Let's take another chart: I'll go to Insert, and under Recommended Charts let's see a bar chart this time. This is a horizontal bar chart, not a vertical one; we just saw a vertical column chart, and this is the horizontal equivalent. You can always increase and decrease the size of these charts. Let's explore one last chart, the area chart. An area chart again looks similar to the line chart: it starts with guava, which had the highest amount of sales; similarly, papaya under fruits had the lowest amount of sales, and under vegetables broccoli was the highest and eggplant made the lowest amount of sales. Now let's go to our first sheet and summarize what we have done in this demo of pivot tables in Excel. We had our data, the 2011 census data from India, with the different cities, the state names, the total population, total literates, female literates, male literates, the sex ratio, total graduates, and other information. We began by understanding how to create a simple pivot table, where we calculated the total population for each state and sorted it in descending order, finding that Maharashtra and Uttar Pradesh had the highest total population in 2011. Then we saw another pivot table where we calculated the total sum of literates in each city belonging to a certain state, with the different state names and the cities under each state. Then we saw a feature for calculating the average of a numerical column: we calculated the average sex ratio and the child sex ratio for each state and found out which states had the highest and lowest sex ratio. After that we saw how to filter tables: we found the top three cities with the highest number of female graduates, which were Delhi, Greater Mumbai, and Bangalore. Next we saw how to use a slicer in a pivot table: we sliced our table based on the states Rajasthan and Karnataka and saw the total population for all their cities. In the next sheet we explored another feature, finding the percentage contribution of male and female literates from each state. Then we saw how to find the bottom three cities for each state having the lowest female graduates, and one thing to note was that some of the states did not have three cities; for example, Andaman had only one city, Port Blair, but for the others we found the bottom three cities with the lowest female graduates. Finally we looked at how to create a calculated field in a pivot table, creating a calculated field called sales, and then we explored how to create different charts and graphs: an area chart, a column chart, a horizontal bar chart, and a pie chart as well. In this video we'll be creating two dashboards using a sample sales data set, so if you want the data and the dashboard file that we'll be creating in this demo, please put your email IDs in the comments section of the video and our team will share the files via email. Now let's begin by understanding what a dashboard in Excel is. A dashboard is a visual interface that provides an overview of key measures relevant to a particular objective with the help of charts and graphs. Dashboard reports allow managers to get a high-level overview of the business
and help them make quick decisions. There are different types of dashboards, such as strategic dashboards, analytical dashboards, and operational dashboards. An advantage of dashboards is the quick detection of outliers and correlations through comprehensive data visualization, and it is time-saving compared to running multiple reports. With this understanding, let's jump into our demo. For creating our dashboards we'll be using a sample sales data set; let me show you the data set first. This data set was actually generated using a simulator and is completely random; it was not validated, though we have applied certain transformations to the data using Power Query features. As you can see, this data has 1000 rows — using the simulator we generated a thousand rows of data — and if I go to the top you can see the data set has 17 columns. Now let me give you a brief on each of the columns. First we have the region column: we have Middle East and North Africa, North America, Asia, Sub-Saharan Africa, and others. Similarly, we have the country names from which the item was ordered. The third column is the item type: we have different items such as cosmetics, vegetables, baby food, cereal, fruits, etc. Then we have the representative's name, which you can also read as the customer who ordered the product. Then we have the sales channel column: there are basically two channels, depending on whether the item was sold offline or online. Next is the order priority column, where C stands for critical, H for high priority, M for medium priority, and finally L for low priority orders. You can see the order date column, then the order ID and the ship date; next units sold, which is the total number of units sold for each item; then the unit price column, the price at which each product was sold; then the unit cost column, which is the production cost for each of the items. Next we have the total revenue, which is the product of units sold and unit price; then the total cost column, which is the product of units sold and unit cost; similarly the total profit column, which is the difference between total revenue and total cost. Finally, we created two more columns, order year and order month, which were generated using the Power Query features: we used the order date column and extracted order year and order month. First we are going to create a revenue dashboard, where we'll focus on generating reports for revenue by order year, revenue by year and region, revenue by order priority, and much more. We'll create separate pivot tables and pivot charts and format them to look more interesting and presentable, and we'll add slicers and a timeline to our dashboards in order to filter them based on specific fields. Now let's create our first report, to see the total revenue generated each year. We need to create a pivot table for this: I'll click on a cell in my data set, then go to the Insert tab, where we have the option to select PivotTable. I click on this, and you can see my table range is selected; next, I want to place my pivot table in a new worksheet, so let's just click OK. There you go, we have a new sheet where I can place my pivot table. First I need to find the total revenue generated by each
year, so what I'll do is drag my order year column under Rows and then select the total revenue column under Values, and you can see the pivot table is ready. If you want, you can sort this; from the data you can see we have order years from 2010 to 2017. Now based on this data let's create our pivot chart: I'll click on any cell, go to Insert, and here you have the option to select Recommended Charts. I actually want a line chart, so I'll click on Line and select OK. There you go, we have successfully created our first pivot chart. Now let me show you how to format this chart to make it more readable. First let me delete these buttons: I'll right-click and select Hide All Field Buttons on Chart, which removes the buttons present on the chart. Next, let me go ahead and edit the chart title; the title I want is "Total Revenue by Year". Next, a few more transformations: if I click on this plus sign, which is for chart elements, we have options to add axes, axis titles, chart title, data labels, gridlines, legend, and others. Let's remove the legend; you can see the legend is gone. Now let me add axis titles, so we label our x-axis and y-axis: for the x-axis I write "Year", and on the y-axis I'll put "Revenue". Now let me select the chart style option and go to Colors; I'll select a yellow palette, then go back to Style and pick a new style from the list — I want this one. You can also add data labels, so I'll click on Data Labels, and you can see the revenue for each of the years, but this is not readable at all, so we'll format it a bit. If I click on this arrow I have More Options, and scrolling down you can see something called Number. I'll expand this, and under Category I'll select Custom. Here we'll give a format code, which works like a small formula: I write a condition saying that if the revenue value is greater than 999,999 — let me make sure there are six 9s: one, two, three, four, five, six — I close the bracket, give a hash followed by two commas (each comma divides the value by a thousand), and within double quotes write "M" so the value shows in millions; then a semicolon followed by another section so that a value below that threshold shows as 0 million. The resulting code looks something like [>999999]#,,"M";0,,"M". Let me click Add, and you can see how nicely we have formatted the data; the new custom format shows the revenue in millions. Now, if you want, you can go ahead and adjust the boxes; let me move this up a bit and delete this. If you study this line chart you can make a few conclusions: in 2010 the total revenue generated was nearly 175 million; this came down to 150 million in 2011; then the revenue grew steadily from 2011 till 2014, reaching 195 million; after 2014 it came down again to 180 million, and the revenue dropped significantly between 2016 and 2017 — in 2017 the revenue was just 96 million. Before moving ahead to my next chart, let me rename this sheet as "revenue by year". Now let's analyze the revenue generated each year in different regions, for which we'll create another pivot table. Let me close this, click on any cell, go to Insert, select PivotTable, and just click OK so that my pivot table is placed on a new
all right now let's analyze the revenue generated each year in different regions so for this we'll create another pivot table let me close this i'll click on any cell go to insert and select pivot table i'll just click ok so that my pivot table is placed on a new sheet all right now this time we want the revenue by each year and region so first of all let's drag region to columns then let's drag the order year column under rows and then i'll select total revenue onto values so here you can see we have the pivot table ready so for 2010 you can see in asia this was the revenue generated similarly if you see for 2013 this was the total revenue generated in europe and we have it for other years as well now let's create a line chart based on this pivot table so i'll select any cell in the pivot table i'll go to insert and i'll click on recommended charts from this list i'll select my line chart and click on ok there you go so we have our next pivot chart ready so on the right you see the different regions that are present in different colors let me just expand it so that you can see all the regions we have so in total we have seven regions and each of the regions has been represented in a different color so if you notice this graph for the sub-saharan african region in 2012 sub-saharan africa made the highest amount of sales now from the sample data you can also tell that the revenue for north america has been significantly low compared to other regions similarly if you see for europe this was the revenue trend between 2010 and 2017 so if you see here in 2011 the sales were at this level then they dropped significantly in 2012 then in 2013 there was a huge spike and then it came down again in 2015 and so on so you can make your own conclusions by looking at these line charts now let's format this chart so first of all let's delete the field buttons present on the chart and we'll also delete the legend all right now let me just reduce the size of the chart next we'll add a chart title so we'll give the title as revenue by year and region okay you can also format the y axis in terms of millions so i'll right click on this axis and i'll select format axis i'll scroll down and here we have the number drop down let me scroll again under category i'll select custom and we'll use the format that we created for our previous chart there you go you can see our axis labels have been changed in terms of millions now so let's close this and let me save it now you can reduce or increase the font size let me just show you suppose you want to increase the font size of the chart title so you just select it and from here you can either reduce it or increase it you can see now it's 12 if you want you can make it 16.
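the year-by-region pivot we just built has a direct pandas analogue if you ever want to check excel's numbers in code; a minimal sketch, assuming a hypothetical sales_df with order_year, region and total_revenue columns and a few made-up rows

```python
import pandas as pd

# a few illustrative rows standing in for the sales data sheet
sales_df = pd.DataFrame({
    "order_year": [2010, 2010, 2011, 2011],
    "region": ["asia", "europe", "asia", "europe"],
    "total_revenue": [1200.0, 950.0, 1100.0, 1300.0],
})

# years as rows, regions as columns, summed revenue as values
pivot = sales_df.pivot_table(index="order_year", columns="region",
                             values="total_revenue", aggfunc="sum")
print(pivot)
pivot.plot(kind="line", title="revenue by year and region")  # one line per region
```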
similarly you can also edit the axis labels and by selecting the chart title you can move it to the left or right or you can place it in the center as well for the time being let me just keep it to the left all right now we'll see the revenue and total cost by each region and we'll create a combo chart for this so let me show you how to do it i'll go to my data sheet i have my cell selected go to insert and click on pivot table let me just click on ok all right so for this i'll select my region onto rows and then i'll have two columns under values the first one is going to be total revenue and the next column will be the total cost column all right so here we have the pivot table ready now based on this pivot table let's create our pivot chart so i'll go to recommended charts and if you see below at the bottom we have a combo chart so this is the preview of the combo chart all right now let me just click on ok there you go so we have a nice combo chart ready here now the way to look at it is the bars represent the total revenue which is this column and the line represents the total cost so let me go ahead and edit this chart a bit so first of all let's delete the field buttons all right and let's also remove the legend from here next we'll add data labels so i'll click on data labels here okay so these are the data labels for the bars or the revenue column now let's format the data labels in terms of millions so i'll click on this arrow go to more options if i scroll down i have number from here i'll select custom and i'll choose my type that is in millions all right so you can see we have formatted our data next we'll add a chart title so here i'll write revenue by region it's actually revenue and total cost by region before moving ahead let me rename the sheets as well i'll write revenue and total cost similarly sheet 3 also i'm going to rename as revenue by year and region so this makes your workbook more readable all right now moving ahead next we are interested in getting the revenue generated by order priority and for this we are going to create a pie chart so let's go to the data sheet and create our pivot table first i click on ok now i'll select order priority under rows and under values i'll select total revenue so this is a very simple pivot table so you have your order priority so c is for critical h is for high l is for low and m is for medium now based on this let's create a pie chart so i'll go to recommended charts and here you have pie chart i want to select this 3d type of pie chart and i'll click on ok all right so we have our pie chart ready let me just resize it and from here i'll remove the field buttons and i also don't need the legend so i'll delete this as well all right now let's give a chart title so this is going to be revenue by order priority now let's add our data labels i'll check this option okay now let's again format this in terms of millions so here i'll click on the last option i'll go to number under category i'll select custom and my type is going to be in terms of millions there you go let me close this i will just move this to the center all right now if you want you can change the color of the text as well so let's have it in white color and see how it looks okay so this looks pretty decent cool
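going back a step, the bar-plus-line combo chart is easy to sketch in matplotlib too; here is a minimal version with made-up regional totals, using a second y-axis for the cost line (all names and numbers here are illustrative assumptions, not the workbook's figures)

```python
import matplotlib.pyplot as plt

regions = ["asia", "europe", "north america"]   # illustrative
revenue = [320e6, 280e6, 90e6]
cost = [210e6, 190e6, 60e6]

fig, ax1 = plt.subplots()
ax1.bar(regions, revenue, color="steelblue")         # bars = total revenue
ax2 = ax1.twinx()                                    # second y-axis for the line
ax2.plot(regions, cost, color="orange", marker="o")  # line = total cost
ax1.set_title("revenue and total cost by region")
plt.show()
```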
now moving to our next report so this time we are going to find the total revenue by countries so we have multiple countries present in our data set and we want to visualize the revenue generated in each country so for this we are going to create a bar chart so let me show you how to do it but before moving ahead let me just rename this sheet so i'll write revenue by and i'll just put op which stands for order priority all right now let's create our bar chart i'll go to insert click on pivot table and select ok so i want my revenue based on different countries so i'll select country and put it under rows and then i'll choose total revenue and place it under values so here you have the different country names we have afghanistan then albania let me scroll down you have bangladesh there are a number of countries you have czech republic there's estonia france gabon similarly if you scroll down we have india there's jamaica italy and all the way to the bottom if we go we have new zealand then netherlands philippines portugal we also have singapore lots and lots of countries we have the uae united states of america zimbabwe and others all right let me go up so based on this pivot table let's create our pivot chart so i'll go to insert and select recommended charts from here i'm going to select the column chart you can see the preview here and let me click on ok all right so here you can see the different country names at the bottom and the revenue for each of the countries let's go ahead and edit this chart so first of all i'll delete the field buttons okay and let me also remove the legend here i'll write revenue by countries this is going to be my chart title okay let's format this chart a little more so i'll click on this option and we'll select a new style let's say i'll select style 6 okay and let me now go under colors and we'll select the color of the bars so let's choose this color okay so you have a column chart ready and at the bottom you can see the different country names and we have the revenue cool now let me go ahead and rename this sheet so i'll write revenue by countries and hit enter okay and finally we create another report which is going to be part of our revenue dashboard and this is revenue by items so we'll visualize our revenue for the different items present in the table so if you see this we have cosmetics vegetables cereal fruits then clothes snacks household and other products as well so let's check the revenue for each of these items so we'll continue the same drill i'll create my pivot table on a new worksheet and this time i am going to drag item type under rows and we'll have the total revenue under values so here on the left of the table you can see we have the different item names and then we have the total revenue so let me just sort this total revenue from largest to smallest so you can see here office supplies made the highest amount of revenue followed by household then cosmetics and fruits made the lowest amount of revenue i'll click on this go to insert and select recommended charts this time i am going to create a bar chart so this is how my bar chart is going to look i'll select ok all right now let's format this chart a bit i'll delete the field buttons and i'll delete the legend as well and let's edit the chart title so this is going to be revenue by items cool we also want to change the color of the bars so i have selected all the bars i'll go to my home tab and here let's say i want to select green color all right i've edited my chart a bit now let's make the font 14 and i'll remove the bold okay here if you want you can change the font color also let's keep it in blue color all right finally let's rename this sheet so i'll write revenue by items cool
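the sort-then-plot step translates directly to pandas; a minimal sketch with illustrative item totals (not the workbook's real numbers)

```python
import pandas as pd
import matplotlib.pyplot as plt

# illustrative totals standing in for the item-type pivot table
revenue_by_item = pd.Series({
    "office supplies": 8.9e8, "household": 8.5e8,
    "cosmetics": 7.1e8, "fruits": 4.0e7,
})

# sort the totals, like sorting the pivot table; ascending order here
# puts the largest bar at the top of a horizontal bar chart
revenue_by_item.sort_values().plot(kind="barh", color="green",
                                   title="revenue by items")
plt.show()
```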
finally now it's time for us to merge all the charts that we have created into our dashboard so let me show you how you can create the dashboard i'll create a new sheet and the first thing i'm going to do is i'll click on the view tab and uncheck gridlines so this will remove the gridlines present in the worksheet next i'm going to insert an image so we'll have a background image on our dashboard so the way to do it is i'll go to the insert tab and under illustrations i have the option to select pictures or insert pictures so i'm going to insert a picture present on the device that is my computer i'll go to desktop and here i have a folder called excel dashboard files and i'll select this dashboard background and hit insert so this is going to insert an image now let me just drag this image so it covers a fair enough portion so i'll hold shift and i'll drag it all right so you can see i have successfully added a background image if you want you can still expand this background image a bit to the right cool now the next thing is going to be the title of the dashboard so i'll click on insert and here i have the option to select a text box so i'll click on a text box and i'm going to place a text box in the middle and i'm going to name this text box as excel revenue dashboard on sales i'll centralize it let's do some more formatting so i'll select this text box and on the top you can see shape format here i'm going to expand this shape fill and i'll select no fill so my text box is transparent now and i'll also remove the outline all right now let me just double click on the title of my dashboard and i'm going to select a font you can select whichever font you want let me stick to britannic bold and i'll increase the size to let's say 30 all right i'll just drag the text box and i'll make the text white instead of black all right so we have the title of the dashboard ready now if you want you can also insert some icons into this dashboard so i'll go to insert and i'll click on illustrations again and select pictures i'm going to add these two pictures which are of a store and a cart to make it look visually appealing so i'll place the icons here and similarly let me just copy them and i'll place the cart and the store to the right as well all right next the idea is to bring in all the charts that we have created and place them on the dashboard so let me copy each of the charts one by one i'll hit ctrl v to paste it and we'll resize this as well all right similarly let me bring in all the other charts as well all right so now you can see i have added all my charts and graphs to this dashboard so you can see here we have our line charts our column charts the combo chart the pie chart and others now let me go ahead and format these charts a little more so you can see this looks a bit cluttered so let's adjust the labels let me bring this down similarly i'll bring the 190 million label a little below all right this looks fine now one more thing we are going to do is we'll remove the white background from each of the charts and make them transparent so let me show you how to do it so i'll select this chart then i'll right click and go to format chart area here on the right you see we have an option called no fill so if i select no fill you can see the white background is gone now similarly let me also remove the gridlines so i'll select the gridlines and hit delete so we have also removed the gridlines from here now let's also remove the white outline that we have so i'll select this chart go to format and here i'll go to shape
outline and i'll select no outline you see this so we have our total revenue by year which is a line chart and this is completely transparent now now what i'm going to do is i'll place this chart over a box so i'll go to insert and in insert we have the option to create a shape so i'll click on illustrations and here i'll choose a shape and let me select a rectangle so i'll just create a rectangle here all right and now what i'll do is i'll select this chart and bring it to the front i'll right click and choose bring to front and i'll place this shape below it all right now the next thing is to edit the shape so first i'll change the color of this box so let me select this blue color and let me increase the transparency so i'll right click and go to format shape here i'll increase the transparency let's keep it at 25 percent or let's say 20 percent all right next we'll just convert all the font to white color including the axis labels and the chart title so it looks better now we'll just adjust our chart over here next let's just remove the outline so i'll go to shape outline and i'll select no outline you see we have now formatted our chart let's just pull this a little up all right now we'll add this blue background to all the other charts so we'll first add the background make it transparent and then we'll convert the font text to white color to make it more readable and visible so for the time being i'll just pause the video and come back again all right so now you can see on your screens we have nicely formatted our dashboard so i've added a few logos for each of the charts you can see the logos here so for revenue by countries we have a globe then if you see here this is kind of a map or a location similarly we have formatted the color of the bars then we have also formatted the labels in terms of millions if you look on the y axis even the revenue by year and region is formatted in terms of millions if you want you can also format the total revenue by year in terms of millions so the way to do it is you can select this graph right click and go to format axis here if i scroll down you have number and under category i'll select custom then i'll select my type as this format which is in millions and you see here we have successfully formatted our y axis labels all right so the next thing is to add slicers and timelines to the dashboard now slicers are used to filter your data based on a particular column suppose you want to see revenue for certain items you can add item as a slicer and filter the entire dashboard with it similarly for timelines you can add date columns so if you want to see the amount of sales or revenue generated in a particular year or a particular month you can do that using a timeline so i'll select one of the charts and then either you can go to the insert tab where you can see under filters we have slicers and timeline or if you go to the pivot chart analyze tab here also you have the insert slicer and timeline options so i'll select insert slicer here first now it's giving me the list of fields present in the data set so i'll select country region and let's say we want to know by item type and sales channel so these are going to be my four slicers i'll click on ok you can see it here we have our four slicers and these are the lists of values in the region slicer we have asia europe north america and others similarly we have the different country names for the country slicer
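if it helps to think about what a slicer actually does, it is just a filter followed by re-aggregation; here is a minimal pandas sketch of the same idea with a hypothetical sales_df (column names and rows are made up for illustration)

```python
import pandas as pd

sales_df = pd.DataFrame({
    "order_year": [2010, 2010, 2011],
    "country": ["india", "france", "india"],
    "item_type": ["beverages", "beverages", "household"],
    "total_revenue": [500.0, 300.0, 700.0],
})

# a slicer is a filter: keep only beverages, then rebuild each chart's aggregate
beverages = sales_df[sales_df["item_type"] == "beverages"]
print(beverages.groupby("order_year")["total_revenue"].sum())
```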
and then for item type also we have all the items that were present in our data set now moving ahead we need to connect all the slicers to the dashboard so what i'll do is i'll right click on this option and i'll go to report connections okay so under report connections you have all the pivot tables that we created you can see currently only one of the pivot tables is selected so we need to select all the pivot tables so let me check all the pivot tables present in this workbook and click on ok all right now that we have connected one of our slicers we'll now connect the other remaining slicers so i'll right click on this go to report connections and i'll check all the pivot tables present in this workbook click on ok similarly let's do it for the country slicer i'll go to report connections and let me select all the pivot tables and finally we have the item type so i'll right click go to report connections and then i'll select all my pivot tables and let's hit ok all right now let me just organize this a bit so i'll place my slicers to the right i'll just reduce the size let me scroll down i'll add my region slicer here similarly i'll add my final slicer that is sales channel now in our next dashboard which is going to be the profit dashboard i'll show you how to add a timeline all right now i have arranged all my slicers so let's say you want to find the revenue that was generated for an item type let's say beverages so you can just select beverages here and all your charts show the respective revenues so you have the total revenue by year for beverages only similarly here you can see the revenue by year and region only for the beverages item type if i scroll down now this chart represents the revenue that was generated in each of the countries only for the item type beverages let me just uncheck it all right let's say you want to see the revenue generated for a country like india so i have selected india here and now you can see my graphs have changed only for the country india you can see here it is showing only for india now similarly you can also filter your revenues based on the different regions let's say you want to know the revenue generated based on sales channel so we have two sales channels that is offline and online suppose you want to know the revenue generated offline so i'll just select offline you can see the values have changed so these were the revenues generated for each of the items only for offline and if you see here these were all the offline sales for the different regions so this is our entire excel revenue dashboard on sales we created multiple charts and graphs then we applied different formatting we added different icons then we formatted the labels next we added slicers and finally we saw how we could filter our data based on these slicers likewise now we are going to create a profit dashboard based on the same data so before moving ahead let me rename this sheet as revenue dashboard i'll write rev dashboard okay now we'll move to our data sheet and start creating our pivot tables and pivot charts for the profit dashboard all right so let me go ahead and create my first pivot table so i'll create a new worksheet this time i'm going to create a line chart to visualize the profit for each year so i'll drag my total profit column to values and my order year to rows so here you can see we have our pivot table ready now you can sort this data to get an idea as to which year had the highest profit and which year had the lowest profit so from this pivot table you can see since i have
sorted this data in descending order so 2014 had the maximum amount of profit and 2017 had the least amount of profit i'll just do control z to undo it all right now based on this pivot table let me go ahead and create my pivot chart so i'll go to recommended charts and click on a line chart so this is the preview of the chart i'll click on ok let me close this similarly we are going to edit this chart now so first i'll hide all the field buttons present on the chart and i'll rename the chart title as total profit by year next i'm going to remove the legend so i'll delete this let's do some more formatting so i'll go to style and this time i'm going to select my style type okay and if you want you can choose the colors as well for the time being let's have this yellow color next let me add the data labels so again if you see here this is not formatted properly so let's go ahead and format the labels so i'll click on number and i'll select custom here and the type i'm going to select is in millions and i'll click on close so here you can see we have our line chart ready which shows total profit by year let's rename this sheet as profit by year all right now let's move back to our data sheet again next we are going to show the total profit by countries and for this i am going to create a map so let me first create my pivot table so i'll go to insert and i'll click on pivot table let me click on ok since i want the country names i'll select country under rows and then i have my total profit under values the next thing i'm going to do is i'll just rename the row labels as countries and then i'm going to delete the grand total which you can see at the bottom so i'm going to select this pivot table go to the design tab here we have subtotals and grand totals i'll switch off the grand total let me just verify it again i'll scroll down you see the grand total is gone now all right now we want to create a map out of this so the way to do it is i'm going to select my data copy it i'll go on top and i'll paste it here using this data i can create my filled map now so i'll go to insert here we have the option to create a filled map there you go you can see we have our map ready i can expand this as you can see our map has a color scale which goes from a light gray color to a dark blue color so the countries that are in gray or you can say light blue have the lowest amount of profit while the regions or the countries that have been shaded in a dark blue color have the highest amount of profit i will go ahead and delete this scale okay next we need to connect this map to the original data source so what i'll do is i'll right click on this map and i'll go to select data here instead of the previous range i'll give my new range now so my new range will be the original pivot table that i had created i'll go on top and click on ok so we have our map ready now now if you want you can change the color of the shade so i'll just go to colors and let's say we'll keep green color so the countries that are shaded in dark green have the highest amount of profit while those which are highlighted in light green color are the countries that made the least amount of profit okay now moving on next we want to create a pivot table that will show us the profit by year and sales channel so for this we are going to create another line chart so i'll go to insert and click on pivot table so i'll select new worksheet here
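before we go on, excel's filled map has a rough python cousin in plotly, which this course uses later anyway; a minimal sketch with made-up profit figures, where locationmode set to country names assumes the country labels match plotly's own country naming

```python
import pandas as pd
import plotly.express as px

# illustrative numbers only, not the workbook's profits
profit_df = pd.DataFrame({
    "country": ["India", "France", "United Kingdom"],
    "total_profit": [1.2e6, 2.3e6, 1.8e6],
})

# shade each country by its total profit, like the excel filled map
fig = px.choropleth(profit_df, locations="country",
                    locationmode="country names", color="total_profit",
                    color_continuous_scale="Greens",
                    title="total profit by country")
fig.show()
```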
since i want to know the profit by year first of all i'll drag my order year column to rows and then i'll choose my total profit column under values next i'm going to select my sales channel under columns there you go so we have our pivot table here now based on this pivot table let me create my pivot chart so i'll go to recommended charts and i'm going to create a line chart i'll close this you see here based on this chart you can tell the profit generated with online sales was actually lower than that of offline so here the blue line represents offline profit and the orange line represents online profit if you look closely in the year 2012 the online profit was actually higher than the offline profit so let me go ahead and edit this chart a bit so we'll delete the field buttons i'll also delete the legend for now let me go ahead and add a chart title so i'll write profit by year and sales channel okay so this is my second report before moving ahead let me just rename this sheet so i'll write profit by countries similarly let me rename this sheet as profit by year and let's say sc for sales channel okay moving ahead now i want to create a pie chart based on a pivot table that will show the profit by sales channel only so this is going to be a simple pie chart so i'll first go to insert click on pivot table and click on ok so i'll drag my sales channel under rows and then we'll have the total profit column under values so this is my simple pivot table now let's create our pivot chart which is going to be a pie chart let me explore the other types of pie charts we have okay so i'm going to select a donut chart here i click on ok let's edit this chart i'll remove the field buttons let me now remove the legend as well i'll just resize it and the title is going to be profit by sales channel okay let's also add data labels and here again i am going to format this label i'll select the category as custom and my type will be in millions okay let me just move this to the left and this to the right okay let's also delete the lines cool now let me just rename this sheet so i'll write profit by sc which stands for sales channel cool finally i'm going to create a report that will show the revenue and profit by items so i'll go ahead and create my pivot table first this time i'll choose my total profit under values and we'll also have the revenue column so i'll put my revenue at the top then i'm going to select item type under rows so here is my pivot table based on this pivot table let me now create a combo chart so you can see the preview of the chart the blue bars represent the total revenue and the orange line represents the total profit i'll click on ok let me close this first let's remove the field buttons let's also remove the legend here then we'll add a chart title i'll name it revenue and profit by items okay if you want you can also go ahead and change the color of the bars so let me just select one of these colors okay all right so we have our five reports ready that we are going to use for our profit dashboard next let's create a new sheet and we'll get started with building the dashboard so i'll click on a new sheet let me just rename this as profit dashboard all right we'll continue with the previous drill so first of all let's go to the view tab and remove the gridlines now we'll insert a background image like we did for our revenue dashboard so i'll go to insert under illustrations i'll click on pictures and select this device i'm going to have the same background i'll click on insert all right so you can see we have a picture of a company or
you can say an organization let's just drag this a bit to the right we'll adjust the size also all right now let's copy the title for my profit dashboard so here you can see i have brought up my revenue dashboard and i'll copy the title and the logos that we used for the revenue dashboard and i'll paste them on my new dashboard let's just align it in the center all right the next step let me now go ahead and edit the title so this is actually going to be excel profit dashboard instead of revenue now we'll copy each of the charts that we just created for example the revenue and profit by items then we had profit by sales channel all of these we are going to copy one by one and put them on the profit dashboard so let me just copy a few now we'll paste them here and later on we can make the adjustments let me copy this as well similarly i'll bring the other three charts onto my profit dashboard okay so here on my profit dashboard i have added all the charts and have aligned and reshaped them so that they look good i have also done some formatting for example i have reduced the size of the chart titles now let me go ahead and show you a few more formatting steps that we also did for the revenue dashboard first let's remove the white background from all the charts so i'll select the first chart i'll right click and i'll click on format chart area here under fill i will select no fill next i am going to remove the gridlines so i'll just delete them let me close this now we also have an outline so i'll go to design actually format and i'll remove the outline next i'm going to add a blue box at the back like how we did for the revenue dashboard so let me select a blue box from here and i'm going to paste it here okay let me just select the chart and i'll bring this to the front and i'll move this to the back next i'm going to change the font color all to white so that it's clearly visible and more readable i'll do it for the x axis as well okay so here i have my first chart ready and the same i'm going to do for the rest of the charts okay so now you can see here i have formatted all my charts i've also added a blue background you see here i have also formatted the y labels in terms of millions which is actually the profit similarly here i have added the data labels this is for revenue some of the charts also have the data legends so here you can see the blue color represents offline and the red represents online similarly here you have the legends i've also formatted the map as well okay now the next thing is to make this dashboard more interactive so we'll add our slicers as well as a timeline first let me show you how to add a timeline so i'll select one of the charts and i'll go to insert under insert i have the option to create a timeline so i'll just click on timeline so a timeline is actually based on date columns and since in our data set we only have two date columns one is order date and one is ship date excel has only shown us two columns so i'm going to create my timeline based on order date so i'll select my order date column and i'll click on ok you can see here this is called a timeline i can expand this now this timeline is based on months now if i scroll this timeline you can see here i have my order year 2010 and i have all the 12 months similarly we have it for 2011 then we have it for 2012 all the way till 2017 now you can filter this in terms of years quarters months or days let me just select years now so i have years from 2010 till 2017.
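conceptually a timeline is just a date filter; here is a minimal pandas sketch of the same idea, assuming a hypothetical orders frame whose order_date column is parsed as datetimes (names and rows are made up)

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2012-03-01", "2012-11-20", "2013-05-05"]),
    "total_profit": [1000.0, 2000.0, 1500.0],
})

# filter to one year, like dragging the timeline to 2012
print(orders[orders["order_date"].dt.year == 2012]["total_profit"].sum())

# or group by quarter for the quarter-level view the timeline offers
print(orders.groupby(orders["order_date"].dt.to_period("Q"))["total_profit"].sum())
```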
let me just squeeze this and i'll place it somewhere here on the right now let me go ahead and create a few slicers for my profit dashboard so i have selected one of the charts and under insert i'll click on slicer you can see it gives me the list of columns from which i can create slicers so i'll select region let me also select country let's say i also want the representative's name or the customer's name and i'll click on ok so here i have created three slicers let me first resize them and i'll place them on the right similarly i'll place the country slicer also then we have the region slicer i'll resize this and i'll bring it here okay the next thing we need to do is to connect all the slicers and the timeline to the pivot tables for the profit dashboard so i'm going to click on the multiple select option and go to report connections here i'm going to select all the pivot tables that are related to profit so here i have selected four and i need one more which is pivot table number 10 i click on ok similarly let's connect my region slicer to all the pivot tables so i right click go to report connections here i'll choose all my pivot tables which are based on profit i'll click on ok let's do it for the country slicer as well click on ok and similarly i'll connect my timeline as well i'll go to report connections and i'll select all the pivot tables related to profit then i'll click on ok let me now go ahead and create another slicer based on sales channel so i am selecting one of the pivot charts i'll go to insert click on slicer and i'll select sales channel and hit ok so i have my sales channel slicer now let me connect it to all the respective pivot tables that are based on profit click on ok now let me just bring it here all right the next thing i want to show is how we are going to use the timeline first so you see we have all the years here from 2010 till 2017 now suppose you want to know the profit that was generated in the year 2012 so i'll just click on this range and now you can see our charts only show information for 2012 so this dot represents the 51 million profit in the year 2012 similarly you can see here the profit by sales channel for 2012 and from the map you can see the different countries and the profit each of these countries made in 2012 if i scroll down you can see the revenue and profit by items now if i select another year let's say 2013 i can just drag this to the right and now you can see our profit by year and sales channel for offline and online you can see the map and the line chart for total profit by year so in 2012 it was 51 million and then it went up to 54 million in 2013.
similarly our map has also changed now there is a sort of an information prompt that we have you can click on this and check the information that excel has shown all right so this is how you can use a timeline now as i said we checked by years you can also see it for months and quarters as well let me just uncheck it i'll send it back to the place where it was and i'll reduce the size okay now suppose you want to check the profit made by different representatives you can select them one by one let's say adam churchill this is the profit generated by adam churchill similarly you can select multiple persons as well now suppose you want to see the profit by different countries so you can use the country slicer let me just bring this to the middle and let's expand our chart a bit okay so here you have the profit by different countries chart i'll just bring this to the front so that you can see it clearly okay now here suppose you want to see the profit generated in let's say the united kingdom you can select united kingdom so this is the map of the united kingdom and it tells you the total profit that was generated in the united kingdom and below you can see the revenue and profit for all the items that were sold in the united kingdom so you had beverages clothes household office supplies and you can see clearly the office supplies item made the highest amount of profit in the united kingdom now you can also select multiple countries let's say i want to know for france as well so my map will change accordingly so now i have united kingdom and france selected and the other charts present in my dashboard change accordingly now i have my country selected as india you can see the map of india here and these were the respective profit values now one thing to note here is these are actually not millions they should be in k that is thousands so please read this as thousands and not millions even for this one this is actually k and not million all right so we have successfully created our second dashboard that is on profit let me just resize this a bit and we'll place it where it was earlier cool so we saw how to create different pivot tables and pivot charts and then we formatted our pivot charts based on our requirements we saw how to edit the colors now let me show you one more thing you can also change the look and feel of the dashboard by going to the page layout tab under page layout you have themes so here you can select different themes currently we are on the office theme now let me just select another theme let's say facet you see the colors have changed and it looks really beautiful similarly let me try out another theme let's say organic you see our chart has changed let me just delete this okay so once you change the theme the text also changes a bit you can see the slicers are in a different font let me explore one more theme let's say this time i am going to choose depth and this is more of a green type of color you can play around and select whatever theme suits you best all right now let me just move back to my revenue dashboard and see how it looks there you go so since we changed our theme even our revenue dashboard is also impacted so this is how it looks now you can always go ahead and play with different themes colors fonts and effects all right so in this demo we saw how to create a revenue dashboard so we created line charts a combo chart a pie chart horizontal and vertical bar charts and then we learnt how to add slicers and connect them to different pivot tables and we filtered our data to see revenue as well
as profit by items by countries by different regions and sales channel we learnt how to create a map and lots more let's quickly see some more examples of doing data analysis using excel and for that we can use some inbuilt add-ins which can be added to our excel sheet so for example if you would want to do descriptive analytics or descriptive analysis on your data say for example getting your descriptive statistics such as your mean median mode and so on we can do that and we can use excel for it so for example if you are given some data say i have temperature price of ice cream and units sold and i would want to have descriptive statistics on this what i can do is i can click on file and here in file you can click on options and within options click on add-ins now within add-ins you have excel add-ins which is selected here so click on say go for example and that shows what add-ins are available and you can choose which ones you are interested in so for example i have chosen analysis toolpak and solver add-in and click on ok now that basically should add more options to your excel so if you click on data here you see data analysis and solver and this is what we would want to use to get our descriptive statistics for these three columns so for example let's say temperature or you can even give the names later once you get your descriptive statistics so for example let's go for data analysis and here it says what are you interested in there is anova two-factor with replication you have correlation covariance descriptive statistics you have histogram so let's click on descriptive statistics click on ok now this one basically asks for your input range so while your cursor is blinking here you can also see it says grouped by so let's give it a range so for example i will select temperature now if i do this i have selected the heading also just look at that and now you need an output range so let's just select this and while your cursor is blinking here let's select some fields and this is where i would want the output now it also asks what options do you want so it has output range and we can then select summary statistics confidence level so i will say summary statistics is what i'm interested in say okay and this says input range contains non-numeric data now why is that because we chose the temperature heading also so click on ok and here we will alter the range so our range should contain only the numeric values on which we would want the descriptive statistics we have the output range already selected we have summary statistics and now you can click on ok and that basically gives your descriptive statistics for temperature so here i can basically give a name for this so i can say temperature and that's my descriptive statistics for temperature maybe i can just do some formatting and that's it so that gives me descriptive statistics for the values here
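for comparison, the same summary comes out of pandas in one call; a minimal sketch with made-up temperature, price and sales values (any resemblance to the sheet's numbers is coincidental)

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [20, 26, 30, 35, 40, 22, 45],
    "price_of_ice_cream": [10, 12, 15, 17, 20, 11, 22],
    "units_sold": [3, 5, 8, 12, 20, 4, 26],
})

# count, mean, std, min, quartiles and max for every numeric column at once
print(df.describe())

# median and mode, the other two stats the toolpak reports
print(df.median())
print(df.mode().iloc[0])
```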
now similarly we can do it for the price of ice cream so what we need is to basically go for data data analysis descriptive statistics say okay now you need to give a range so here i will change my range to these values the output range is already selected now we are interested in summary statistics click on ok and this says output range will overwrite existing data press ok to overwrite data in range i will say cancel now that's not what we want to do we need to give a new range so let's select our new range which is here and now click on ok so now we get the values for the price of ice cream so again we can basically select this and say price of ice cream and we got our descriptive statistics for price of ice cream and like we did earlier i can select this and basically do a merge and center and that gives me descriptive statistics for price of ice cream we could also basically change this now i can go into data and i can go into data analysis descriptive statistics so we know that we had selected this b2 to b8 and this one which is h6 to h19 we would want to shift it maybe two columns up so maybe i can just say h5 and i can manually change it to h17 and let's say okay and we will basically get this and i can get rid of the old one so i can have it in the same range so similarly this one will have to be renamed and i can basically say price of ice cream and that's basically my descriptive statistics for my price of ice cream and similarly we can do it for the third column which is units sold so we would want to have this now let's see we can click on data we can click on data analysis descriptive statistics so we need to give the range correctly so this time our range changes to units sold now we can also say labels in first row if we are selecting the heading so let's do it that way so in my range let me empty this i can basically select this which we know has non-numeric data in the first row so for example i'll say labels in first row i'm interested in summary statistics and this range will now have to be changed from h to basically something like j so let's say j and let's select these values so that should take care of things and now you see you have your units sold you did not have to manually rename it and you have basically got the descriptive statistics so this is how you can simply perform analysis using data analysis here you can basically get your descriptive statistics for your columns and then you can do whatever formatting you need to basically make your data look good now let's look at one more example of data analysis where we may want to look at the frequency of values occurring in a range of values so for example you have been given temperatures and you have been given some bins where you would want to identify how many values fall into the ranges of 0 to 20 20 to 30 30 to 40 40 to 50 and the easiest way to do that would be creating a histogram now a histogram is usually used for data analysis where you would want to look at different variables or say features for example temperature is one such feature maybe there might be one more variable or feature such as sale of ice cream and you would want to see if the increase or decrease in temperature affects the increase or decrease in the sale of ice cream maybe the sale of ice cream is a response based on temperature so it depends so sometimes you may want to find a relationship between two variables whether they are positively or negatively related or you would want to do different kinds of analysis and in certain cases we may want to first do analysis on one single variable look at the frequency of values maybe also look at the defects for which we can use something like a pareto chart so we can go for a histogram and that basically gives us the frequency of values
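before doing it in excel, here is what the same binning looks like in numpy as a minimal sketch with illustrative values; one assumption worth flagging is that excel's histogram tool treats each bin value as an included upper edge while numpy's bins include the lower edge instead, so counts can differ right at the boundaries

```python
import numpy as np

temperatures = [20, 26, 30, 35, 40, 22, 45]  # illustrative values
bins = [0, 20, 30, 40, 50]                   # same bin edges as the sheet

# frequency of values per bin, like excel's histogram output table
counts, edges = np.histogram(temperatures, bins=bins)
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"{lo}-{hi}: {n}")
```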
now how do we do that so we have already added the add-in which is data analysis earlier so we can just use the same thing again here we would want to create a histogram so let's say okay now i have already selected the input range so if you see here my input range is temperature which is also with the heading and i have the bin range which is basically the range of values so for example i can select this and that's my bin range i am also selecting the labels option because i'm using the first row which has the headings temperature and bins now we need to give an output range so for example let's say i would want my data here and that becomes my output range now you can also have a sorted histogram or basically a pareto chart if that's what you are interested in when looking at the frequencies for your different ranges and here i am also selecting chart output because i would want to have a visual histogram which gives us the frequency and it's as simple as this just click on ok and now you get your bins so it basically tells you the frequency of values and here you see 20 but that does not mean it is only talking about the value 20 it is basically talking about the range of 0 to 20 so we have 0 to 20 that is 2 so we can basically say there are two values here with 20 being the upper edge of that bin so that's your 0 to 20 then you have 20 to 30 which shows three values so in that case i can say 26 is one then i can say 30 that's the second one and then basically i can look at 22 so basically this bin does not include 20 as the lower edge but it includes 30 as the upper edge so i do see for 20 to 30 there are three entries similarly we can see the values for 40 and 50 and since we have selected pareto or sorted histogram that shows in descending order what is the highest frequency of values within a particular range so that shows me the highest frequency is five and then you have three and three and then two so this is how we can create a histogram and we can perform analysis on a single variable now as discussed earlier sometimes we may be interested in finding out the correlation between different variables such as say here we have temperature price of ice cream and units sold and we may want to find out the correlation between one variable and another variable or we would want to find out the relationship between variables are they linearly related are they positively related negatively related and so on and for that we can use the correlation option of the data analysis add-in so for example you want to find out the correlation of temperature and units sold and what we can do is we can find that out using a formula so for example if i search for something like correl and let's search so there is a function called correl which we can use and we can use this to calculate the correlation of temperature and units sold so for example let's select this and that's the function so it says give me a first array and a second array so we are interested in finding out the correlation of temperature and units sold so let's select the range of values for temperature and then i am interested in finding out the correlation of temperature and units sold so let's select this and that basically gives me a range of values it gives me the correlation value which is 0.2859 say okay and that's your value so similarly we can do it for temperature and price of ice cream so let's go for correl so that's the function we are interested in you need to give a range of values so here we are interested in temperature and price of ice cream so let's select temperature and then the second array or list of values is price of ice cream let's select that let's close our bracket and here we have the correlation value of temperature and the price of ice cream
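the pandas equivalent is a one-liner per pair, or one call for the whole matrix; a minimal sketch reusing the made-up dataframe from the descriptive-statistics example, so these numbers will not reproduce the 0.2859 seen in the sheet

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [20, 26, 30, 35, 40, 22, 45],
    "price_of_ice_cream": [10, 12, 15, 17, 20, 11, 22],
    "units_sold": [3, 5, 8, 12, 20, 4, 26],
})

# one pair, like excel's CORREL function
print(df["temperature"].corr(df["units_sold"]))

# the full pearson correlation matrix, like the data analysis correlation tool
print(df.corr())
```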
similarly you may be interested in finding out temperature and units sold like what we have done earlier so we can do the same thing based on the function so this is the same as the correlation of temperature and units sold so i can get rid of this one now how do i do it using the data analysis add-in so for that we need to go into data we need to click on data analysis and here you have the option called correlation let's select this now that basically needs an input range so we need the range now i might be interested in finding out the correlation between temperature and price of ice cream and units sold so i've selected all the columns here we will say grouped by columns obviously we need to select labels in first row because that basically takes care of the first row being the heading now for the output range you can just give one single cell and that's where your data will start from or you can give a new workbook so click on ok and that basically gives you the correlation of your different variables and what the values are and we can check these values against the values we have here so we have temperature and price of ice cream and that basically shows me 0.96149 you have temperature and units sold so you have 0.2859 now you can also look at units sold and say for example price of ice cream you can look at these particular values so if i would be interested in finding out what is the relationship between these variables i can easily find it using correlation so i could basically write a formula here and select the cells so here we were selecting columns a and c and here we were selecting a and b now maybe i'm interested in price of ice cream and units sold and if that's what i'm interested in then i will give a range of b2 to b8 and c2 to c8 and similarly you can get your analysis or correlation values so it's very simple in excel and you can use either the data analysis tab and get your correlation or you can use formulas to do that now one more important part of data analysis is doing your sampling now sampling could be periodic sampling or random sampling so sometimes you may want to look at a variable and you may want to get some values based on periodic data that means maybe i'm interested in a range of values i'm interested in seeing a sample of values for a particular period which could basically be a range of values or you could just do a random sampling
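both sampling modes have simple pandas counterparts; a minimal sketch with illustrative values, where iloc[1::2] takes every second value the way excel's period option does and sample() does the random pick (random_state is only there to make the example repeatable)

```python
import pandas as pd

temps = pd.Series([20, 26, 30, 35, 40, 22, 45])  # illustrative values

# periodic sampling with period 2: the 2nd, 4th, 6th ... values
print(temps.iloc[1::2].tolist())

# random sampling: pick 3 values at random
print(temps.sample(n=3, random_state=42).tolist())
```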
so for example if i go for periodic sampling so out of these values which i see here maybe i want a periodic sample with a frequency of two that is every second value occurring here or i would go for random sampling so basically i would want to randomly pick up say three temperature values now how do i do it so for example here i have seven values now if i go for periodic sampling the period value which i need to give has to be less than the total number of input values so for example we can do this let's go in here and let's go for data analysis so we can go for sampling here click on ok and that needs a range of values so we will select a2 to a8 now i could have selected all the values for this one temperature and in that case i can give labels which is going to take care of the first row now here we can go for the number of samples which we are interested in or give a period so let's go for period and say for example i have seven values so what if i select five so if i say five that means i could just get one value so basically when i'm saying five out of seven it picks every fifth value which just gives me one value i can then just give an output range so here i can basically select this cell i'll say okay and now you see it just shows me one value so out of the first range that is i've said five it has given me the fifth value that's your periodic sampling now for example we want more values so let's reduce this period to maybe 2 which basically gives me every second value so i can basically say for example 2 and say ok and then say ok so that shows me 26 then you have 35 then you have 40 and then well this one does not have any more values so that's your periodic sampling now if you go for random sampling that's basically randomly picking up values and you can choose how many values you want so go for data analysis go for sampling i'll go for number of samples and how many you want so for example out of seven values randomly i want three values and i can just give this say okay and then we will do a cancel because we need to change that range so let's select this and say okay and that gives me three random values from these temperature values so we can use excel to do simple sampling and we can choose whether we would want to go for periodic or random sampling hello everyone and welcome to this interesting video tutorial by simply learn today we are going to perform two hands-on projects on covid data analysis using python and tableau this is going to be a really interesting and fun session where i'll be asking you a few generic quiz questions related to coronavirus please make sure to answer them in the comment section of the video we'll be happy to hear from you covid or coronavirus is an ongoing global pandemic of coronavirus disease that emerged in 2019 and was first identified in wuhan china it is defined as an illness caused by a novel coronavirus called severe acute respiratory syndrome coronavirus 2 commonly known as sars-cov-2 on march 11 2020 the who declared covid-19 a pandemic the virus has so far infected over 22 crore people and killed more than 4.5 million people in india there have been over 3.3 crore confirmed cases and nearly 4 lakh 41 thousand deaths have been reported so far this data is according to official figures released by the union ministry of health and family welfare as the world tries to cope with this deadly virus we request all our viewers and their family members to follow all the necessary precautions to avoid getting infected quiz time now let's see our first quiz in this project what does corona in coronavirus mean here are the options is it a beer b respiratory is it c crown or is it d sun this is a very generic question i am sure a lot of you may already know some of you may not please let us know your answers in the comment section of the video we'll be glad to hear from you now in this video we will use three different covid-19 data sets and perform data analysis using python and tableau the project will give you hands-on experience working on real-world data sets and show how you can use the different python libraries to analyze and visualize data and draw conclusions you will learn how to create different plots in tableau and then make a dashboard from the visuals the project will give you an idea about the impact of coronavirus globally in terms of the confirmed cases that were reported the number of recoveries as well as active cases we will also see how india has been affected since the pandemic started and dive into the different states and union territories to learn more about the covid-19 influence and
the vaccination status first let me show you the two data sets that we'll be using so for our first project using python we'll be using the first two data sets covid underscore 19 underscore india and covid underscore vaccine underscore statewise let me open the two data sets okay so this is the first data set you can see here we have the date we have the time then we have the different state names scroll down you can see we have kerala tamil nadu delhi haryana rajasthan punjab telangana and other states then we have columns called confirmed indian national and confirmed foreign national so actually these two columns confirmed indian national and confirmed foreign national we won't be using so in the demo itself we'll be dropping these two columns what we are concerned about are the last three columns the cured cases or the recoveries the number of deaths reported and then the total number of confirmed cases let me just sort the b column that is the date column so that you have an idea about the recent data we have i'll continue with the current selection and sort it you can see here this is till 11th of august 2021 so this data was collected from kaggle it has some discrepancies that we will see in the demo the data is available for free we will provide the link to the data sets in the description of the video so please go ahead and download them the visualizations and results that you will see in the demo are based on the data sets that we'll be using we haven't preprocessed the data to remove outliers or any missing values now before i jump into the demo let me show you the second data set that we are going to use okay so covid vaccine statewise is my second data set let me open it there you go so this is the second data set that we'll be using in the python project you can see we have a column called updated on then we have the state again you can see here there are a few discrepancies here it has taken the country name and not the state name below you can see there are the different state names and you can also see we have information about the total doses administered we have the sessions sites first dose administered then we have the second dose then we have information about male and female doses you can see here the different vaccines administered covaxin and covishield you have sputnik v and here are the different age groups as well and finally if you see we have the male individuals vaccinated the total number of female individuals vaccinated for a particular day we have information about transgender individuals vaccinated and finally we have the total individuals vaccinated each day all right before we jump into the hands-on part let's have a look at the second quiz in this project so here is the second quiz question which is the first country to start covid vaccination for toddlers is it a japan b israel c portugal or is it d cuba this is a very recent development that took place if you watch daily news updates on coronavirus you will definitely be able to answer the question please give it a try and put your answers in the comments section of the video it is really important for our viewers to know the right answer all right so now let's begin with our demo so i am on my jupyter notebook for the first project we are going to use a python jupyter notebook i'll just rename this notebook as covid data analysis project and click on rename all right so first and foremost we need to import all the necessary libraries that we are going to use
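to make the next few steps easier to follow, here is roughly the code that gets typed into the notebook over this part of the demo, written out as one sketch; the file paths are placeholders for wherever you saved the kaggle csvs

```python
import pandas as pd                # data manipulation
import numpy as np                 # numerical computation
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualization
import plotly.express as px        # interactive plots
from datetime import datetime      # date handling

# placeholder paths: point these at your downloaded kaggle files
covid_df = pd.read_csv("covid_19_india.csv")
vaccine_df = pd.read_csv("covid_vaccine_statewise.csv")

print(covid_df.head(10))    # first 10 rows
covid_df.info()             # columns, row count, dtypes, memory usage
print(covid_df.describe())  # summary statistics for the numeric columns
print(vaccine_df.head(7))   # first 7 rows of the vaccination data
```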
numpy is used for numerical computation. then we are importing matplotlib, seaborn and plotly; these three libraries will be used for plotting our data and creating interesting visualizations. finally i'm also importing the datetime module. i'll hit shift enter to run the first cell. now it's time to load our first data set, which is related to the covid-19 cases in india for the different states and union territories. i'll create a variable called covid_df, where df stands for data frame, use my pandas library and call the read_csv function, since our data sets are csv files, and inside double quotes i'll pass in the location where my data sets are present. i'll copy this location, paste it here, change the backslashes to forward slashes, and after that give the file name followed by the extension: covid_19_india.csv. let's go ahead and run it. now, to see the first few rows of the data frame i'm going to use the head function; if i pass in 10 within the brackets it means i want to see the first 10 rows of data. if i run it, there you go: rows 0 through 9, so we have 10 rows of information, and these are the different column names: serial number, date, time, state or union territory, confirmed indian national, confirmed foreign national, cured cases, deaths reported and the confirmed cases. moving ahead, let's use the info function to get some idea about the data set. if i run it you can see it gives us the total number of columns, which is nine, and the total number of entries or rows, 18,110, indexed from 0 to 18,109. you can also see the different column names, the memory usage, and on this side the data types. cool. now we'll use another very important function to get a basic statistical summary of the data set: the describe function. as you can see, describe works on numerical columns only, and it gives measures such as the count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentile values. okay, now let's move ahead and import the second data set, which is related to vaccination. i'll create a variable called vaccine_df and call pd.read_csv, the function present in the pandas library. i'll move to the top, copy the file location, paste it here, and instead of covid_19_india i'm going to say covid_vaccine_statewise, which is the data set that we saw earlier. let me run it. cool, now let's display the first seven rows of information from this data frame using the head function with seven passed in. there you go: rows 0 through 6, and there are 24 columns in total, a lot of which have null values, as you can see. all right, now from the first data set, the covid_df data frame, we'll be dropping a few unnecessary columns, such as the time column, confirmed indian national and confirmed foreign national, as well as the serial number. we don't need these columns, and it's useful to learn how to drop columns for analysis.
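before we walk through the dropping step, here is a minimal collated sketch of the import and inspection steps just described; the file paths are placeholders for wherever you saved the kaggle csv files, so adjust them to your download:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# placeholder paths: point these at your local copies of the kaggle files
covid_df = pd.read_csv("data/covid_19_india.csv")
vaccine_df = pd.read_csv("data/covid_vaccine_statewise.csv")

print(covid_df.head(10))     # first 10 rows of the cases data set
covid_df.info()              # column names, row count, dtypes, memory usage
print(covid_df.describe())   # count, mean, std, min, max, percentiles for numeric columns
print(vaccine_df.head(7))    # first 7 rows of the vaccination data set
```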
to do that i'll say covid_df dot drop, and within square brackets i'll pass in the column names: the first is the serial number column, then within double quotes the time column, my third column is confirmed indian national, and after another comma, within double quotes, confirmed foreign national. outside the square brackets i'll give another comma and pass in my next argument, inplace equal to true, then another comma and axis equal to one. let's run it. okay, it has thrown an error; let's debug. the error says the confirmed indian national column name is misspelled, so i'll correct the spelling and run it again. okay, now we have removed these four columns; let me show you the data set now. there you go, we have only the date column, state or union territory, cured, deaths and confirmed. now let's see how you can change the format of the date column; for that you have the to_datetime function. i'll say covid_df, pass in my column name, date, and set it equal to pd.to_datetime, the pandas function, of covid_df with the date column, give a comma, and use the format argument: percentage y, a dash, percentage m, another dash, and percentage d. let's run it, and i'll print the head of the data frame. cool, now moving ahead we will see how to find the total number of active cases. active cases is nothing but the total number of confirmed cases minus the sum of cured cases and deaths reported. so let's find the active cases; i'll give a comment first. i'll write my data frame name, covid_df, and within square brackets give my new column, active_cases, set equal to covid_df with the confirmed cases column, minus covid_df with the cured column plus covid_df with the deaths column. this time let's print the last five rows of the data frame. let's run it. it has thrown an error: it says dataframe object has no attribute tails, so this should be tail. there you go, you can see we have added a new column called active cases, which is the confirmed cases minus the sum of the cured and deaths reported columns. now we will learn how to create a pivot table using the pandas library. in this table we'll be aggregating the confirmed, deaths and cured cases for each of the states and union territories, using the pivot_table function. i'll create a variable called statewise and say pd.pivot_table, pass in my data frame, covid_df, and then give the values parameter: inside square brackets i'll pass in my columns, confirmed, deaths and cured. i'll give a comma, and the next argument is index, which is going to be my state slash union territory column; let me bring this to the next line so it is more readable. i'll give a comma now and pass in my last argument, aggfunc, which means aggregate function, and this function will be max. all right, let's run it. okay, now i'm going to find the recovery rate. the recovery rate is basically the total number of cured cases divided by the total number of confirmed cases, multiplied by 100.
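collated, the cleanup and pivot steps above look roughly like this; the column names are assumptions based on the kaggle file, so check them against your copy:

```python
# drop the columns we won't use (axis=1 means columns, inplace=True modifies covid_df directly)
covid_df.drop(["Sno", "Time", "ConfirmedIndianNational", "ConfirmedForeignNational"],
              inplace=True, axis=1)

# parse the date column into proper datetime values
covid_df["Date"] = pd.to_datetime(covid_df["Date"], format="%Y-%m-%d")

# active cases = confirmed - (cured + deaths)
covid_df["Active_Cases"] = covid_df["Confirmed"] - (covid_df["Cured"] + covid_df["Deaths"])
print(covid_df.tail())   # last five rows, showing the new column

# pivot table: max of confirmed / deaths / cured per state or union territory
statewise = pd.pivot_table(covid_df, values=["Confirmed", "Deaths", "Cured"],
                           index="State/UnionTerritory", aggfunc=max)
```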
so i'll say statewise, and within square brackets pass in the column i want to create, recovery rate; this will be equal to the cured cases multiplied by 100, divided by the total number of confirmed cases, where within square brackets i give my column as confirmed. let's run this. okay, i'll just copy this line and paste it here, because this time we are going to find the mortality rate. the mortality rate is nothing but the total number of deaths divided by the total number of confirmed cases, multiplied by 100. so i'm just going to replace the names here: i'll say mortality, and instead of cured i'll use my deaths column, multiplied by 100 and divided by the confirmed cases. let's run it. okay, now we are going to sort the values based on the confirmed cases column, in descending order. let me show you how to do it: i'll say statewise equal to statewise dot sort_values, sorting by my confirmed cases column, give a comma, and say ascending equal to false. let's run it. now we are going to render our pivot table as a nice visual; for that i'm going to use the background_gradient function, and inside that function we'll pass the cmap parameter. i'll say style dot background_gradient, and inside this we'll pass a parameter called cmap, which stands for color map; these are present in the matplotlib library, and there is a nice documentation page on choosing colormaps in matplotlib, provided by matplotlib.org. if i scroll down you can see there are a number of cmaps you can use: purples, blues, something called reds, and others like magma, summer, autumn, spring, winter, cool; you can use whichever color map you want, and here you can see the different shades or gradients. i am going to use the cubehelix color map. let me run it and show you the pivot table. there you go, we have our pivot table ready. now, as i said in the beginning, there are a few discrepancies in the data set: here you can see there's one entry called maharashtra and also a maharashtra followed by three asterisks, which you can ignore; if i scroll down there's madhya pradesh followed by three asterisks, which you can ignore as well, and even bihar has a duplicate. so these rows have been duplicated. here you can see the different state names and union territories, and on the top you have the confirmed cases, cured cases, the deaths reported, and the new calculated columns that we created, recovery rate and mortality rate, all ordered in descending order of confirmed cases. so far our data says that maharashtra has the highest number of cases, followed by kerala, karnataka, tamil nadu, andhra pradesh and uttar pradesh, so these are the top states with the highest number of confirmed cases. the mortality rate is also high for maharashtra, and if i scroll down the mortality rate is also high for uttarakhand, and further down for punjab as well. all right, so this was the first visual that we created in the covid data analysis project.
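here is a sketch of the calculated columns and the styled pivot table just described; any registered matplotlib colormap name can replace cubehelix:

```python
# calculated columns: recovery and mortality as percentages of confirmed cases
statewise["Recovery Rate"] = statewise["Cured"] * 100 / statewise["Confirmed"]
statewise["Mortality Rate"] = statewise["Deaths"] * 100 / statewise["Confirmed"]

# order the states by confirmed cases, highest first
statewise = statewise.sort_values(by="Confirmed", ascending=False)

# render the table with a colour gradient (last expression in a jupyter cell displays it)
statewise.style.background_gradient(cmap="cubehelix")
```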
now moving ahead, we'll see the top 10 states based on the number of active cases. i'll give a comment: top 10 active cases states. here we are going to explore another very important pandas function, group by. i'll first pass in my data frame, covid_df, followed by the groupby function; i'm going to group my data based on the state slash union territory column, then say dot max, to take the maximum value per state, since we want the states with the highest active cases. i have passed in my active_cases column, and we're also keeping the date column. after that we sort the values using the sort_values function, sorting by the active_cases column; let me bring this to the next line. i'll give a comma and say ascending equal to false, then say dot reset_index to reset the index. okay, let's check if everything is fine: i have missed a square bracket here, so let me add it. all of this we are going to store in a variable called top_10_active_cases. now let me go ahead and run this cell. okay, there was a syntax error; now let's run it. next i'll create another variable called fig, where we call plt, for the matplotlib library, and give the figure size using the figsize argument, passing the size as a tuple, let's say 16 by 9. we'll run it. and let's give a title to our plot: here we are going to create a bar plot, so using plt.title we'll pass in the title, top 10 states with most active cases in india, give a comma, and pass in the size of the title, let's say 25. you can see we are slowly building up our graph. the most important thing is to pass in the x axis and the y axis. i'll say ax, for axes, and to define them i'm going to use the barplot function present in the seaborn library: sns.barplot, with data equal to my variable top_10_active_cases. i'm going to use the iloc function, which is for index location, and take the first 10 states, using a colon and 10 as my value. i'll give a comma and set my y axis to active cases, another comma and set my x axis to state slash union territory, another comma, then pass the line width: linewidth equal to 2, and give an edge color, let's say red. okay, so i have my axes defined; let's run it. there's an error: it says top_10_active_states not found, so let's go to the top and check. the variable is top_10_active_cases, so let me change states to cases here and run it. okay, the x axis also has a mistake, it should be state slash union territory; now let me run it. there you go, we have our plot created, but as you can see the labels of the different states and union territories are overlapping. so let me pass in the axis labels: i'll say plt.xlabel, and my x label will be states; then plt.ylabel, and my y label will be total active cases; and finally i'll write plt.show. before i run it, let's collate all the lines of code we have written for the top 10 states with most active cases in india into one cell. so i'll copy this and keep adding to this cell: i'll go to the top, copy my figure size and paste it here, next take the title and put it here, give a space, then copy this cell and paste it here. all right, now it's time to run it. there you go, we have a nice bar plot ready.
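the collated cell, roughly as described, looks like this; it assumes the covid_df data frame, plt and sns from the earlier cells, and the column names from the kaggle file:

```python
# top-10 active cases per state, built with groupby + sort_values + reset_index
top_10_active_cases = (covid_df.groupby("State/UnionTerritory")
                       .max()[["Active_Cases", "Date"]]
                       .sort_values(by="Active_Cases", ascending=False)
                       .reset_index())

fig = plt.figure(figsize=(16, 9))
plt.title("Top 10 states with most active cases in India", size=25)
ax = sns.barplot(data=top_10_active_cases.iloc[:10], y="Active_Cases",
                 x="State/UnionTerritory", linewidth=2, edgecolor="red")
plt.xlabel("States")
plt.ylabel("Total Active Cases")
plt.show()
```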
on the top you can see the title, top 10 states with most active cases, and you can see the edges are in red for all the bars. on the x axis you have the different state names: maharashtra, karnataka, kerala, and also andhra pradesh, gujarat, west bengal and chhattisgarh. as you can see, maharashtra has the highest number of active cases based on our data, followed by karnataka, kerala and tamil nadu at second, third and fourth place respectively, with west bengal in ninth place and chhattisgarh in tenth. on the y axis you can see the total active cases, which are in lakhs. okay, now moving ahead we'll see the top 10 states based on the total number of deaths reported. i'll give a comment: top states with highest deaths. i'll first create my variable, top_10_deaths; this will be very similar to what we did earlier, we just need to change a few column names, so instead of active cases we'll be using deaths. i'll start with my data frame, covid_df, followed by the groupby function, grouping my data based on the state slash union territory column; then i'll choose the max function, and within double square brackets pass in my column names, deaths and date. let's make it consistent and use single quotes. then i'll say dot sort_values, since i'm going to sort my result by the deaths column; i'll give another comma and say ascending equal to false, which means i want my result in descending order. then i'll say dot reset_index. okay, after that i'll give my figure size: plt.figure, and within brackets pass in the figsize argument as a tuple, 18 by 5. now let's give a title, using the title function present in the matplotlib library: top 10 states with most deaths, with a size of 25. now it's time to give the axis labels; i'll just scroll down. i'll say ax equals, and again this will be a bar plot, so i'm using my seaborn library followed by the barplot function. my data will be top_10_deaths, which is this variable. i'll give my index location with iloc, and i'm going to choose 12 states; the reason is there are some discrepancies in the data, which i'll show you once i plot this result. on the y axis we'll have the deaths column, and on the x axis state slash union territory. i'll give another comma, say linewidth equal to 2, and give an edge color to our bars like we did earlier, let's say black. finally we'll give the x label and the y label: plt.xlabel will be states, and my y label will be total death cases; then i'll write plt.show. now let's run it. there you go, we have a nice bar plot, and on the top we have the title, top 10 states with most deaths. now, what i specifically wanted you to see were these discrepancies in the data: maharashtra is repeated twice, even in the data we collected from kaggle, and karnataka's spelling has an error, with a few rows where the name is misspelled. to ignore these two bad rows i had given my index location till 12, so we have maharashtra, karnataka, tamil nadu, delhi, then uttar pradesh, west bengal, kerala, punjab, andhra pradesh and chhattisgarh as the states with the most deaths reported. okay.
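the deaths plot follows the same pattern as the active-cases cell; the only real difference is slicing with iloc to 12 so the two bad rows can be dropped from the top 10:

```python
top_10_deaths = (covid_df.groupby("State/UnionTerritory")
                 .max()[["Deaths", "Date"]]
                 .sort_values(by="Deaths", ascending=False)
                 .reset_index())

plt.figure(figsize=(18, 5))
plt.title("Top 10 states with most deaths", size=25)
# iloc[:12] instead of [:10] so the duplicated / misspelled state rows can be ignored
ax = sns.barplot(data=top_10_deaths.iloc[:12], y="Deaths",
                 x="State/UnionTerritory", linewidth=2, edgecolor="black")
plt.xlabel("States")
plt.ylabel("Total Death Cases")
plt.show()
```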
now we'll create a line plot to see the growth or trend of active cases for the top five states with the most confirmed cases; the states are maharashtra, karnataka, kerala, tamil nadu and uttar pradesh, and i can show you that these are the states with the highest number of active cases. okay. hello everyone, welcome to today's video tutorial by simplilearn. in this video we are going to perform a really interesting data analysis using python on a data set from the spotify music streaming service platform. i'll also be asking you a few questions related to spotify during our discussion, so please make sure to answer them in the comments section of the video. now let's get started. spotify is a swedish audio streaming and media services provider founded in april 2006. it is the world's largest music streaming service provider and has over 381 million monthly active users, which includes 172 million paid subscribers. the total number of downloads of the spotify app on the android store exceeded 1 billion in may 2021. millions of people listen to music all day, and even i am hooked on music; as an analyst, what's better than exploring and quantifying data about music and drawing valuable insights? before i move ahead i have a quiz question for you: the name spotify comes from a combination of two words, so which are those two words? please let us know your thoughts in the comments section below. i would like to repeat the question: the name spotify comes from a combination of two words, so what are those two words? we would love to hear from you, so please put your answers in the comments section below. now let's use python libraries and functions to analyze and visualize our data set. first i'll show you the two data sets we'll be using. here is the first data set for our demo, and then i have my second data set, called spotify features, which is essentially about the genres of the different soundtracks. these data sets have been downloaded from kaggle.com, and the links to the data sets are provided in the description box, so please go ahead and download them. now let me brief you on the columns present in our first data set, which is about tracks. we have column a, id, which is the unique id for each of the songs; then we have the name column, which is essentially the name of the song; then a column for popularity, where the popularity ranges from 0 to 100; then duration in milliseconds, which is the duration of the track. next we have a column called explicit, but we are not bothered about this column because we are not going to use it in our analysis. then we have artists, the name of the artist who composed or sang the song, followed by the id of the artist; then a column for release date, which is the date on which the song was released. then we have a column for danceability: this describes how suitable a track is for dancing, based on a combination of musical elements such as tempo, rhythm stability, beat strength and overall regularity, and the value ranges between 0 and 1. next we have a column for energy: energy is a measure from 0.0 to 1.0 and represents a measure of intensity and activity; typically, energetic tracks feel fast, loud and noisy, and the higher the value, the more energetic the song. then we have a column for key: the key is the pitch, notes or scale of a song that forms its basis, and there are 12 keys, ranging from 0 to 11.
moving ahead we have loudness, the overall loudness of the track in decibels; it ranges from -60 to 0 decibels. then we have mode: songs can be classified as major or minor, where 1 represents major and 0 represents minor. next we have speechiness: speechiness recognizes the presence of spoken words in a track, and the more exclusively speech-like the recording, for example a talk show, audiobook or poetry, the closer the attribute value is to 1.0. then we have a column for acousticness: a confidence measure from 0 to 1 of whether the track is acoustic, where 1.0 represents high confidence that the track is acoustic. then we have information about instrumentalness, and a column for liveness, where liveness detects the presence of an audience in the recording. then we have a column for valence: valence is a measure from 0.0 to 1.0 and describes the musical positiveness conveyed by a track or song. and finally we have the columns for tempo and time signature. now, the second data set has almost the same columns, except for an additional column with the genre of the songs present in the data set. cool, now let's head over to our jupyter notebook and start with our analysis. one more thing to remember: our data has information from 1922 onwards, so all the songs from 1922 till 2021. okay, so i am on my jupyter notebook, and you can see i have a few cells that have already been filled up, so we'll start with our analysis. first of all, let's go ahead and import the necessary libraries: i'm importing numpy, pandas, matplotlib and seaborn for my analysis and visualization; i'll hit shift enter to import the libraries. in the next cell i'm going to load my data set using the pandas read_csv function; i have my location already put here, and let me show you where the data files are located: under chrome downloads i have a folder called spotify datasets. okay, so let's import the data and check the first five rows, using the head function. there you go: here you can see the first five rows of information from the data set, and on the top the different columns; you have id, name, popularity, artists, then the release date, danceability, energy, key, loudness, liveness, valence, tempo and other information. cool, now let's check for null values in the data set; i'll just give a comment, null values. whenever you download a data set from an open repository there is a chance that it contains null values, so it's better to check for them beforehand. i'm going to use the isnull function present in the pandas library; i'm writing pd because i imported pandas as pd, so pd.isnull, then my variable name df_tracks, because i imported my data set and stored it in the variable df_tracks, and then i'll use the sum function to count the total number of null values present in each column of the data set. if i run it, there you go: my name column has 71 missing or null values, and we don't have any null values in the rest of the columns. okay, now let's use the info method, which will give us the total number of rows and columns in the data set, the data types and the memory usage.
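as a reference, the loading and checking steps just described look roughly like this; the folder and file names are assumptions based on the kaggle download:

```python
import pandas as pd

# placeholder path: point this at the tracks csv downloaded from kaggle
df_tracks = pd.read_csv("spotify datasets/tracks.csv")
print(df_tracks.head())

# count null values per column; only the name column should report missing entries
print(pd.isnull(df_tracks).sum())

df_tracks.info()   # row / column counts, dtypes and memory usage
```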
if i run df_tracks.info, you can see that there are a total of 586,601 names of songs present in the data set, while the rest of the columns all have 586,672 entries, so clearly a total of 71 song names or soundtracks are missing from our data set. below that you can see the data types, float, integer and object, and then the memory usage. cool. now, before i move ahead with our next analysis i have another question for you: which artist or musician has the most followers on spotify? i repeat, which artist or musician has the most followers on spotify? please put your answer in the comments section below, we would be happy to hear from you. now let's move ahead and do our first major analysis in this demo: we are going to find the 10 least popular songs present in the spotify data set. i'll create a variable called sorted_df, equal to my data frame df_tracks dot sort_values, with the column name popularity, since i'm going to sort the values based on popularity, and then say ascending equal to true, since i want the least popular songs first; then i'll say head of 10, which means i want the 10 least popular songs. now let's go ahead and print sorted_df. if i run it, you can see we have the list of the 10 least popular songs on spotify: the popularity is zero, and you can see the names of the songs, some of which are not in english, as well as the artist names. cool. now moving ahead, let's see some descriptive statistics for the numerical variables present in our columns: i'll say df_tracks.describe, which is the function to get descriptive statistics, and i'm going to use the transpose function after it. if i run it, there you go: we have the count, mean, standard deviation, minimum value, 25th percentile, 50th percentile, 75th percentile and maximum value for columns like popularity, duration in milliseconds, energy, key, loudness and mode. now, if you look at the popularity column, the minimum value is 0 and the maximum value is 100, the 50th percentile is 27, which is essentially the median, and the standard deviation is 18.37; similarly you can check the other features as well.
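collated, the two cells above look roughly like this:

```python
# ten least popular tracks: sort ascending by popularity and keep the first ten rows
sorted_df = df_tracks.sort_values("popularity", ascending=True).head(10)
print(sorted_df)

# descriptive statistics for the numeric columns, transposed so each row is a feature
print(df_tracks.describe().transpose())
```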
now we'll check the top 10 songs with popularity greater than 90; let me show you how to do it. in this cell i'm going to create a new variable called most_popular, and say df_tracks dot, and this time use the query function, which is part of the pandas library. i'll use the popularity column and set the condition popularity greater than 90; i'll give a comma and say inplace equal to false, because i don't want to change my original data frame, and then sort_values based on popularity in descending order, so ascending equal to false. then let's take only the top 10 popular songs: i'll say most_popular, use square brackets, pass in the slicing operator and say colon 10. now if i run this, there you go: here you can see the 10 most popular songs present in our spotify data set. first we have peaches by justin bieber, daniel caesar and giveon; we also have drivers license, astronaut in the ocean, save your tears, the business, streets and heartbreak anniversary. so these are the most popular songs present in our data set based on their popularity, and you can see peaches has the highest popularity, at 100. all right, now moving to the next cell: here we are going to set the index to the release date column in the main data frame. i'm setting my index using the set_index function, passing my column name, release date, and saying inplace equal to true, which means i want to change the original data frame; then i'm converting the values to datetime format, and we'll print the head of the data set. you can see here we have successfully changed our index: instead of 0, 1, 2, 3 we now have the release date column, and the rest of the columns are intact. cool, now let's move ahead. suppose you want to check the artist in the 18th row of our data set; you can use the index location method for that. let me show you how to filter only specific rows of information from the data set: i'll use my data frame, df_tracks, and using double square brackets i'll give my column name, artists; then i'll use the index location method, iloc, and say i want the artist present in row 18. if i run it, the artist's name is victor voucher. cool, now let's move ahead: we are going to convert the duration from milliseconds to seconds. if you see our data set, we have a column called duration in milliseconds, so all our song durations are in milliseconds; let's convert them into seconds. for that i'm using a lambda function and dividing the values by a thousand, so that they get converted into seconds, and the change is applied so the original data frame is updated. let's run it and make the changes. now we'll print the head of the data set just to check the duration column: i'll say df_tracks.duration.head, and if i run it, there you go, you can see the values have now been changed to seconds. cool. i have the final quiz question for you: who has the most monthly listeners on spotify? please put your answers in the comments section below, we'd be glad to hear from you. now coming to the next cell: here we are going to create our first visualization, which is going to be a correlation map, and we are going to drop three unwanted columns: key, mode and explicit.
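here is a minimal sketch of the four steps just walked through; note that in this sketch the seconds column is written by plain assignment, which is one way to make the change stick in the original data frame:

```python
# tracks with popularity above 90, most popular first, top ten only
most_popular = df_tracks.query("popularity > 90", inplace=False).sort_values(
    "popularity", ascending=False)
print(most_popular[:10])

# use release_date as the index and parse it as datetime
df_tracks.set_index("release_date", inplace=True)
df_tracks.index = pd.to_datetime(df_tracks.index)

# look up the artist sitting at row position 18
print(df_tracks[["artists"]].iloc[18])

# convert track duration from milliseconds to seconds
df_tracks["duration"] = df_tracks["duration_ms"].apply(lambda x: round(x / 1000))
print(df_tracks.duration.head())
```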
then we are going to apply the pearson correlation method. i have set my figure size to 14 by 6, and then we use the seaborn heatmap function to create our correlation map. i have passed the variable correlation_df, which you can see we created above, and then i'm setting annot equal to true, so that the data value is written in each cell. i have set fmt equal to '.1g', which is the string formatting code to use when adding annotations; then i have set my vmin and vmax, which are the values to anchor the color map, otherwise they are inferred from the data, along with other keyword arguments. cmap here stands for color map; you can just search for seaborn cmap and you will get the documentation, so you can choose whichever color palette or color map you want. here i have used inferno, and i have set my line widths and line color. finally i'm giving a title to my correlation map, and i have set the x-axis tick labels. let's go ahead and run this to get our first visualization. if i scroll down, there you go, we have a nice correlation map. on the right side you can see the scale: it ranges from -1 to +1, where -1 means the variables have the least, or negative, correlation, while values above 0.0 mean the variables have a positive correlation. here you can see values like -0.7 for energy and acousticness, which means if the energy is high the acousticness is really low; again, for loudness and speechiness, if the song is loud the speechiness is low, so there is negative correlation. but if you look at energy and loudness, there is a really high correlation between these two variables; the value is 0.8, so if the song is loud it has really high energy, and vice versa. for a few other variables there is negative correlation: between acousticness and danceability, and again between valence and acousticness; similarly, there is positive correlation between energy and valence, which is 0.4, and between danceability and valence, which is 0.5. cool. so from the correlation heat map you can note that acousticness appears to have a strong negative correlation with energy; there is a moderately strong positive relation between loudness and popularity, where the color is orange, meaning it lies in the positive region; and there is also a moderately strong positive relation between danceability and valence. all right, now let's move ahead: we are going to sample our data, take just 0.4 percent of the total data, and create two regression plots using it. let me first sample my data: i'll create a sample data frame from my original data frame, df_tracks, using the sample function, and i'm going to use the int function with 0.004 multiplied by the length of my original data frame. now let's run this and print the length of my sample data frame: if i run it, you can see 0.4 percent of my total data set is 2,346 rows. cool. now we are going to create a regression plot between loudness and energy; in our correlation map we saw there was a positive correlation of 0.8 between loudness and energy, so let's plot it in the form of a regression line.
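before moving on, here is a rough reconstruction of the correlation-map cell described above; dropping the remaining non-numeric id and text columns before calling corr is an assumption added so the sketch runs cleanly on recent pandas versions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# pearson correlation over the numeric features, after dropping key, mode and explicit
correlation_df = (df_tracks.drop(["key", "mode", "explicit"], axis=1)
                  .select_dtypes(include="number")
                  .corr(method="pearson"))

plt.figure(figsize=(14, 6))
heatmap = sns.heatmap(correlation_df, annot=True, fmt=".1g",
                      vmin=-1, vmax=1, center=0,
                      cmap="inferno", linewidths=1, linecolor="black")
heatmap.set_title("Correlation heatmap between variables")
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=90)
```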
i'll use plt.figure and set my figure size to, let's say, 10 by 6. then i'll say sns.regplot, using the function called regplot, with my data as sample_df; i'll give a comma and say the y axis will have my column loudness, and the x axis will have energy. i'll give a color to my data points, let's say the color is c, and set my title to loudness versus energy correlation. all right, let's make sure everything is fine, then run it and see the result. there you go: you can clearly see there is a very high positive correlation between loudness and energy. on the y axis we have loudness and on the x axis we have energy, and you can see all the data points, the songs, lie in one direction: if the energy increases, the loudness of the song also increases, and if the loudness of the song decreases, the energy of the track decreases as well. so there is a very high positive correlation, and you can see the regression line increasing gradually. cool. now, similarly, i'll just copy this code and we'll see another regression plot, this time for two different features. let's have popularity on the y axis, so i'll say popularity, and on the x axis let's have acousticness. i'll change the color to b, which stands for blue, and set the title to popularity versus acousticness correlation; scroll down and run it to see the result. there you go: we have the different points for the songs, and here you can see the regression line goes downward, which means if the acousticness of a song increases its popularity decreases, and if the popularity increases the acousticness decreases; you can see the downward trend of the regression line. all right, now in the next cell we are going to create a new column called year from our release date column, which i have converted to datetime format; let me just run it. cool. after that we are going to create a distribution plot to visualize the total number of songs from each year since 1922 that are available on the spotify streaming app. i have used my seaborn library and its distribution plot function. one thing to remember: you need to update your seaborn library to a recent version for this; if you haven't done it, use the pip install command with the --user flag, followed by seaborn and the version number. here in the distribution plot we are going to draw a histogram, so i have used kind equal to hist, which stands for histogram. let's run it and see the result. okay, here you can see my distribution plot: it tells us that the number of songs per year in the data set, according to their release date, has increased in recent years, since music became more accessible to people globally with technological advancements. earlier, you can see, there were very few songs available in the 1920s; later the number of songs increased rapidly, and now we have a lot more songs available for people to listen to. cool. now we are going to see the duration of songs over the years, and for that we are again going to create a bar plot. i'll first create a variable called total_dr, for total duration, equal to df_tracks dot the duration column that we created in seconds; then i'm going to set my figure dimensions, using figure_dims for dimensions, equal to 18 by 7. after that i'll have my figure and axes defined: i'll use the matplotlib subplots function and set the figure size equal to my figure dimensions.
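the two regression plots and the year histogram, collated, look roughly like this; extracting the year assumes the index was converted to datetime earlier, and displot needs a reasonably recent seaborn, as mentioned above:

```python
# sample roughly 0.4 percent of the tracks so the scatter stays readable
sample_df = df_tracks.sample(int(0.004 * len(df_tracks)))

plt.figure(figsize=(10, 6))
sns.regplot(data=sample_df, y="loudness", x="energy", color="c").set(
    title="Loudness vs Energy Correlation")

plt.figure(figsize=(10, 6))
sns.regplot(data=sample_df, y="popularity", x="acousticness", color="b").set(
    title="Popularity vs Acousticness Correlation")

# year column from the datetime index, then a histogram of songs per year
df_tracks["year"] = df_tracks.index.year
sns.displot(df_tracks["year"], discrete=True, aspect=2, height=5, kind="hist").set(
    title="Number of songs per year")
```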
then i'll create the bar plot with the seaborn barplot function, saying my x axis is years and my y axis is the total duration, total_dr, that we created; i'll set the axes equal to ax, and then set the error width to false. let's set the title of the plot to years versus duration, and finally i'll say plt.xticks and rotate the labels by, let's say, 90 degrees. all right, let me recheck once that everything is fine, the axes, error width and title, and then run it. we'll see the result in a moment: if i scroll down, you can see we have the bar plot for the different years, with the duration of the songs in seconds on the y axis. in the 1920s the duration of the songs was shorter; later it increased, around the late 1930s, and remained consistently high until 2010, but after 2010 you can see the duration of the songs has started decreasing. now, in the next cell i'm going to create a line plot to analyze the average duration of the songs over the years. it is going to be similar to our bar plot, just visualized as a line. i have my code ready: you can see i've used my seaborn library and the lineplot function, on the x axis i have years and on the y axis total duration, i've set my title to years versus duration, and i'm rotating my x labels by 60 degrees. let's run it and see the output. there you go: if i scroll down you can see a nice line plot, with the years on the x axis and the duration on the y axis. we can see that the songs from the 1920s to the 1960s have comparatively shorter durations, since most of those songs tended to be more singing-based rather than instrument-based; after the 1960s the duration of the songs started increasing, until about 2010, and in the present day song durations have started declining, since the attention span of the average listener is also declining. all right, now let's move to our second data analysis project, which is based on the genres of the songs. i'm importing my data set using the pandas read_csv function; i have given my location, and here i have my data set name followed by the extension, since this is a csv data set. let's go ahead and run it. okay, now let's print the first five rows of the data set: i have stored it in a data frame called df_genre, and i'll use the head function to get the first five rows of information; i'll hit shift enter to run it. there you go: you can see i have my genre column, artist name, track name, track id, popularity, acousticness, duration in milliseconds again, and the rest of the columns that we saw in the first data set. just one thing to note here: key is present in terms of c, d, e, c minor, f minor and so on, not as numbers between 0 and 11. now we'll see the duration of the songs for the different genres, and for that i'm going to create a bar plot. i'll start by setting the title of my plot to duration of the songs in different genres; i'll use the seaborn library and set my color palette to, let's say, rocket, give a comma and say as_cmap equal to true, so this will be the color palette. now i'll say sns.barplot, with genre on the y axis and my duration column, duration_ms, which is in milliseconds, on the x axis; then i pass my data frame using the data argument, so data equal to df_genre. next we'll set the axis labels: i'll say plt.xlabel, where my x label is duration in milliseconds, and the y label will be genres.
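for reference, the genre data set load and the duration-by-genre bar plot look roughly like this; the file name is an assumption based on the kaggle download:

```python
# second data set: genre-level features (placeholder path / file name)
df_genre = pd.read_csv("spotify datasets/SpotifyFeatures.csv")
print(df_genre.head())

# duration per genre as a horizontal bar plot
plt.title("Duration of the songs in different genres")
sns.color_palette("rocket", as_cmap=True)   # colormap choice, as in the walkthrough
sns.barplot(y="genre", x="duration_ms", data=df_genre)
plt.xlabel("Duration in milliseconds")
plt.ylabel("Genres")
```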
now let's go ahead and run this. there you go: here you can see we have the different genres on the y axis and the duration in milliseconds on the x axis, and if you look at the graph, for the classical genre and for songs that belong to the world genre the durations are longer compared to other genres, while for the children's music genre the duration is the least. cool. and finally we'll move to our last demo, where we'll see the top five genres by popularity. i'll say sns.set_style and set my style to darkgrid, which will be my background, and then plt.figure with a figure size of 10 by 5. then i'll create a variable called famous, since i want to take only the most popular songs based on genre: i'll pass my data frame, df_genre, and sort the values based on my popularity column, in descending order, so ascending equal to false, and i'll take the first 10 values; i'll tell you the reason why i'm taking the first 10 values and not 5. then i'll say sns.barplot, with genre on the y axis and popularity on the x axis; i'll give a comma and, using the data argument, say my data is famous, which is the variable we created, and then set my title to top 5 genres by popularity. now, the reason i took head of 10 is that a few genres are repetitive: you can see children's music appearing twice, hence we have taken 10 instead of 5. let me just go ahead and run it. there you go: if i scroll down you can see my top 5 genres based on popularity, and we have dance, pop, rap, hip-hop and reggaeton as the five genres which are most popular based on the data we have collected from spotify. all right.
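collated, the top-genres cell looks roughly like this:

```python
# top genres by popularity; head(10) because a few genre rows are duplicated
sns.set_style(style="darkgrid")
plt.figure(figsize=(10, 5))
famous = df_genre.sort_values("popularity", ascending=False).head(10)
sns.barplot(y="genre", x="popularity", data=famous).set_title(
    "Top 5 Genres by Popularity")
```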
hi everyone, welcome to this really interesting video on data analysis of the 2021 world happiness report using python. today we will perform some exploratory data analysis using python libraries to analyze, visualize and draw insights from the 2021 world happiness data. before i begin, make sure to subscribe to the simplilearn channel and hit the bell icon to never miss an update. first let's understand what the world happiness report 2021 is all about. the international day of happiness has been celebrated every year since 2013, on the 20th of march, to emphasize the importance of happiness in the daily lives of people. the united nations sustainable development solutions network published the world happiness report on the 19th of march 2021; it ranks 149 of the world's countries on how happy their citizens perceive themselves to be, based on various indicators. the happiness study ranks the countries on the basis of questions from the gallup world poll, and the results are then equated with other factors such as gdp, life expectancy, generosity and so on. this year it focused on the effects of the covid-19 pandemic and how people all over the world have managed to survive and prosper. using this 2021 data we will answer critical questions, such as which are the top 10 most corrupt countries; we will plot a graph to understand how the happiness score is related to the freedom to make life choices; and we will look at the life expectancy of the 10 happiest and 10 least happy nations. these are a few examples, but we will explore the data in more detail in our demo session. let's get started. first i'll show you the data set we'll be using in this demo: this data has been collected from kaggle, so let me show you that. this is the csv data set we have downloaded from kaggle.com, the world happiness report 2021, and we will share the data set link in the description of the video; you can click on the link to download it. now, we have information about 149 countries; you can see it here, the count is 149. let me go to the top and run through the columns in this data set. the first column is the country name, so we have 149 different countries; then we have something called the regional indicator, which we can simply call the region. you can see we have different regions, and i have applied a filter: we have central and eastern europe, then the commonwealth of independent states, which includes countries such as russia, then east asia, latin america and caribbean; we also have south asia, southeast asia, sub-saharan africa, western europe and other regions. i'll just cancel this. then we have the happiness score column, which has been sorted in descending order: finland, denmark and switzerland are the top three happiest nations, and if i scroll down we have countries like rwanda, zimbabwe and afghanistan, which are the three least happy countries. now, there are a few columns we won't be using in our analysis, so we will learn how to exclude them and keep only the relevant ones: columns such as the standard error of the ladder score, and the upper whisker and lower whisker columns, we are going to ignore. we are only concerned with the gdp column, which is this one, then the social support column, then the healthy life expectancy, freedom to make life choices, generosity and perceptions of corruption columns; there are a few other columns to the right which are not of interest, so we are going to ignore them. now let's head over to our jupyter notebook and start by importing all the necessary libraries for analysis and data visualization. okay, so i am on my jupyter notebook. first step, i'll rename this notebook to, let's say, happiness report data analysis, and click on rename. all right, now we'll start by importing our libraries: the first library i'm going to import is numpy as np, then pandas as pd, then the two data visualization libraries, seaborn and matplotlib, so i'll say import seaborn as sns and import matplotlib.pyplot, which is the module name, as plt, and then %matplotlib inline. now let me just go ahead and run this. all right. now let's set the parameters that control the general style of the plots; the style parameters control properties like the color of the background and whether a grid is enabled by default. for that i'll say sns.set_style and give it darkgrid. next i'll say plt.rcParams, which stands for runtime configuration parameters, and set my font size to, let's say, 15; then i'll say plt.rcParams again, this time to set the figure size, figure.figsize, to, let's say, 10 by 7. next i'll copy this and paste it here, because now we are going to set the face color: i'll say figure.facecolor, and i want to set it to a peach color, so i'm going to pass in the value as a hex code, which for peach is #FFE5B4. now let me run it. okay, now it's time to load our data set: for that i'll create a variable called data and use the pandas library followed by the read_csv function.
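here is a sketch of the setup cells just described; the peach hex value and the file path are assumptions, so substitute your own:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# global style settings used by every plot that follows
sns.set_style("darkgrid")
plt.rcParams["font.size"] = 15
plt.rcParams["figure.figsize"] = (10, 7)
plt.rcParams["figure.facecolor"] = "#FFE5B4"   # peach background (assumed hex value)

# placeholder path: point this at the kaggle csv
data = pd.read_csv("world-happiness-report-2021.csv")
print(data.head())
```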
our data set, as we saw, is a csv file, so inside the parentheses i'll pass in the location of my data file. i have my data here, world happiness report 2021; i'll just copy this location and paste it here, making sure the location is within quotes, and you need to change the slashes to either forward slashes or double backslashes. here i'm using double backslashes, so let me include one more backslash, and then i'll pass in the file name, world-happiness-report-2021.csv, with the extension of the file. now let me run it. all right, now to display the first five rows of information you can use the head function; i'm writing data, which is the variable that holds the data frame, so data.head. there you go: we have printed the first five rows from the data set; you have the country name, regional indicator, happiness score, then information about the gdp, life expectancy, generosity, then corruption data, and some columns that we are not bothered about, which we are going to drop from our analysis. now we are going to do that: i'll create a variable called data_columns with the columns that are of interest to us. i need the country name, so i've taken country name, making sure the column names are within single quotes; next i want the second column, regional indicator; give a comma, we also need the happiness score; next i need the logged gdp per capita data, so i'll take that column; give a comma, my next column will be social support; give a comma here, my next column will be healthy life expectancy; another comma, and we'll take the next column as well, freedom to make life choices; and finally we'll take the last two columns of interest, generosity and perceptions of corruption, so within single quotes i'll say generosity, give a comma, and include the final column, perceptions of corruption. let me have a recheck to ensure i have put the column names in correctly, otherwise it will throw an error. now let me go ahead and run this cell; i'll hit shift enter. all right, we have successfully listed the columns that we'll be using for our analysis. now i'm going to say data equal to data, pass in my new variable, data_columns, and copy all the data, so i'll say dot copy. let's run it. okay, this should be data_columns; all right. now let's rename all these columns to make them simpler and easier to understand. i'll say my new variable is happy_df, where df stands for data frame, equal to data dot rename, and using a dictionary we will rename our columns. i've opened a curly bracket and i'm going to pass in my first column, country name; i'll paste it here, give a colon, and again within single quotes say country_name, which is going to be my new column name. i'll give a comma and take the next column, regional indicator, paste it here, give a colon, and the new column will be regional_indicator, with a small r. now, similarly, we will do this for all the remaining columns in the data set. okay, now i have renamed all my columns, you can see it here; let's run it.
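the column selection and renaming, collated, look roughly like this; the original column names follow the 2021 kaggle csv, where the happiness score column is called ladder score, so verify them against your copy:

```python
# keep only the columns we care about
data_columns = ["Country name", "Regional indicator", "Ladder score",
                "Logged GDP per capita", "Social support", "Healthy life expectancy",
                "Freedom to make life choices", "Generosity",
                "Perceptions of corruption"]
data = data[data_columns].copy()

# rename everything to short snake_case labels
happy_df = data.rename({"Country name": "country_name",
                        "Regional indicator": "regional_indicator",
                        "Ladder score": "happiness_score",
                        "Logged GDP per capita": "logged_GDP_per_capita",
                        "Social support": "social_support",
                        "Healthy life expectancy": "healthy_life_expectancy",
                        "Freedom to make life choices": "freedom_to_make_life_choices",
                        "Generosity": "generosity",
                        "Perceptions of corruption": "perceptions_of_corruption"},
                       axis=1)
```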
now we'll display the head of the data set again: i'll say happy_df.head. there you go, we now have only the columns of interest: the country name, regional indicator, happiness score, gdp, social support, life expectancy, freedom to make life choices, then generosity and perceptions of corruption. cool. now we are going to check whether any of the columns have null values: i'll say happy_df dot, use the isnull function, give another dot, and find the sum for each of the columns. you can see from the output that we do not have any null values in any of the columns in the data set; it's all zeros. okay, now let's get started with our first visualization: we'll create a plot between the happiness score and the gdp for the different regions. i'll give a comment, plot between happiness score and gdp, and scroll down. cool. first i'm going to set the rc parameters: i'll say plt.rcParams, and within square brackets give my figure size, figure.figsize, equal to, let's say, 15 by 7. i'll set the title of my plot with plt.title to, let's say, plot between happiness score and gdp. next i'll say sns dot, and let's create a scatter plot, so sns.scatterplot. i'm going to define the x axis and the y axis for the plot: on the x axis we have my data frame, happy_df dot, and note this should be an underscore, my column name, happiness_score; give a comma, and on the y axis we'll have happy_df dot the gdp column, logged_GDP_per_capita. let's give a comma and pass in hue for the color: for hue i'm going to use the regional indicator column, so happy_df.regional_indicator, and then i'll give the size of the dots as, let's say, 200. i'll give a semicolon and come to the next line; let me just scroll down. now we are going to define the legend: i'll say plt.legend, and in the legend we'll have the location; let's say i want to put the legend in the upper left corner, so loc, which is for location, equal to upper left, give a comma, and then the font size of my legend, let's say 10, making sure this is within quotes. then i'm going to pass my x axis and y axis labels: plt.xlabel, where the x label is happiness score, and my y label is gdp per capita, so plt.ylabel, within single quotes, gdp per capita. there's an error here, this should be plot. all right, so i have written my code to create a scatter plot; let's run it and see the result. there is some error here: okay, this should be regional_indicator, there's a spelling mistake; let's run it again. all right, so we have our scatter plot ready.
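the full scatter-plot cell, with the typos fixed, looks roughly like this:

```python
# scatter plot of happiness score against gdp, coloured by region
plt.rcParams["figure.figsize"] = (15, 7)
plt.title("Plot between happiness score and GDP")
sns.scatterplot(x=happy_df.happiness_score, y=happy_df.logged_GDP_per_capita,
                hue=happy_df.regional_indicator, s=200)
plt.legend(loc="upper left", fontsize="10")
plt.xlabel("Happiness Score")
plt.ylabel("GDP per capita")
```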
on the top you can see the title of the plot, plot between happiness score and gdp; on the x axis we have the happiness score, from 0 to 8 and above, and on the y axis the gdp per capita. if you look at this region, we have countries from western europe, which have the highest happiness scores and also the highest gdp per capita. around this region you can see green, and in the legend green is for sub-saharan africa: all these countries have low happiness scores, and their gdp per capita is also low. and if you look at the countries from latin america and the caribbean, a lot of the values lie here: they are all within the range of 5.5 to 7 in happiness score, and the gdp per capita is more than nine for most of them. the happiness score is also high for the countries in the north america and anz region, and their gdp per capita is among the highest; i can name a few countries, such as australia, new zealand, canada and the united states of america, which belong to the north america and anz region. cool. now, there is one country here which seems to be an outlier: it has the lowest happiness score, and its gdp per capita is also low, but not the lowest, because there are a few countries up here from sub-saharan africa which have a lower gdp per capita but a higher happiness score than this country. so we can assume this country is afghanistan, which has the lowest happiness score as per the 2021 happiness report data. cool. now we'll plot a pie chart to understand the gdp by region; with this we can see which region has the highest percentage contribution to the world's gdp as per our data. for that i'll create a variable, gdp_region, equal to our data frame, happy_df, with the group by function; i'm going to group by the region column, which we named regional_indicator, and then sum the values of gdp, using the logged_GDP_per_capita column followed by the sum function. let me just print gdp_region: you can see here we have, for each region, the sum of the gdp across all its countries. now we are going to plot this data as a pie chart: i'll say gdp_region.plot.pie, and since we want to plot it in terms of percentages i'm going to use a parameter called autopct, passing in the format percentage 1.1 f percentage percentage. then i'll say plt.title, where the title of my pie chart is going to be gdp by region, and plt.ylabel, which is going to be blank. let's run it. okay, so here you can see we have the peach background, because we had assigned a peach face color; it was there for the first scatter plot as well. you can see sub-saharan africa contributing 20.7 percent of the world's gdp, the reason being that we have around 34 countries in sub-saharan africa, and the western european countries contributing 16.2 percent of the gdp. at the other extreme we have the north america and anz region, because it has only four countries, america, australia, canada and new zealand, and hence it contributes only 3.1 percent of the world's gdp.
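collated, the pie chart cell looks roughly like this:

```python
# total (logged) gdp per region, shown as percentage shares in a pie chart
gdp_region = happy_df.groupby("regional_indicator")["logged_GDP_per_capita"].sum()
print(gdp_region)

gdp_region.plot.pie(autopct="%1.1f%%")   # percentage labels with one decimal place
plt.title("GDP by region")
plt.ylabel("")                           # blank y label, as in the walkthrough
```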
Moving ahead, let's find the total number of countries in each region. For this we again use the pandas groupby() function and count the country names per region. I'll add a comment, total countries, and scroll down. I create a variable total_country equal to happy_df.groupby('regional_indicator'), then select the country_name column and apply .count(). Let's print total_country and hit Shift+Enter. So for Sub-Saharan Africa there are in total 36 countries, not 34 as I mentioned earlier; since it has the highest number of countries, it contributes the most to the summed GDP, the 20.7 percent we saw. The fewest countries, only four, are in North America and ANZ; there are six countries in East Asia, 20 in Latin America and the Caribbean, 12 in the Commonwealth of Independent States, and 17 in Central and Eastern Europe. Now I'll show you how to create a correlation map, so we can see the relationship between each pair of variables in our data set. I'll run through the code and we'll see the output. First I compute the correlation matrix with the corr() function, which stands for correlation, using the Pearson method; I have a Wikipedia page open on the Pearson correlation coefficient, a nice article if you want to understand what it is all about. Then I set up the matplotlib figure with the subplots() function and a figure size of 10 by 5, and draw the heat map with a mask using the heatmap() function from the seaborn library. I pass in my variable cor, which holds the correlation matrix, and then mask, a boolean array (it can also be a DataFrame); it is an optional parameter, and where the mask is True the data is not shown in that cell (cells with missing values are masked automatically). After that comes the cmap parameter, which stands for color map and lets you customize the colors of the heat map; I also set square=True so the cells are square, and pass ax, the optional matplotlib axes to draw the plot on (otherwise the currently active axes are used). Let me run this and we'll see the heat map.
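Roughly, those two cells look like this; the upper-triangle mask is my assumption (any boolean array of the matrix's shape works), and numeric_only=True is there so corr() skips the text columns:

    import numpy as np

    # number of countries per region
    total_country = happy_df.groupby('regional_indicator')['country_name'].count()
    print(total_country)

    # Pearson correlation matrix, drawn as a masked square heat map
    cor = happy_df.corr(method='pearson', numeric_only=True)
    mask = np.triu(np.ones_like(cor, dtype=bool))  # hide the mirrored upper half
    f, ax = plt.subplots(figsize=(10, 5))
    sns.heatmap(cor, mask=mask, cmap='Blues', square=True, ax=ax)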
There you go. Scrolling down, here is the correlation matrix; since I gave a blue cmap, it is mostly shades of blue, with the scale at the side. Here is how to read it: wherever a cell is dark blue, the two variables have a very high correlation, and wherever a cell is light blue, grayish or white, the correlation is very low. For example, there is very low, almost negative, correlation between happiness score and perceptions of corruption; obviously, if the citizens of a country feel there is a lot of corruption, the happiness score will be low. There is also low correlation between happiness score and generosity, and between healthy life expectancy and perceptions of corruption. On the other hand, there is really high correlation between happiness score and GDP per capita, and likewise for social support; social support and healthy life expectancy are also highly correlated. There is low correlation between freedom to make life choices and generosity, negative correlation between corruption and freedom to make life choices (the cell is almost white, so it falls at the negative end of the scale), and similarly negative correlation between logged GDP per capita and corruption. Next we'll visualize a bar plot of corruption across regions. I'll add a comment, corruption in regions, and scroll down. I create a variable called corruption equal to happy_df grouped with groupby() on the region column, regional_indicator, then take the perceptions_of_corruption column and apply the mean() function to get the average corruption score in each region. Printing corruption, you can see the values for the different regions: Central and Eastern Europe has the highest perceptions of corruption, as per the answers in the poll, while Western Europe and the North America region have the least. Now we visualize this with a bar plot. First I set the figure size via plt.rcParams (I had typed figsize alone, so let me add the figure prefix) to 12 by 8. I give the plot a title, plt.title('Perception of corruption in various regions'), then an x label, 'Regions', with font size 15, and a y label, 'Corruption index', also with font size 15. I use plt.xticks() to rotate the x-axis labels by 30 degrees, with ha, which stands for horizontal alignment, set to 'right'; make sure 'right' is within single quotes. Finally, plt.bar(corruption.index, corruption.perceptions_of_corruption) draws the bar graph. Let me run it; there is one mistake, it should be xticks and not xtick, so let me run it again. There you go: at the top is the title, the x axis shows the regions and the y axis the corruption index. As per the table we created, perceived corruption is lowest in the North America and ANZ region, next lowest in Western Europe, and highest in Central and Eastern Europe as per their citizens' perception, with Latin America and the Caribbean and South Asia coming second and third highest.
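The whole corruption-by-region cell, as a sketch:

    # average perceptions-of-corruption score per region
    corruption = happy_df.groupby('regional_indicator')[['perceptions_of_corruption']].mean()

    plt.rcParams['figure.figsize'] = (12, 8)
    plt.title('Perception of corruption in various regions')
    plt.xlabel('Regions', fontsize=15)
    plt.ylabel('Corruption index', fontsize=15)
    plt.xticks(rotation=30, ha='right')
    plt.bar(corruption.index, corruption.perceptions_of_corruption)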
Cool, now moving ahead, I'll show you how to find the life expectancy of the top 10 happiest countries and the bottom 10. I'll run through the code and we'll see the visualizations side by side. I have the code in two cells: first I find the top 10 happiest countries with the head() function, passing 10 since I want the top 10 country names, and the bottom 10 countries by happiness score with the tail() function. Let me run it; the results are saved in two variables, top_10 and bottom_10. The two bar plots use two similar pieces of code. In the first, I set the figure size and axes, set the x label to country name, set the title to top 10 happiest countries life expectancy, rotate the x tick labels by 45 degrees with set_xticklabels() and horizontal alignment 'right', and call the barplot function with country name on the x axis and the healthy life expectancy column on the y axis, passing in the axes; the x and y labels are country name and life expectancy. The second cell is the same bar plot for the bottom 10 least happy countries. Let's run it and see the result. There you go, two bar plots: the first for the 10 happiest countries, the second for the bottom 10 least happy ones. On average, the life expectancy in the top 10 happiest countries is above 70 years, so if you are from one of these countries you are expected to live for more than 70 years. In the bottom 10, Lesotho has a life expectancy below 50, and most of the others are below 60 years. So someone from one of the top 10 happiest countries is expected to live roughly 10 years longer than someone from the countries that lie in the bottom 10.
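A condensed sketch of those cells; I'm assuming the life-expectancy column is named healthy_life_expectancy after the earlier renames, and that the rows are already sorted by happiness score so head() and tail() give the two groups. The bottom-10 plot is the same code with bottom_10 swapped in:

    # ten happiest and ten least happy countries
    top_10 = happy_df.head(10)
    bottom_10 = happy_df.tail(10)

    fig, ax = plt.subplots(figsize=(14, 6))
    ax.set_title('Top 10 happiest countries life expectancy')
    sns.barplot(x=top_10.country_name, y=top_10.healthy_life_expectancy, ax=ax)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    ax.set_xlabel('Country name')
    ax.set_ylabel('Life expectancy')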
Now moving ahead, we'll plot freedom to make life choices against the happiness score, again with a scatter plot. First I define the figure size: plt.rcParams (params stands for parameters) with 'figure.figsize' set to 15 by 7. Then, using the seaborn library's scatterplot function, I pass the x axis as the freedom column of my data frame, happy_df.freedom_to_make_life_choices, and the y axis as the happiness score, happy_df.happiness_score. I also pass the hue parameter to differentiate the regions, happy_df.regional_indicator, and give the size of the dots, or bubbles, as 200. Then plt.legend() with the location loc='upper left' and a font size of 12, followed by plt.xlabel('Freedom to make life choices') and plt.ylabel('Happiness score'). The code for the scatter plot is ready; let me run it. There you go. Here is the legend with a different color per region; the x axis shows freedom to make life choices and the y axis the happiness score. You can see very clearly that for the countries in the Western Europe region, the blue dots, the freedom to make life choices is high and so is the happiness score. For the values in the green region, which is Middle East and North Africa, the freedom to make life choices is lower and hence the happiness score is also low. Among the countries of Sub-Saharan Africa, some have a decent score for freedom to make life choices, but the happiness score is still low. The pink dots, the Southeast Asian countries, have a comparatively lower happiness score even though their freedom to make life choices is above 0.8. And again there is one data point lying at the bottom, which we can assume is Afghanistan: both its freedom to make life choices and its happiness score are very low. Moving to our next analysis, we'll look at the 10 most corrupt countries. First I sort the perceptions of corruption column: I create a variable called country equal to happy_df.sort_values() with by set to perceptions_of_corruption, then .head(10). Then I set the figure size with plt.rcParams (make sure there is no spelling mistake) and 'figure.figsize' as 12 by 6, the title plt.title('Countries with most perception of corruption'), the x label 'Country' with font size 13, the y label 'Corruption index' with font size 13, and plt.xticks() rotated by 30 degrees with horizontal alignment 'right'. Finally the bar plot function, plt.bar(country.country_name, country.perceptions_of_corruption). If I run it, though, these are actually the countries with the least perceptions of corruption: Singapore has the lowest corruption index, then Rwanda, Denmark and Finland. Since we sorted in ascending order, to see the countries with the highest perceptions of corruption you need to change head(10) to tail(10). If I run that, there you go: these are the countries with the most perceived corruption. We have Slovakia, Lesotho, Kosovo, Ukraine, Afghanistan, Bulgaria, Romania and Croatia, all with a corruption index above 0.85.
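Sketches of those two cells, under the same column-name assumptions as before:

    # freedom to make life choices vs happiness score, colored by region
    plt.rcParams['figure.figsize'] = (15, 7)
    sns.scatterplot(x=happy_df.freedom_to_make_life_choices,
                    y=happy_df.happiness_score,
                    hue=happy_df.regional_indicator, s=200)
    plt.legend(loc='upper left', fontsize=12)
    plt.xlabel('Freedom to make life choices')
    plt.ylabel('Happiness score')

    # the column sorts ascending, so tail(10) gives the most corrupt countries
    country = happy_df.sort_values(by='perceptions_of_corruption').tail(10)
    plt.rcParams['figure.figsize'] = (12, 6)
    plt.title('Countries with most perception of corruption')
    plt.xlabel('Country', fontsize=13)
    plt.ylabel('Corruption index', fontsize=13)
    plt.xticks(rotation=30, ha='right')
    plt.bar(country.country_name, country.perceptions_of_corruption)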
Now, coming to the final section of this video on the 2021 happiness report data analysis, I want to visualize a scatter plot that tells us how corruption varies with the happiness score, so I'll add a comment, corruption versus happiness. First I set the figure size with plt.rcParams and 'figure.figsize' as 15 by 7. Then I use the scatterplot function from the seaborn library: on the x axis we'll have the happiness score, happy_df.happiness_score, and on the y axis the corruption column, happy_df.perceptions_of_corruption. For hue I again pass the region, happy_df.regional_indicator, and I give the size of the dots as s=200. Then plt.legend(), placed at the lower left corner this time, so loc='lower left', with a legend font size of 14. After that, the labels: I first typed the x label as corruption and the y label as happiness score, but that should be the opposite, since the happiness score is on the x axis; so the x label is happiness score and the y label is corruption. The code is ready; let me run it. There you go. On the lower left are the regions, on the x axis the happiness score, on the y axis the corruption index. The general trend in this scatter plot is that countries with a greater happiness score have a lower corruption index. These countries from the Western European regions have the highest happiness scores and really low corruption; among the blue dots you can name Finland, Sweden, Belgium, France and the Netherlands. We also have a few countries from North America and ANZ, which are essentially Australia, Canada, the US and New Zealand. Now focus on this green region: these are the countries of Sub-Saharan Africa, and most of them have a happiness score below five while the corruption index is really high, almost above 0.7. There is also one country here from Southeast Asia, and we'd like to give you a task: it would be really great if you can tell us in the comments section which country this is. It has a happiness score above six, but its corruption index is really low, below 0.2. If you consider the Middle East and North African regions, the darker green dots, the happiness score is below 5 and the corruption index is also high. Finally there are a few countries of the Commonwealth of Independent States, for example Uzbekistan, Tajikistan, Russia and Armenia, plus Georgia and Ukraine, shown as the grey dots: for these the corruption index is below 0.6 and the happiness score is above 5.
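That final cell, sketched:

    # corruption index against happiness score, colored by region
    plt.rcParams['figure.figsize'] = (15, 7)
    sns.scatterplot(x=happy_df.happiness_score,
                    y=happy_df.perceptions_of_corruption,
                    hue=happy_df.regional_indicator, s=200)
    plt.legend(loc='lower left', fontsize=14)
    plt.xlabel('Happiness score')
    plt.ylabel('Corruption')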
All right, with that we have come to the end of this demo session on the World Happiness Report 2021 data analysis using Python. Hello everyone, welcome to this exciting video on Olympics data set analysis using Python by Simplilearn. This is going to be a very interesting and interactive session where I'll give you a brief history of the Olympics and throw some light on the recently held Tokyo Olympics, but we'll focus more on using an Olympics data set that is available on Kaggle and performing some exploratory data analysis. You will understand how to use different functions in Python to analyze and extract meaningful information. During the course of our discussion I'll be asking you a few general questions related to the Olympics; try to answer them in the comments section of the video. So let's begin. The Olympics is one of the biggest sporting events on the planet, held every four years. The first modern Olympics took place in Athens, Greece, in 1896. As per National Geographic, the original Olympics took place in 776 BC; they began as part of an ancient Greek festival that celebrated Zeus, the Greek god of sky and weather. The rings in the Olympics logo represent the five continents: Europe, Africa, Asia, the Americas and Oceania. From 1924 to 1992 the Winter and the Summer Olympics took place in the same year, but now they alternate every two years. Before I move on, here is an interesting question for you: only two people have ever won gold medals at both the Summer and the Winter Olympics. Who are those two people? Please share your answers in the comments section of the video; we would like to know. The Summer Olympics in Tokyo began on the 23rd of July and recently concluded on the 8th of August. We got to witness some thriller matches that went down to the wire, some amazing victories, and sadly a lot of heartbreaks as well; winning and losing are part and parcel of any game. Fans across the world were really happy to see this global event happen this year, following last year's postponement due to the coronavirus pandemic. With our Olympic fever high, let's take this opportunity to work on a project that performs exploratory data analysis using Python to analyze and visualize past Olympics data and answer specific questions. In this video we will use Python libraries such as NumPy, pandas, Matplotlib and seaborn to make sense of our data and extract meaningful information, and we'll visualize the results with different charts and graphs. Before I show you the data set we are going to use, I have another general question for you: which was the first Olympics where all the participating countries sent female athletes? I repeat, which was the first Olympics where all the participating countries sent female athletes? Put the year and the name of the host city in the comments section of the video; we'll be more than happy to hear from you. Now let me show you the data sets we'll be using. In my Olympic data set folder there are two CSV files: one called athlete_events and the other called noc_regions. These data sets can be downloaded from Kaggle; we will post the links in the description of the video, so please go ahead and download them. They have been taken directly from the internet.
The first data set, athlete_events, has around 15 columns (you can see the count here), and there are nearly 271,116 rows of information, so this is a huge data set. For the other data set, noc_regions, one thing to note: NOC stands for National Olympic Committee, and the NOC column is a three-letter code given by the Olympic committee. We also have the region names, so AFG is for Afghanistan, ALB for Albania, ALG for Algeria, and there is a column called notes with some remarks about the region; for example, Antigua here is actually Antigua and Barbuda. Back to the first data set, the primary one for our demo: these data sets were taken directly from the internet and were not validated, so the results you will see in this demo are purely based on the data we have collected. The file athlete_events.csv contains nearly 271,116 rows and 15 columns, and each row corresponds to an individual athlete competing in an individual Olympic event. Here, id is a unique number for each athlete; then we have the name, which is the athlete's name, and the sex or gender, M for male or F for female. Then the age of the athlete, an integer; the height in centimeters; the weight in kilograms; and the team name, which is the country name, such as China, Denmark or the Netherlands (there are around 200 country names). Then we have NOC, which as I said is the three-letter National Olympic Committee code, and games, which contains the year and the season, for example 1992 Summer; the Winter Olympics information is there as well. There is a specific year column, also an integer, telling you in which year the event occurred, and a season column for Summer or Winter Olympics. Then the city name, which is the host city, and the sport name: basketball, judo, football, tug of war, speed skating and many more. There is another column called event, which is the complete event name: the sport may be basketball but the event is men's basketball, to be very specific, and for speed skating there are categories like women's 1,000 metres and 500 metres. The final column is the medal column, which holds the information about whether the athlete won a medal, be it gold, silver or bronze; NA means the athlete did not win any medal. So we'll use these two data sets. Now let's get started with the demo. We'll be using Jupyter Notebook for our analysis, so I'll take you to my Jupyter notebook right away; I've opened it in Chrome. This is the notebook we are going to use, titled Olympics data set analysis; I already have a few cells filled with some code, and you can see some comments written as well. First and foremost, we import the libraries: NumPy, pandas, Matplotlib and seaborn. Let me hit Shift+Enter to import them all.
Next we load the data sets using the pandas read_csv function. I'll create a variable called athletes equal to pd.read_csv(), and inside the function I give the location of the file: I copy the folder path, paste it within quotation marks (these are all forward slashes), add another forward slash, and then the file name, athlete_events.csv. Close the quotation and run it. Similarly, we load the second data set: I create another variable called regions, use the same pd.read_csv() function, copy the same file path, and give the second file name, noc_regions.csv. You can verify here that the first one is athlete_events and the second is noc_regions. Close the quotation and run it again; okay, there seems to be some error, I think the variable name should be regions. Cool. Now let's see the first few rows of both data sets with the head() function. I say athletes.head(), which prints the first five rows of the athletes data set, rows 0 to 4: the id, name, sex or gender, age, height, weight, and, moving to the right, games, year, season, sport, event and medal. Now the second data set: regions.head(). It says regions is not defined; let me cross-check. I had typed it as region, so let's make it regions and run it. There you go, the first five rows of the second data set: NOC, region and some notes. The next step is to combine both data sets; we'll join the data frames using the pandas merge function. This is going to be a horizontal join. I create a variable called athletes_df, take my first data frame, athletes, and call .merge() to merge in the regions data frame; I use the how parameter with the value 'left', since I want a left join, and the common column in both data sets is NOC, so I merge on the NOC column. Now let's print the merged data frame with athletes_df.head(). There you go: moving to the right, you can see we've added the two columns that were present in the second data set, region and notes, to the athlete events data. One thing to note here is that the column names are not consistent: the original columns start with a capital letter, but the two columns we just added start with a lowercase letter. We'll use the rename function to make the column names consistent, but before that, let's check the shape of the data frame to know the total number of rows and columns. I type athletes, hit Tab so it gives me the prompt, select athletes_df and use the shape attribute. Running it, you can see the total number of rows, 271,116; earlier we had 15 columns in the first data set, and now that we've added two more, the total number of columns becomes 17.
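Condensed, the loading and merging steps look like this; the file path is an assumption, so point it at wherever you saved the Kaggle CSVs:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # adjust the path to your own download location
    athletes = pd.read_csv('olympic-dataset/athlete_events.csv')
    regions = pd.read_csv('olympic-dataset/noc_regions.csv')

    # left join on the shared NOC column adds region and notes to each athlete row
    athletes_df = athletes.merge(regions, how='left', on='NOC')
    print(athletes_df.shape)  # (271116, 17)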
Now it's time to make the column names consistent. I call athletes_df.rename() with columns equal to a dict in curly braces: the first entry maps region to Region, with an uppercase R, and the second maps notes to Notes, changing the first letter to uppercase while the rest stays the same. I give a comma and say inplace=True, so the change is applied to the athletes_df data frame itself. Let's run it, and to verify, run the head() function again: scrolling down and to the right, you can see the difference; we have successfully renamed the last two columns to Region and Notes. Moving ahead, I'll show you how to use the info method, which prints information about a data frame including the index dtype, the column data types, non-null counts and memory usage. You just call athletes_df.info(). Running it, there you go: you see the data columns, all 17 of them, the 271,116 entries, the different column names with their non-null counts and data types, and below that some information about memory usage. Next, the describe method is used for calculating statistical information such as the mean, standard deviation and percentiles of the numerical values of the data frame, and much more; it analyzes both numeric and object series, as well as data frame columns of mixed data types. Let me show you the statistical summary: I just write describe with parentheses and run it. Scrolling down, by default the describe function only reports on numerical columns, so you can see the total count, the mean of each column, the standard deviation, the minimum and maximum values, and the 25th, 50th and 75th percentile values for the columns id, age, height, weight and year. One thing to note: in the year column, the minimum year is 1896, which is when the Olympics started, and the maximum is the most recent Games in the data, the Rio Olympics held in 2016.
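Those three steps as a sketch:

    # capitalize the two merged-in column names for consistency
    athletes_df.rename(columns={'region': 'Region', 'notes': 'Notes'}, inplace=True)

    athletes_df.info()      # dtypes, non-null counts, memory usage
    athletes_df.describe()  # count, mean, std, min, quartiles, max of numeric columns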
All right, moving ahead, let's check whether any null values are present in the columns of the data set. First I create a variable called nan_values equal to athletes_df.isna(), then another variable, nan_columns, where I call the any() function on nan_values, and I print nan_columns. This displays the result as boolean values: if a column has any NaN or missing values it says True, otherwise False. Displaying the result, you can see there are six columns with missing values: age, height, weight, medal, region and notes are True, and the rest of the columns are False. Scrolling down, let's see the total number of null values for those six columns: athletes_df.isnull() followed by the sum() function (I had mistyped isnull; let me fix that). Running it, the age column has 9,474 rows with null values, and there are some nulls in the height and weight columns as well; these are the rows where we have no information about age, height, weight, or region and notes. The medal column is self-explanatory: a lot of the athletes who participate in the Olympics don't win any medal, so for them the value is NaN. Before I move ahead, I have a question for you: I want you to print the column names containing null or missing values in the form of a list. Please put your answer in the comments section of the video; we'd be happy to know your approach. I'll repeat: at the top we saw there were six columns with null values, and I want you to print those six column names as a list. Now, moving ahead, let's see the data for specific countries, say the athletes who have participated in the Olympic Games from the beginning for India. You can filter the result using a function called query: athletes_df.query() with the condition Team == 'India', where Team is the column name, then display the first five rows. On running it there is an error: the entire condition expression must be within a single pair of quotes, so I delete the stray quote and add it at the end. Now it runs: there you go, the top five rows for athletes from India; the region is India, the NOC is IND, the team is India, and moving to the left you see the athlete's name, sex, age (with some missing values), height, weight, team, year and everything else. Similarly you can also check for Japan: I copy the code above, replace India with Japan, and run it. There you go, all the details are for the athletes from Japan.
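A sketch of those cells:

    # which columns contain missing values at all?
    nan_values = athletes_df.isna()
    nan_columns = nan_values.any()
    print(nan_columns)

    # how many nulls per column
    print(athletes_df.isnull().sum())

    # filter rows for one country; note the quotes nested inside the expression
    athletes_df.query('Team == "India"').head()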
All right, moving ahead, I want to know the top 10 countries that have participated since the inception of the Olympics in 1896. I create a variable called top_10_countries equal to athletes_df.Team, then apply the value_counts() function, sort the counts, and take the top 10. Watch the way the functions are written; any error in the syntax or the order will throw an error. Let's print top_10_countries and run it. It gives an error, so let's debug: this should be value_counts and not count. Running again, another error: this should be sort_values and not value. Run it once more and there you go: the top 10 countries participating in the Olympics since 1896, with the number of participants from each. The most participants have come from the United States, then France, Great Britain, Italy, Germany, Canada, Japan, Sweden, Australia and Hungary. Scrolling down, we'll now convert this table into a graph: a bar plot of the top 10 participating countries. I use the figure function to set the figure size, with dimensions 12 by 6, give the plot the title overall participation by country, and then, using the seaborn library and the barplot function, set the x and y axes. The x axis should be top_10_countries (I had typed a different name, so I make it match the variable we used above), and similarly the y axis is top_10_countries; I also use a color palette called Set2. Let's run it. Scrolling down, there is a nice vertical bar plot, with bars for the different country names: first the United States, which has the highest participation since the beginning of the Olympics, then France, Great Britain, Italy, Germany, Canada, and the rest of the countries.
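Roughly, that cell is (note value_counts() already sorts descending; the explicit sort_values is kept to match the narration):

    # ten countries with the most athlete entries since 1896
    top_10_countries = athletes_df.Team.value_counts().sort_values(ascending=False).head(10)
    print(top_10_countries)

    plt.figure(figsize=(12, 6))
    plt.title('Overall participation by country')
    sns.barplot(x=top_10_countries.index, y=top_10_countries, palette='Set2')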
Moving ahead, the next visualization is the age distribution of the athletes, for which we'll create a histogram using the Matplotlib library specifically. I write plt.figure() and give the figure size using the figsize argument, which takes a tuple; my size is 12 by 6. Then plt.title(), where you can give any title you want; I say age distribution of the athletes. We also give the labels: plt.xlabel('Age') and plt.ylabel('Number of participants'). Next I use the hist function, plt.hist(), passing my data, athletes_df.Age, which is my column name, and then the bins; for those I use the np.arange() function, where np is NumPy, with 10, 80 and 2, so the bins start at 10 and go up to 80 in steps of 2. I give another parameter to specify the color of the bins, orange, and then use edgecolor to separate the bins, setting the color to white. I add a semicolon, fix a spelling mistake (it should be edgecolor), and run it. There you go: a nice histogram showing the distribution of the athletes' ages. You can see the title, age distribution of the athletes; the y axis shows the number of participants and the x axis the age values ranging from 10 to 80 in steps of 2. The most athletes are between 20 and 30, with the peak in the early twenties. There are also a few athletes beyond 40 years of age, some even close to 60, and similarly a few under 18. The bars are orange, as we chose, and the edges are white. Moving ahead: in the initial slides we discussed the Summer and Winter Olympic Games, so let's look at the different sporting events that are part of each. Just to give you a heads-up, the Winter Olympic Games are held once every four years for sports practiced on snow and ice. I have a variable called winter_sports, where I filter on season == 'Winter' and keep only the unique sport values; running the cell, these are the different winter sports held during the Winter Olympics. Similarly for the Summer Olympics: I copy the cell and edit it, renaming the variable to summer_sports (athletes_df stays the same), changing the season to Summer, keeping .Sport.unique(), and printing summer_sports, making sure the case matches. Running it, there you go: you can see there are many more Olympic sports held during the Summer Games than the Winter Games; these are the sports played on snow and ice, and these are the ones played in summer. Moving ahead, it's time to analyze the total number of male and female participants across the Games from 1896 until the 2016 Rio Olympics. I create a variable gender_counts equal to athletes_df.Sex.value_counts() and then print gender_counts. Running this, you can see that since the inception of the Olympics there have been more male participants than female participants.
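Those cells, sketched (I've added a dropna() on the ages so the histogram ignores the missing values we found earlier):

    # age distribution as a histogram with 2-year bins from 10 to 80
    plt.figure(figsize=(12, 6))
    plt.title('Age distribution of the athletes')
    plt.xlabel('Age')
    plt.ylabel('Number of participants')
    plt.hist(athletes_df.Age.dropna(), bins=np.arange(10, 80, 2),
             color='orange', edgecolor='white')

    # sports held in each season
    winter_sports = athletes_df[athletes_df.Season == 'Winter'].Sport.unique()
    summer_sports = athletes_df[athletes_df.Season == 'Summer'].Sport.unique()

    # male vs female entries overall
    gender_counts = athletes_df.Sex.value_counts()
    print(gender_counts)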
In the next cell we plot a pie chart of the male and female athletes. I give the figure size, set the title to gender distribution, and use the Matplotlib library's pie function with the gender_counts variable we created above; the labels are gender_counts.index. I use autopct, the parameter that displays the percentage value using Python string formatting (this is my format string), initialize the start angle of the pie chart to 150 degrees, and give the pie chart a shadow by writing shadow=True. Let's run it. You can see the pie chart showing the split of male and female participation: for male it is 72.5 percent and for female 27.5 percent, as per the data set we have. If you change the start angle to, say, 180 degrees, the pie chart rotates in that direction. Moving ahead, this time we find the total number of medals the athletes have won: athletes_df with the Medal column and then .value_counts(). Running it, the counts of gold, bronze and silver medals are very similar to each other; the numbers are pretty much the same. Now it's time to focus on the total female athletes who have taken part in each Olympics. I create a variable called female_participants equal to athletes_df filtered, in square brackets, on athletes_df.Sex == 'F', then an ampersand and a second condition, athletes_df.Season == 'Summer', to check for the Summer Olympics, and I extract only the sex and year columns (we need to add one more pair of square brackets here). Then female_participants is grouped using the pandas groupby() function on year, counted, and the index reset with the reset_index() function. Finally we print female_participants, first the head. So what we are doing is filtering the data to the female athletes at the Summer Olympics and counting the participation per year. Running it, you can see the female participation for 1900, 1904 and all those years; let me change head to tail so we see the recent Games. For the Beijing Olympics 5,816 female athletes participated, at the 2012 London Olympics we had 5,815 athletes, and at the 2016 Rio Olympics participation was higher than at London, with 6,223 women athletes. Here is another way to filter the data for female athletes, using the same two conditions, sex equal to F and season equal to Summer; let me run it, and I store it in a variable called women_olympics. Then I create a count plot using the seaborn sns library: I set the style to darkgrid, the figure size to 20 by 10, and the count plot has year on the x axis, women_olympics as the data, and the Spectral palette, with the title women participation. Let's run it and see the output. Scrolling down, on the top is the title of the count plot, women participation; the y axis shows the count, and along the bottom are the year values from 1900 until 2016, with 2016 having the highest number of female participants.
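A sketch of those cells (the data set codes sex as M and F, so the filter uses 'F'):

    # gender split as a pie chart
    plt.figure(figsize=(12, 12))
    plt.title('Gender distribution')
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%.1f%%',
            startangle=150, shadow=True)

    # female entries per Summer Games, counted by year
    female_participants = athletes_df[(athletes_df.Sex == 'F') &
                                      (athletes_df.Season == 'Summer')][['Sex', 'Year']]
    female_participants = female_participants.groupby('Year').count().reset_index()
    print(female_participants.tail())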
In the next cell we plot a line graph; let me run it and show you the output. This line graph shows the trend of female athletes over time: gradually, women's participation in the Olympics has increased since its inception. There was a slight decrease around the 1950s, and another dip in 1980, but since 1980 there has been a continuous increase in the number of female athletes. Cool, now coming to the next section of our analysis, we filter the data to see the details of the athletes who have won gold medals. I create a variable called gold_medals equal to athletes_df indexed, within brackets, by the condition athletes_df.Medal == 'Gold'. Next, gold_medals.head(): running it, there you go, the top five rows of the athletes who have won a gold medal, and you can see all the records have medal equal to Gold. We'll use this subset of the data for some more analysis. Here I want to take only the values that are different from NaN, so I use the np.isfinite() function with gold_medals' age column inside it; let me just run it. Now I want to see the athletes who secured a gold medal beyond the age of 60, which is very rare. To count them I take gold_medals, select the id column, index it with the condition age greater than 60, close the square bracket and call count(). Let me verify everything is fine and run it: there are in total six athletes who won a gold medal past the age of 60.
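Sketched:

    # every gold-medal row
    gold_medals = athletes_df[athletes_df.Medal == 'Gold']

    # keep only finite ages (drops the NaN ages)
    gold_medals = gold_medals[np.isfinite(gold_medals['Age'])]

    # golds won past the age of 60
    print(gold_medals['ID'][gold_medals['Age'] > 60].count())  # 6 in this data set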
Now let me check which sports those six gold medals came from. I create a variable called sporting_event equal to gold_medals with the sport column selected and the condition age greater than 60, and print sporting_event. Running it, you can see golds for art competitions, archery, shooting, and this one called roque, all from athletes past 60. Next we plot this result, the table we got above: plt.figure() with a figure size of 10 by 5, then plt.tight_layout(), then a count plot from the seaborn library for the sporting_event variable, and plt.title('Gold medals for athletes over 60 years of age'). Running it and scrolling down, there is a nice count plot with the title on top; archery had three players who secured a gold medal past 60, so archery shows three, art competitions one, roque one and shooting one. Now we'll see the total gold medals from each country: gold_medals with the region column, followed by the value_counts() function, then reset_index() with name equal to the medal column, and we print the head for the top five countries. Running it, you can see the USA has secured the most gold medals, then Russia, Germany, the UK and Italy. In this next cell I create a plot to visualize the table I got above: I set the labels and titles and use the catplot function from the seaborn library, with my x axis, my y axis, and the data, which is total gold medals. Let me verify and run it. Scrolling down, this uses a different palette, rocket, and you can see the USA with the most gold medals, then Russia, Germany, the UK, Italy and France; France appears here even though it was not in the table above, because I used head(6), which gives the top six countries. Now we'll analyze the data for the most recent Olympic event in our data set, the 2016 Rio Summer Olympics. First I create a variable called max_year equal to athletes_df with the year column and the max() function; printing max_year shows 2016, the Rio Olympics. Next I create another variable called team_names equal to athletes_df filtered, in brackets, on athletes_df.Year == max_year, which is 2016, then an ampersand and a copy of the condition using the medal column this time, Medal == 'Gold', and I select .Team. Then team_names.value_counts().head(10) displays the top 10 countries. Running this, there you go: at the Rio Olympics the United States secured the most gold medals. The reason the number is 137 is that we have also counted team events, basketball for example. Similarly, Great Britain had 64 gold medals in total, Russia 50, Brazil 34, Argentina 21, France 20 and Japan 17.
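Those steps, sketched:

    # which sports the over-60 golds came from
    sporting_event = gold_medals['Sport'][gold_medals['Age'] > 60]
    plt.figure(figsize=(10, 5))
    plt.tight_layout()
    sns.countplot(x=sporting_event)
    plt.title('Gold medals for athletes over 60 years of age')

    # gold medals per country (region)
    total_gold_medals = gold_medals.Region.value_counts().reset_index(name='Medal').head()
    print(total_gold_medals)

    # golds at the latest Games in the data (2016 Rio)
    max_year = athletes_df.Year.max()
    team_names = athletes_df[(athletes_df.Year == max_year) &
                             (athletes_df.Medal == 'Gold')].Team
    print(team_names.value_counts().head(10))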
All right, now using the above result we create a bar plot. I change team_list to team_names, and make the same correction in the second place it appears. Running it, you can see a nice horizontal bar plot with the United States at the top, then Great Britain, Russia, Germany, China, Brazil; I'm displaying the top 20 nations here because I used head(20), so these are the 20 countries that secured the most gold medals at the Rio Olympics. Now, in the final section of this video, we create a scatter plot to visualize the height and weight of the male and female athletes who have won a medal, whether a gold, silver or bronze one. But before that we need to filter the data only to athletes who won a medal: as you noticed earlier, our medals column has a lot of null values, and we are not going to consider those. I create a variable called not_null_medals equal to athletes_df (let me make that df consistent) with the condition that the height column is not null, using the notnull() function, then an ampersand and another parenthesized condition where this time we consider the weight column, again with notnull(). I close the parenthesis; I had missed one s there. Cool, let me run it. Now the final plot: plt.figure() with a figure size of 12 by 10, then axis equal to sns.scatterplot(), plotting the height on the x axis and the weight on the y axis, with data set to not_null_medals, the variable we just created, and hue set to my sex column. Finally plt.title('Height vs Weight of Olympic Medalists'). This might take some time to give the result, because there are so many medal-winning athletes in the data set, and we are also splitting by hue for male and female. There you go: scrolling down, a nice scatter plot, with the title height versus weight of Olympic medalists on top. The sex column is the hue, so the blue points are the male athletes and the orange points are the female athletes; the y axis shows the weight in kilograms and the x axis the height in centimeters. So that brings us to the end of this demo session on Olympics data set analysis: we used two data sets to carry out this exploratory data analysis, and we created some visualizations to analyze and understand the data.
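A sketch of that last filter and plot; note I've included a medal filter alongside the height and weight checks, since the narration says we only want medal winners, even though it was not spelled out in the typed condition:

    # medal winners with a known height and weight
    not_null_medals = athletes_df[(athletes_df['Height'].notnull()) &
                                  (athletes_df['Weight'].notnull()) &
                                  (athletes_df['Medal'].notnull())]

    plt.figure(figsize=(12, 10))
    sns.scatterplot(x='Height', y='Weight', data=not_null_medals, hue='Sex')
    plt.title('Height vs Weight of Olympic Medalists')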
Hello and welcome to data analytics interview questions. My name is Richard Kirschner, with the Simplilearn team, that's www.simplylearn.com, get certified, get ahead. Today we're going to jump into some common questions you might see on NumPy arrays, pandas data frames and Python, along with some Excel, Tableau and SQL. Let's start with our first question: what is the difference between data mining and data profiling? It's really important to note that data mining is a process of finding relevant information which has not been found before; it is the way in which raw data is turned into valuable information. Think of it as anything from sales stats pulled from a SQL server all the way to web scraping and census bureau information: where do you mine it from, where do you get all this data? Data profiling, on the other hand, is usually done to assess a data set for its uniqueness, consistency and logic; it cannot identify incorrect or inaccurate data values, so a mistake upstream can quietly feed the wrong data into your later analysis. So be aware: with data mining you need to look at the integrity of what you're bringing in and where it's coming from, and with data profiling you look at the data and ask, how is this going to work, what's the logic, what's the consistency, is it related to what I'm working with? Next, define the term data wrangling in data analytics. Data wrangling is the process of cleaning, structuring and enriching raw data into a desired usable format for better decision making, and you can see a nice chart here: we discover the data, we structure it how we want it, we clean it up and get rid of all those null values, we enrich it, where we might reformat some settings (instead of recording someone's height five different ways, clean that up) or compute a value that brings fields together, and then we validate. I was just talking about that last one: you need to validate your data and make sure you have a solid data source, and then of course it goes into the analysis. Very important to note: about eighty percent of data analytics work is usually in this wrangling part, getting the data to fit correctly. And don't confuse that with "cooking" the data, which is what you do going into neural networks, scaling everything to values between zero and one. Next, what are common problems that data analysts encounter during analysis? Handling duplicate and missing values; collecting the meaningful, right data at the right time; making data secure and dealing with compliance issues; handling data purging and storage problems. Again, we're talking about data wrangling here: eighty percent of most jobs is wrangling that data, getting it into the right format and making sure it's good data to use. Number four, what are the various steps involved in any analytics project? One, understand the problem: we might spend eighty percent of our time wrangling, but you'd better be ready to understand the problem, because otherwise you'll spend all your time heading in the wrong direction; this is probably the most important part of the process, and everything after it follows from it. Two, data collection; three, data cleaning; four, data exploration and analysis; and five, interpret the results. Number five is a close second for being the most important: if you can't interpret what you bring to the table for your clients, you're in trouble. So when this question comes up, focus on those two, noting that eighty percent of the work is in steps two, three and four, while one and five are the most important parts. Next, which technical tools have you used for analysis and presentation purposes? Being a data analyst, you are expected to have knowledge of the tools below for analysis and presentation, and there's a wide variety out there: SQL Server, MySQL, you have your Excel, your SPSS
Number four: what are the various steps involved in any analytics project? One, understand the problem — we might spend eighty percent of our time wrangling, but you'd better understand the problem first, because otherwise you'll spend all that time going in the wrong direction; this is probably the most important part of the process, and everything after it follows from it. Two, data collection. Three, data cleaning. Four, data exploration and analysis. Five, interpret the results — and step five is a close second for most important, because if you can't interpret what you bring to the table for your clients, you're in trouble. So when this question comes up, focus on those two, noting that eighty percent of the work is in steps two, three, and four, while one and five are the most important parts.

Five: which technical tools have you used for analysis and presentation purposes? As a data analyst you're expected to know the common tools for analysis and presentation, and there's a wide variety out there: SQL Server, MySQL, Excel, SPSS (the IBM platform), Tableau, Python. Certainly a lot of jobs narrow in on just a few of these — you're unlikely to run both a Microsoft SQL Server and a MySQL server — but you'd better understand how to do basic SQL pulls, and how to work with Excel, its column formats, and how to set those up.

Number six: what are the best practices for data cleaning? It's really important to go through this in detail; these questions always come up, because eighty percent of most data analysis is cleaning the data. Make a data cleaning plan by understanding where the common errors take place, and keep communications open. Identify and remove duplicates before working with the data; this leads to an effective data analysis process. Focus on the accuracy of the data: maintain the value types, provide mandatory constraints, and set cross-field validation. Standardize the data at the point of entry so that it is less chaotic; you'll be able to ensure that all the information is standardized, leading to fewer errors on entry.

Number seven: how can you handle missing values in a data set? Listwise deletion: the entire record is excluded from analysis if any single value is missing — remember a record could be a single row in a database, so if your SQL query comes back with fifteen columns and one of them is missing a value, you might just drop that row, especially if you already have enough data for the processing. Average imputation: use the average value of the responses from the other participants to fill in the missing value — and they'll ask you why these are useful, I guarantee it. Regression substitution: use multiple regression analysis to estimate the missing value — this goes with average imputation, except the regression model actually generates a prediction of what the value should be, based on the records you do have. Multiple imputation: create plausible values based on the correlations for the missing data, then average across the simulated data sets, incorporating random errors into your predictions.
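A small pandas sketch of the first two strategies — listwise deletion and average imputation — on made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, 40],
                   'score': [88, 92, np.nan, 75]})  # hypothetical data

listwise = df.dropna()                            # listwise deletion: drop any row with a missing value
imputed = df.fillna(df.mean(numeric_only=True))   # average imputation: fill with column means
print(listwise)
print(imputed)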
Number eight: what do you understand by the term normal distribution? The second you hear "normal distribution" you should be thinking of a bell curve like the one shown here. Normal distribution is a type of continuous probability distribution that is symmetric about the mean, and on a graph it appears as a bell curve. The mean, median, and mode are equal — that's a quick check for normality — and all of them are located at the center of the distribution. 68% of the data lies within one standard deviation of the mean, 95% of the data falls within two standard deviations, and 99.7% of the data lies within three standard deviations.

Number nine: what is time series analysis? Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. Here we have time series data on covid-19 cases, spaced by day; graphed, it makes a classic time series chart comparing a couple of different countries, the United States among them. The key point is that the data is order-sensitive: the next value depends on what the last one was, and covid case counts are an excellent example. Word analytics is another: when you're figuring out what someone is saying, what they said before makes a huge difference to what they're going to say next — another form of time series analysis.

Ten: how is joining different from blending in Tableau? Data joining can only be done when the data comes from the same source: combining two tables from the same database, or two or more worksheets from the same Excel file. All the combined tables or sheets contain a common set of dimensions and measures. Data blending is used when the data is from two or more different sources: combining an Oracle table with a SQL Server table, or an Excel sheet with an Oracle table. In data blending, each data source contains its own set of dimensions and measures.

Eleven: how is overfitting different from underfitting? Always a good one — overfitting is probably the biggest danger in data analytics today. In overfitting, the model learns the training data too well, and performance drops significantly on the test set; it happens when the model learns the noise and random fluctuations of the training data in detail. In underfitting, the model neither learns the training data well nor generalizes to new data, performing poorly on both the train and the test set; it happens when there is too little data to build an accurate model, or when we try to fit a linear model to non-linear data.
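A tiny NumPy illustration of that train/test gap, fitting polynomials of increasing degree to noisy data; the data, noise level, and degrees are arbitrary choices for the sketch:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)  # noisy samples
x_test = np.linspace(0.02, 0.98, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 20)

for degree in (1, 3, 15):  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))

The degree-1 fit does badly on both sets (underfitting); the high-degree fit drives training error near zero while test error climbs (overfitting).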
Twelve: in Microsoft Excel, a numeric value can be treated as a text value if it is preceded by an apostrophe — definitely not an exclamation mark, and if you're used to Python, don't reach for the hash sign or an ampersand either. You can see here that if you enter the value 10 into a cell with an apostrophe in front of it, Excel reads it as text, not as a number.

Thirteen: what is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel? When we run COUNT on D1 through D23 we get 19, because there are 19 numbers coming down the column — plain COUNT counts neither the "cost of each" header at the top nor the blank cells. When you do COUNTA you get 20, because COUNTA counts everything non-empty, including the "cost of each" title. COUNTBLANK gives 3, because there are three blank fields. And finally COUNTIF: if we do COUNTIF on E1 to E23 with the condition greater than 10, there are 11 such values. Basic counting of whatever's in your column.

Fourteen: explain how VLOOKUP works in Excel. VLOOKUP is used when you need to find things in a table or a range by row. The syntax has four parts: the lookup value, the value you want to look up; the table array, the range where the lookup value is located; the column index number, the column in the range that contains the return value; and the range lookup — TRUE if you want an approximate match, FALSE for an exact match. Here we see VLOOKUP(F3, A2:C8, 2, 0) for Prince: F3 is the cell that "Prince" is in, A2:C8 is the data we're looking into, and 2 is the column within that data — in this case age, counting name as one and age as two. Keep in mind this is Excel: unlike Python and most programming languages where you start at zero, in Excel the columns count as one, two, three. The 0 at the end means FALSE, an exact match, versus 1 for approximate — with this example either would work. With the Angela lookup, her name is pulled from F4 — that's what the F4 stands for — then the range is A1:C8, and the column is 3, which is height (name one, age two, height three), and it pulls in her height, 5.8.

Now let's jump over to SQL. Fifteen: how do you subset or filter data in SQL? We use the WHERE and HAVING clauses. We have a nice table on the left with title, director, year, and duration, and we want to filter for movies directed by Brad Bird — just because we want to know what Brad Bird did. So we write SELECT * — the star means all columns, so we return title, director, year, and duration — FROM movies, movies being our table, WHERE director = 'Brad Bird', and back come The Incredibles and Ratatouille. For the other way of filtering, consider: filter the table for directors whose movies have an average duration greater than 115 minutes. There are some really nice things in this query — SELECT director, SUM(duration) AS total_duration, AVG(duration) AS average_duration FROM movies GROUP BY director HAVING average_duration > 115. What do we return? Whatever we put in our SELECT: the director, the total duration as the sum of duration, and the average duration. Then we GROUP BY director and keep only the groups having an average duration greater than 115.
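If you want to run those two filters end to end, here's a sqlite3 sketch; the sample rows, including the running times, are invented for illustration:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (title TEXT, director TEXT, year INT, duration INT)')
conn.executemany('INSERT INTO movies VALUES (?, ?, ?, ?)', [
    ('The Incredibles', 'Brad Bird', 2004, 116),
    ('Ratatouille', 'Brad Bird', 2007, 115),
    ('Up', 'Pete Docter', 2009, 96),
])

# WHERE filters individual rows
print(conn.execute("SELECT * FROM movies WHERE director = 'Brad Bird'").fetchall())

# HAVING filters the groups produced by GROUP BY
print(conn.execute("""
    SELECT director, SUM(duration) AS total_duration, AVG(duration) AS average_duration
    FROM movies
    GROUP BY director
    HAVING AVG(duration) > 115
""").fetchall())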
These SQL queries are so important — SQL comes up again and again, and not just MySQL or Microsoft SQL Server; SQL-style query languages appear all over, especially around Hadoop and similar platforms. So you really should know your basic SQL, and it doesn't hurt to keep a little cheat sheet handy to double-check the different features.

Sixteen: what is the difference between the WHERE and HAVING clauses in SQL? WHERE works on row data: the filter occurs before any groupings are made, and aggregate functions cannot be used. The syntax is SELECT column_names FROM table WHERE condition. HAVING works on aggregated data: it filters values from a group, and aggregate functions can be used. The syntax is SELECT column_names FROM table WHERE condition GROUP BY column_names HAVING condition ORDER BY column_names.

Seventeen: what is the correct syntax for the reshape function in NumPy? Jumping to NumPy arrays: it's numpy.reshape(array, new_shape) — and since a lot of the time you do import numpy as np, that's np.reshape. In the example we reshape a into (2, 5), and the printout shows two rows with five values in each.

Eighteen: what are the different ways to create a DataFrame in pandas? One way is by initializing a list: import pandas as pd (very common), set data to [['tom', 30], ['jerry', 20], ['angela', 35]], then create the DataFrame with pd.DataFrame(data, columns=['name', 'age']) — so you designate your columns. You can also pass an index; always remember the index, because maybe you want it to be the date someone signed up rather than 0, 1, 2. That generates a nice pandas DataFrame with tom, jerry, and angela. Another way is to initialize the DataFrame from a dictionary: here we have a dictionary with the names tom, jerry, angela, and mary and the ages 20, 21, 19, and 18, and pd.DataFrame(data) gives you the same kind of result with name and age columns.

Nineteen: write the Python code to create an employees DataFrame from the emp.csv file and display its head and summary. To create the DataFrame you import the pandas library and use the read_csv function to load the file: import pandas as pd, then employees = pd.read_csv() with the path to that csv file. There are a number of settings in read_csv — which row is the header, the index column, skiprows, all kinds of things you can double-check in the documentation — but the most basic call just reads the file.

Twenty: how will you select the department and age columns from an employees DataFrame? We import pandas as pd and create our employees DataFrame on the left; then on the right, to select department and age, we write employees with brackets around the column names. If you're selecting just one column you can use a single set of brackets with 'department', but for multiple columns you need a second set of brackets — it has to be a list inside the selection brackets.
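Here are those DataFrame patterns in one runnable sketch; the names and ages are the ones from the slides, and the read_csv call is commented out since the emp.csv path is hypothetical:

import pandas as pd

# from a list of lists
df1 = pd.DataFrame([['tom', 30], ['jerry', 20], ['angela', 35]],
                   columns=['name', 'age'])

# from a dictionary
df2 = pd.DataFrame({'name': ['tom', 'jerry', 'angela', 'mary'],
                    'age': [20, 21, 19, 18]})

# reading a csv and inspecting it:
# employees = pd.read_csv('emp.csv')
# print(employees.head()); print(employees.describe())

print(df2['age'])             # single column -> a Series
print(df2[['name', 'age']])   # a list of columns -> a DataFrame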
Twenty-one: what are the criteria to say whether a developed data model is good or not? A good model should be intuitive, insightful, and self-explanatory — follow the old saying, KISS, keep it simple. The model developed should be easily consumed by clients for actionable and profitable results; if they can't read it, what good is it? A good model should easily adapt to changes according to business requirements — we live in quite a dynamic world nowadays, so that's pretty self-evident — and if the data gets updated, the model should scale accordingly to the new data, so you have a nice data pipeline where new data coming in doesn't force you to rewrite all the code.

Twenty-two: what is the significance of exploratory data analysis? Exploratory data analysis (EDA) is an important step in any data analysis process. It helps you understand the data better; it helps you gain confidence in your data to the point where you're ready to engage a machine learning algorithm; it allows you to refine your selection of feature variables for later model building; and it lets you discover hidden trends and insights in the data.

Twenty-three: how do you treat outliers in a data set? An outlier is a data point that is distant from the other, similar points; outliers may be due to variability in the measurement or may indicate experimental error. One, you can drop the outlier records — pretty straightforward. Two, you can cap your outlier data so it doesn't go past a certain value, assigning it a new value. Three, you can try a transformation and see whether those points are still outliers when the data is transformed differently.
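A compact pandas/NumPy sketch of those three treatments, on a made-up series with one obvious outlier; the 1.5-IQR fences are one common convention:

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # hypothetical data with one outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

dropped = s[(s >= lower) & (s <= upper)]  # option 1: drop outlier records
capped = s.clip(lower, upper)             # option 2: cap them at the fence values
logged = np.log1p(s)                      # option 3: transform to shrink extremes
print(dropped.tolist(), capped.tolist(), logged.round(2).tolist())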
Twenty-four: explain descriptive, predictive, and prescriptive analytics. Descriptive analytics provides insight into the past, answering "what has happened"; it uses data aggregation and data mining techniques. Example: an ice cream company can analyze how much ice cream was sold, which flavors sold, and whether more or less ice cream was sold than before. Predictive analytics looks into the future, answering "what could happen"; it uses statistical models and forecasting techniques. Example: predicting the sale of ice cream during summer, spring, and rainy days. This is always interesting, because businesses constantly want the descriptive answers — did we have good sales last quarter, what are we expecting next quarter — and there's a huge jump when we get to prescriptive analytics, which suggests various courses of action to answer "what should you do"; it uses optimization and simulation algorithms to advise on possible outcomes. Example: lowering prices to increase sales of ice cream, or producing more or less of certain flavors. The covid data we saw in the earlier graph maps onto all three: descriptive — what has happened, how many people have been infected or died in an area; predictive — do we see it getting worse or better, how many hospital beds do we predict we'll need; and prescriptive — what can we change to get a better outcome, say more social distancing or tracking the virus, and can we create a better ending by changing some underlying criteria?

Twenty-five: what are the different types of sampling techniques used by data analysts? Sampling is a statistical method to select a subset of data from an entire data set — the population — to estimate the characteristics of the whole population; in statistics we call the full data set the population, a term that goes back to census work. One, simple random sampling: just pick, say, 500 random people in the United States. Two, systematic sampling: pull samples at a fixed interval — records 1, 5, 10, 15, 20 — a very systematic approach. Three, cluster sampling: some things naturally group together; with people, a natural cluster might be a zip code, so you sample whole zip codes. Four, stratified sampling: group by a shared trait, like income — if you're studying poverty, you might first group people into income brackets and then study individuals within each bracket to see what traits they share. And five, judgmental or purposive sampling, where the researcher very carefully selects each member based on their own knowledge of the group.
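A rough pandas sketch of the first four techniques — the population, income values, brackets, and sample sizes are all invented:

import pandas as pd

population = pd.DataFrame({'id': range(1, 101),
                           'income': range(20000, 120000, 1000)})  # hypothetical

simple_random = population.sample(n=10, random_state=1)  # simple random sampling
systematic = population.iloc[::10]                       # systematic: every 10th record
population['zip_cluster'] = population['id'] % 5         # stand-in for a natural cluster
population['bracket'] = pd.qcut(population['income'], 4, labels=False)
stratified = population.groupby('bracket').sample(n=2, random_state=1)  # sample within strata
print(len(simple_random), len(systematic), len(stratified))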
Jumping on to number 26: what are the different types of hypothesis testing? Hypothesis testing is a procedure used by statisticians and scientists to accept or reject statistical hypotheses, and we start with two of them. The null hypothesis states that there is no relation between the predictor and the outcome variables in the population; it is denoted H0. Example: there is no association between a patient's BMI and diabetes. The alternative hypothesis states that there is some relation between the predictor and outcome variables in the population; it is denoted H1. Example: there could be an association between a patient's BMI and diabetes. (BMI is body mass index, if you didn't catch it and you're not medical.)

Twenty-seven: describe univariate, bivariate, and multivariate analysis. Univariate analysis is the simplest form of data analysis, where the data being analyzed contains only one variable — for example, studying the heights of players in the NBA. Because it's so simple, it can be described using central tendencies, dispersion, quartiles, bar charts, histograms, pie charts, and frequency distribution tables. Bivariate analysis involves two variables, to find causes, relationships, and correlations between them — for example, analyzing ice cream sales against the outside temperature. It can be explained using correlation coefficients, linear regression, logistic regression, scatter plots, and box plots. Multivariate analysis involves three or more variables, to understand the relationship of each variable with the others — for example, analyzing revenue based on expenditure: with TV ads, newspaper ads, social media ads, and revenue, we can compare them all together. Multivariate analysis can be performed using multiple regression, factor analysis, classification and regression trees, cluster analysis, principal component analysis, clustering, bar charts, and dual-axis charts.

Twenty-eight: what function would you use to get the current date and time in Excel? You can use the TODAY and NOW functions — you can see the two examples here, just =TODAY() or =NOW().

Twenty-nine: using the SUMIFS function in Excel, find the total quantity sold by sales representatives whose names start with A, where the cost of each item they sold is greater than 10. On the left we have the table, and the SUMIFS formula basically says: take everything in the quantity column and sum it up, but only for the rows where the rep's name starts with A and the cost-of-each column is greater than 10.

Thirty: is the below query correct? If not, how will you rectify it? SELECT customer_id, YEAR(order_date) AS order_year FROM order WHERE order_year >= 2016. Hopefully you caught it — the devil's in the details. We cannot use an alias name while filtering data in the WHERE clause, so the correct form is the same query except that the WHERE filters on YEAR(order_date) >= 2016 rather than on the order_year alias we assigned in the SELECT.
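To see the corrected pattern run, here's a sqlite3 sketch — sqlite has no YEAR function, so strftime extracts the year instead, and the orders rows are made up:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (customer_id INT, order_date TEXT)')
conn.executemany('INSERT INTO orders VALUES (?, ?)',
                 [(1, '2015-06-01'), (2, '2016-03-15'), (3, '2017-09-30')])

# filter on the expression itself, not on the alias defined in SELECT
rows = conn.execute("""
    SELECT customer_id, strftime('%Y', order_date) AS order_year
    FROM orders
    WHERE CAST(strftime('%Y', order_date) AS INT) >= 2016
""").fetchall()
print(rows)  # [(2, '2016'), (3, '2017')]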
Thirty-one: how are UNION, INTERSECT, and EXCEPT used in SQL? The UNION operator combines the results of two or more SELECT statements: SELECT * FROM region1 UNION SELECT * FROM region2 takes both tables and combines them into one full result — everything brought together. The INTERSECT operator returns the common records from two or more SELECT statements: SELECT * FROM region1 INTERSECT SELECT * FROM region2 comes back with only the records that are shared, the rows with the same data in both. And the EXCEPT operator returns the uncommon records — the records that are not shared between the two result sets.

Thirty-two: using the product price table, write a SQL query to find the record with the fourth-highest market price. Here we have a little brain teaser — they're always fun. Looking at the script on the left: first we select the TOP 4 from product_price ordered by market price descending, which gives us the four greatest values; then we reverse the order on that subquery, ascending, and take the TOP 1, which gives us the lowest of those four — the fourth-highest value in the table.

Thirty-three: from the product price table, find the total and average market price for each currency, where the average market price is greater than 100 and the currency is INR or AUD — the Indian rupee or the Australian dollar. You can see the query over here, and if you have trouble putting it together, try reading it in reverse: at the end, HAVING average market price greater than 100 — remember we use HAVING, not WHERE, because it applies to the groups; before that, GROUP BY currency; before that, WHERE currency IN ('INR', 'AUD'), because we only want those two currencies; and working back to the start, we SELECT the currency, SUM(market_price) AS total_price, and AVG(market_price) AS average_price FROM product_price, which is just our table.
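Both queries, rewritten for sqlite3 — which uses LIMIT rather than TOP — with invented products and prices:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE product_price (product TEXT, currency TEXT, market_price REAL)')
conn.executemany('INSERT INTO product_price VALUES (?, ?, ?)', [
    ('a', 'INR', 120), ('b', 'INR', 300), ('c', 'AUD', 250),
    ('d', 'AUD', 180), ('e', 'USD', 90),
])

# fourth-highest price: take the four highest, then the smallest of those
print(conn.execute("""
    SELECT product, market_price FROM (
        SELECT product, market_price FROM product_price
        ORDER BY market_price DESC LIMIT 4)
    ORDER BY market_price ASC LIMIT 1
""").fetchone())

# total and average market price per currency, filtered with IN and HAVING
print(conn.execute("""
    SELECT currency, SUM(market_price) AS total_price, AVG(market_price) AS average_price
    FROM product_price
    WHERE currency IN ('INR', 'AUD')
    GROUP BY currency
    HAVING AVG(market_price) > 100
""").fetchall())

In engines that support OFFSET, the brain teaser also collapses to ORDER BY market_price DESC LIMIT 1 OFFSET 3, but the nested TOP-style query is the classic interview answer.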
Thirty-four: this question tests your knowledge of Tableau — exploring its different features and creating a suitable graph to solve a business problem. Of course, Tableau is very visual, so it's hard to test without getting hands-on; if you can't visualize these steps, go back and refresh yourself. Using the sample superstore data set, create a view to analyze the sales, profits, and quantities sold across the different subcategories of items under each category. Step one is to load the sample superstore data set — make sure you know how to do that; it's under the Connect pane or the Tableau icon in the upper left. Then drag category and subcategory onto Rows and sales onto Columns, which results in a horizontal bar chart; drag profit onto Color and quantity onto Label; and sort the sales axis in descending order of sum of sales within each subcategory. If you're following along at home, you'll see that chairs, under the furniture category, had the highest sales and profit while tables had the lowest profit; in the office supplies subcategories, binders made the highest profit even though storage had the highest sales; and under the technology category, copiers made the highest profit despite having the least sales.

Thirty-five: create a dual-axis chart in Tableau to present sales and profits across different years using the sample superstore data set. Load the orders sheet, drag the order date field from Dimensions onto Columns, and convert it into continuous month. Drag sales onto Rows, and drag profit to the right edge of the view until you see a light green rectangle — this is one of those things where, if you haven't done it hands-on, you'll just be dropping it somewhere and wondering what happened. Synchronize the right axis by right-clicking on the profit axis, then finish under the Marks card: change SUM(Sales) to bar and SUM(Profit) to line, and adjust the size. Then you have a nice display you can print out, or save and send off to the shareholders.

Thirty-six: one more Tableau one — design a view in Tableau to show statewide sales and profits using the sample superstore data set. Drag the country field onto the view section and expand it to see the states; drag the states field onto Size and profit onto Color; increase the size of the bubbles and add a border and a halo color. States like Washington, California, and New York have the highest sales and profits, while Texas, Pennsylvania, and Ohio have a good amount of sales but the least profit.

Now let's skip back to Python and NumPy. Thirty-seven: suppose there is an array, num = np.array(...) — np or numpy depending on how you imported it — holding one through nine broken into three rows. Extract the value eight using 2D indexing. On the left we have import numpy as np and the array, and printing it shows one through nine. The value eight sits in the third row and second column, which is index row 2, column 1 — remember we're in Python, so you start at zero, not one like in Excel; that flip always gets me when I'm working between Excel and Python, and usually it's the Excel side that messes up, since I do a lot more programming. So we pass those index positions to the array, num[2, 1], and we get eight.

Thirty-eight: suppose there's an array with the values 0 through 9 — how will you display the values 1, 3, 5, 7, 9? First we create the array with np.arange(10), which runs from 0 to 9 — ten numbers, not including the 10 — and print it. What do 1, 3, 5, 7, 9 have in common? Divided by two, they leave a remainder of one, and in Python the percent sign gives you the remainder. So we apply the logical condition — all values whose remainder is one — to the array, and that generates our nice 1, 3, 5, 7, 9.

Thirty-nine: there are two arrays, a and b; stack the arrays a and b horizontally. Boy, these horizontal-versus-vertical questions will get you every time. We've created two arrays, a and b, and the first answer is np.concatenate((a, b), axis=1), which is the same as np.hstack((a, b)) — on the back end they're identical and run the same; hstack is just concatenate with axis equal to one.

Forty: how can you add a column to a pandas DataFrame? Suppose there's an emp DataFrame with information about a few employees, and we want to add an address column. You should know your DataFrames very well — a DataFrame basically looks like an Excel spreadsheet — and it's really simple: df['address'] = address, once you've assigned the values to address.
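All four of those one-liners in a single runnable sketch; the emp names and addresses are made up:

import numpy as np
import pandas as pd

num = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(num[2, 1])            # row index 2, column index 1 -> 8 (zero-based)

arr = np.arange(10)
print(arr[arr % 2 == 1])    # boolean mask -> [1 3 5 7 9]

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.hstack((a, b)))                  # same result as
print(np.concatenate((a, b), axis=1))

emp = pd.DataFrame({'name': ['tom', 'jerry']})   # hypothetical employees
emp['address'] = ['12 oak st', '9 elm st']       # adding a column is one assignment
print(emp)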
Forty-one: using the given data, create a pivot table to find the total sales made by each sales representative for each item, and display the sales as a percentage of the grand total. We're back in Excel for this one. Select the entire table range, click on the Insert tab, and choose PivotTable; select the table range and the worksheet where you want to place the pivot table, and it returns a pivot table where you can analyze your data. Drag the sale total onto Values and the sales rep and item onto Row Labels; that gives the sum of the sales made by each representative for each item they sold. Finally, right-click on the sum of sale total, expand "Show Values As", and select "% of Grand Total". It's really important just to understand what a pivot table is: we're pivoting the rows and columns, switching direction, and at the end we have our final pivot table with the values, rows, and sum of total sale.

Forty-two: some more SQL — using the product and sales order detail tables, find the products whose total units sold are greater than 1.5 million. We have a product table and a sales order detail table, two separate tables in the database, and we put together the query: SELECT pp.name, SUM(sod.unitprice) AS sales, pp.productid FROM production.product AS pp INNER JOIN sales.salesorderdetail AS sod ON pp.productid = sod.productid GROUP BY pp.name, pp.productid HAVING SUM(sod.unitprice) > 1500000. That's a mouthful, and again, these SQL queries look crazy until you break them apart and do them step by step. What they're really after here is the INNER JOIN and the GROUP BY — this comes up so much in SQL: how do you pull the id from one table and the information from another table, with sum totals on top?

Forty-three: how do you write a stored procedure in SQL? Let's create a stored procedure to find the sum of the squares of the first n natural numbers. The formula is n(n+1)(2n+1)/6. From the command prompt, the commands are: CREATE PROCEDURE squaresum1, declare our variable @n as integer, BEGIN, declare @sum as integer, SET @sum = n*(n+1)*(2*n+1)/6, then print the results — PRINT 'first ' + CAST(@n AS VARCHAR(20)) + ' natural numbers' and PRINT 'sum of the squares is ' + CAST(@sum AS VARCHAR(40)) — and END. For the output, display the sum of the squares of the first four natural numbers: EXECUTE squaresum1 with 4, and it prints that for the first four natural numbers, the sum of squares is thirty.

Forty-four: write a stored procedure to find the total number of even numbers between two user-given numbers. A couple of things to note here. First we create the procedure with two variables, @n1 and @n2, and BEGIN; we declare a @count variable as integer and set it to zero. Then, while n1 is less than n2, BEGIN: if n1 modulo 2 equals zero — divisible by two, an even number — set count equal to count plus one and print 'even number ' plus CAST(n1 AS VARCHAR(10)) plus ', count is ' plus CAST(@count AS VARCHAR(10)); ELSE print 'odd number ' plus CAST(n1 AS VARCHAR(10)). Then we increment n1 by one, so the loop walks all the way from n1 to n2, and at the end we print the total number of even numbers. We executed it to count the even numbers between 30 and 45, and you can see it counts its way up to 8.
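If T-SQL isn't handy, the same two procedures are easy to mirror in plain Python to check the outputs — a sketch, not the interview answer itself:

def sum_of_squares(n):
    # closed form the stored procedure uses: n(n+1)(2n+1)/6
    return n * (n + 1) * (2 * n + 1) // 6

print(sum_of_squares(4))  # 30, matching the procedure's output

def count_evens(n1, n2):
    count = 0
    while n1 <= n2:
        if n1 % 2 == 0:   # divisible by two: an even number
            count += 1
        n1 += 1
    return count

print(count_evens(30, 45))  # 8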
Forty-five: what is the difference between tree maps and heat maps in Tableau? If you've worked in Python or other environments, you probably already know what a heat map is. Tree maps are used to display data in nested rectangles: you use dimensions to define the structure of the tree map and measures to define the size or color of the individual rectangles. Tree maps are a relatively simple data visualization that can provide insight in a visually attractive format — you can see the nested blocks here, each carrying its own information. A heat map helps to visualize measures against dimensions with the help of colors and sizes, comparing one or more dimensions and up to two measures; the layout is similar to a text table, with variations in values encoded as colors, so you can quickly take in a wide array of information. In this example the color denotes one thing and the size of each square denotes something else — sometimes you'll even see this extended into a three-dimensional graph with additional data — but again, a heat map is the color plus the size.

Forty-six: using the sample superstore data set, display the top five and bottom five customers based on their profit. Start by dragging the customer name field onto Rows and profit onto Columns. Right-click on the customer name column to create a set; give the set a name and select the Top tab to choose the top 5 customers by sum of profit. Similarly, create a set for the bottom five customers by sum of profit. Select both sets, right-click to create a combined set, give it a name, and choose all members in both sets. Then drag the combined top-and-bottom customer set onto Filters and the profit field onto Color to get the desired result.

As we get down to the end of the list, we'll try to keep you on your toes and skip back to NumPy. Forty-seven: how do you print four random integers between 1 and 15 using NumPy? To generate random numbers in NumPy we use the random.randint function: import numpy as np, then np.random.randint(1, 15, 4).

Forty-eight: from the below DataFrame — jumping again, now into pandas — how will you find the unique values for each column, and subset the data for age less than 35 and height greater than 6? To find the unique values and the number of unique elements, use the unique and nunique functions: df['height'].unique() — selecting just the height column — returns an array of the distinct values, while nunique on height or age returns just the number of unique values. Then, for the subset, we take a slice of the original DataFrame — remember this doesn't change the original — with new_df = df[(df['age'] < 35) & (df['height'] > 6)], the DataFrame filtered where age is less than 35 and height is greater than six.
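Here's a runnable version of those three answers, with invented names and measurements:

import numpy as np
import pandas as pd

print(np.random.randint(1, 15, 4))  # four random integers (upper bound exclusive)

df = pd.DataFrame({'name': ['tom', 'jerry', 'angela', 'mary'],
                   'age': [32, 28, 40, 28],
                   'height': [6.1, 5.9, 6.3, 5.5]})  # hypothetical values

print(df['height'].unique())   # array of distinct values
print(df['age'].nunique())     # number of distinct values -> 3

new_df = df[(df['age'] < 35) & (df['height'] > 6)]  # age under 35, height over 6
print(new_df)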
Forty-nine: in case you're not using Tableau, which has a lot of its own mapping and charting built in, make sure you understand the basics of matplotlib. Plot a sine graph using NumPy and matplotlib in Python. The way we did it: generate an array of x values, then y = np.sin(x) — printing x shows the whole set of values. We import matplotlib.pyplot as plt, and if you're working in a Jupyter notebook, remember %matplotlib inline — that little percent-sign magic renders the plot on the page. Newer versions of Jupyter Notebook and JupyterLab do this automatically, but I usually include it anyway in case I end up on an older version. Print y and you can see the sine values for each x; then it's simply plt.plot(x, y) followed by plt.show().

And before we go, let's get one more in — pandas. Fifty: using the below pandas DataFrame, find the company with the highest average sales, derive the summary statistics for the sales column, and transpose those statistics. That's a mouthful, so like any of these problems, break it apart. First, we're looking for the highest average sales, so group the company column and use the mean function: by_company = df.groupby('company'). Then, using the describe function, we can look at the summary statistics: by_company.describe() — and you could bundle those together in one line if you wanted. You get a nice breakout; whatever package you're using — Tableau, pandas in Python, even R — being able to quickly describe your data is very important. Finally, apply a transpose over the describe output. All we've done is flip the index with the column names, but often it's easier to follow the numbers across one line — and there are all kinds of other reasons to do that too.
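A self-contained sketch of both answers — the sine plot and the groupby summary — with a made-up sales table:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)  # the exact x range is an arbitrary choice
y = np.sin(x)
plt.plot(x, y)
plt.title('sine graph')
plt.show()

df = pd.DataFrame({'company': ['a', 'a', 'b', 'b'],
                   'sales': [100, 140, 90, 110]})   # hypothetical sales data
print(df.groupby('company')['sales'].mean())        # highest average sales
print(df.groupby('company')['sales'].describe().T)  # summary statistics, transposed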
And with that, we have come to the end of this video tutorial on Data Analytics Full Course 2022 by Simplilearn. I hope it was helpful and informative. Thank you, and keep learning.