PySpark Course: Big Data Handling with Python and Apache Spark

Video Statistics and Information

Captions
hello and welcome to the PySpark course. I'm going to add chapters to this course, so if you want to skip some lectures or rewatch some of them, you can find the timestamps easily in the description of the video. I created a pretty simple data set for this course so you can learn the concepts better and more easily, and I'm going to show the data creation code in every lecture so you can get used to the syntax of PySpark. Now we can start.

PySpark is a Python library built on top of Apache Spark, an open source distributed computing framework. It allows us to process vast amounts of data in parallel across a cluster of machines. This makes PySpark an ideal choice for big data applications where traditional data processing tools may fall short due to scalability issues.

Let's talk about some advantages of PySpark. One of the key advantages of PySpark is its ability to perform distributed computing: instead of relying on a single machine, PySpark distributes data across a cluster, allowing us to process large data sets efficiently. This parallel processing significantly reduces computation time. PySpark stores data in memory, which means that intermediate results can be cached and reused across multiple computations; this in-memory processing minimizes the need to read data from disk repeatedly, leading to faster data access and manipulation. PySpark also provides extensive APIs and libraries for data manipulation, machine learning, graph processing and more, and it integrates seamlessly with other Python libraries like pandas and NumPy, enabling data scientists to leverage familiar tools.

Let's consider an example use case where we need to analyze customer data from a massive e-commerce platform. Traditional data processing techniques might struggle to handle the huge volumes of data, but PySpark comes to the rescue: we can easily distribute the data across multiple nodes in the cluster and process it simultaneously, making our analysis lightning fast.

Now we will get our hands dirty and set up our development environment. To start our PySpark journey, we will be using VS Code as our code editor and install PySpark to leverage its powerful capabilities for big data processing. First, let's install VS Code if you haven't done so already. VS Code is a lightweight, feature-rich code editor that supports Python development and integrates seamlessly with PySpark. To install it, open your web browser and navigate to code.visualstudio.com, then click the download button to get the VS Code installer for your operating system. Once the download is complete, run the installer and follow the on-screen instructions.

Next, we need to ensure that Python is installed on our system. If you already have Python installed, you can skip this step; otherwise, open your web browser, navigate to python.org, and download the latest stable version of Python for your operating system. Run the installer and make sure to check the "Add Python to PATH" box during the installation.

With Python in place, we can now install PySpark. PySpark requires both Java and Apache Spark, but don't worry, the installation process is straightforward. Open your terminal and run pip install pyspark (I'm going to use pip3 since I'm running this on Python 3). After the download finishes, we can use PySpark.
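As a quick sanity check after the installation, a minimal sketch like the one below (assuming the pip install completed successfully) confirms that PySpark can be imported and shows which version is on the machine.

    import pyspark

    # If the import succeeds, the installation worked; this prints the installed version.
    print(pyspark.__version__)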
We just installed PySpark. Now we will dive into two fundamental components of PySpark: SparkContext and SparkSession. These components play a crucial role in enabling distributed data processing and providing a high-level API for interacting with data.

Let's start with SparkContext. SparkContext is the entry point to any Spark functionality. It represents the connection to a Spark cluster and serves as the client-side driver program, and it enables your application to access Spark's distributed computation capabilities. For this we write from pyspark import SparkContext, and then we create one with SparkContext("local", "PySpark intro"). In this code we import the SparkContext class from the pyspark module, then we create a SparkContext named "PySpark intro" with the master URL set to "local", indicating that we are running Spark in local mode for development and testing.

Next, let's explore SparkSession, a more recent addition to PySpark that provides a single entry point to interact with various Spark functionalities. It includes the features of both SparkContext and SQLContext, making it the recommended way to work with PySpark. For that we write from pyspark.sql import SparkSession, and then spark = SparkSession.builder.appName("PySpark intro").getOrCreate(). In this code we import the SparkSession class from the pyspark.sql module and use the SparkSession builder to configure the session, setting the application name to "PySpark intro". The getOrCreate method ensures that if a SparkSession already exists it will be reused; otherwise a new SparkSession will be created.

Now let's briefly discuss the use cases for SparkContext and SparkSession. Use SparkContext when you need fine-grained control over Spark configurations and lower-level RDD (resilient distributed dataset) operations. However, for most high-level data processing tasks you will rely on SparkSession: it simplifies the process of working with structured data by providing the DataFrame and Dataset APIs, and it's the recommended entry point for most PySpark applications, especially for data scientists working with structured data and machine learning tasks.
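Here is a minimal sketch of the two entry points just described; the application name "PySpark intro" and the local master come from the lecture, and the sketch assumes no other Spark context is already running in the same process.

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    # Low-level entry point: connect to a local Spark "cluster" for development and testing
    sc = SparkContext("local", "PySpark intro")

    # High-level entry point: getOrCreate reuses an existing session if one is already active
    spark = SparkSession.builder.appName("PySpark intro").getOrCreate()

    print(sc.master, spark.sparkContext.appName)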
Now we'll explore one of the foundational concepts of PySpark: RDDs. RDD means resilient distributed dataset, and from now on I'm just going to say RDD. RDDs form the core of Spark's distributed data processing capabilities, enabling us to work with large data sets efficiently. We will also delve into RDD transformations and actions, which are essential for processing data in a distributed manner. RDDs are immutable, distributed collections of data elements that can be processed in parallel across a cluster of machines. They are fault tolerant, meaning that if any partition of an RDD is lost, Spark can recreate it automatically using lineage information.

Let's create an RDD from a list: we say data = [1, 2, 3, 4, 5] and then rdd = sc.parallelize(data), using the SparkContext (sc) we created earlier. In this code we create an RDD called rdd from a Python list, data, using the parallelize method.

Now we are going to talk about RDD transformations. Transformations are operations that create a new RDD from an existing one. Transformations are lazy, meaning they are not executed immediately; instead, their execution plan is recorded. Some common RDD transformations include map, filter, flatMap and reduceByKey. Let's say squared_rdd = rdd.map(lambda x: x ** 2); we just mapped each element to its square. Now let's do something similar: even_rdd = rdd.filter(lambda x: x % 2 == 0); we filtered the even elements. In these examples we use the map transformation to create a new RDD, squared_rdd, where each element of the original RDD is squared, and the filter transformation to create a new RDD, even_rdd, containing only the even elements from the original RDD.

Now we are going to talk about RDD actions. RDD actions are operations that trigger the execution of transformations and return results to the driver program or write data to external storage. Actions are eagerly evaluated and initiate the computation on the RDD. Some common RDD actions include collect, count, reduce and saveAsTextFile. Let's use collect and count: collected_data = squared_rdd.collect() collects all elements to the driver program, and number_of_elements = squared_rdd.count() counts the number of elements in the RDD. In this code we use the collect action to retrieve all elements from the squared RDD and bring them back to the driver program as a Python list, and the count action to determine the total number of elements in the squared RDD.
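A minimal sketch of the RDD operations from this lecture; SparkContext.getOrCreate is used with an explicit local configuration (the app name "RDD basics" is an assumption) so the sketch also runs if a context already exists from the previous step.

    from pyspark import SparkConf, SparkContext

    sc = SparkContext.getOrCreate(SparkConf().setMaster("local").setAppName("RDD basics"))

    # Create an RDD from a Python list
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)

    # Transformations are lazy: they only record the execution plan
    squared_rdd = rdd.map(lambda x: x ** 2)      # square each element
    even_rdd = rdd.filter(lambda x: x % 2 == 0)  # keep only the even elements

    # Actions trigger the actual computation
    collected_data = squared_rdd.collect()       # [1, 4, 9, 16, 25]
    number_of_elements = squared_rdd.count()     # 5
    print(collected_data, number_of_elements, even_rdd.collect())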
Now we will dive into the DataFrame and Dataset APIs, higher-level abstractions built on top of RDDs that enable us to perform data manipulation and analysis with ease. These APIs provide a more structured way to work with data, making data science tasks in PySpark more intuitive and efficient.

Let's first talk about the Dataset API. The Dataset API combines the benefits of both the DataFrame and RDD APIs: it provides a type-safe, object-oriented programming interface with the performance optimizations of the DataFrame API. Strongly typed Datasets are available in Scala and Java; in Python we work with DataFrames, which play the same role here. From pyspark.sql.types we import StructType, StructField, StringType and IntegerType, and we create a schema: a StructType containing a StructField "name" of StringType (nullable, so True), a StructField "age" of IntegerType (True), and a StructField "salary" of IntegerType (True). Now we insert some data: ("Alice", 28, 45000), another entry ("Bob", 36, 60000), and one more, ("Cathy", 23, 35000). Then we say ds = spark.createDataFrame(data, schema), give it an alias with ds.alias("employees"), and look at the data set we created with ds.show(). In this code example we define a schema using the StructType and StructField classes to specify the column names and data types, then we create the data set ds from a list of tuples, each tuple representing a row of data, and we use the alias method to give the data set the alias "employees".

Now we are going to talk about the DataFrame API. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a spreadsheet with rows and columns. The DataFrame API in PySpark provides a more natural and efficient way to handle structured data compared to RDDs. To use it we can set the path to a CSV file, for example data_file = "path/to/data.csv", and read it with df = spark.read.csv(data_file, header=True, inferSchema=True). In this way we can make Spark read our CSV files. In the code example we create a DataFrame, df, by reading data from a CSV file using the read.csv method; the header=True option indicates that the first row of the CSV file contains the column names, and the inferSchema=True option instructs Spark to automatically infer the data type of each column. Normally I was thinking of adding some CSV files and JSON data to this course, but I'm not going to add any external data set, since I want to keep things as simple as possible for you to understand. I will show the related code, and you can easily apply it to your own data set in the way I show.
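A minimal sketch of the schema, DataFrame creation and CSV reading described above; the CSV line is left as a comment because the path is only a placeholder.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.master("local").appName("PySpark intro").getOrCreate()

    # Explicit schema: column names, data types and nullability
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("salary", IntegerType(), True),
    ])

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]

    ds = spark.createDataFrame(data, schema).alias("employees")
    ds.show()

    # Reading structured data from a CSV file (placeholder path):
    # df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)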
Now we are going to explore working with both structured and semi-structured data in PySpark. We will leverage the data set we created in the previous lectures to demonstrate techniques for handling different data formats; here is the data set code we used, and we run it to see our data once again. Structured data refers to data that can be organized in a tabular format with rows and columns. We have been using structured data in the form of DataFrames: we create a DataFrame, df, using spark.createDataFrame, specifying the data and the column names. This represents structured data, and we can easily work with it using DataFrame operations, as we have seen in previous lectures.

Now we are going to talk about semi-structured data. Semi-structured data doesn't have a fixed schema like structured data; instead it has a flexible structure that can include nested data and arrays. JSON and XML are common examples of semi-structured data. I'm going to show you the necessary code for reading them, but I'm not going to do any other operations with them; once again, I'm trying to keep it simple. We set a path such as json_data_file = "path/to/data.json" and read it with df_json = spark.read.json(json_data_file). You can read semi-structured JSON data into a DataFrame this way, and PySpark infers the schema based on the JSON structure. Semi-structured data is valuable for handling complex and nested data formats. Let me also show you the XML format: we set xml_data_file = "path/to/data.xml" and read it with df_xml = spark.read.format("xml").option("rowTag", "employee").load(xml_data_file), where "employee" is the row tag. This is the code for reading an XML file into a DataFrame; note that the XML format typically requires the external spark-xml package.

Now we will discuss data cleaning and preprocessing techniques using PySpark. Data cleaning is a crucial step in the data science pipeline, as it ensures that data is accurate, consistent and ready for analysis. We will be working with the data set we created in the previous lectures to demonstrate various techniques for cleaning and preparing data. One of the most common data cleaning tasks is dealing with missing values. Missing values can cause issues during analysis, so it's essential to address them appropriately. Let's recreate our data with missing values and use filling methods on it: data_with_missing contains ("Alice", 28, 45000) again, then Bob with a missing (None) age and his salary, and Cathy, this time leaving the salary empty. We create df_missing = spark.createDataFrame(data_with_missing, ["name", "age", "salary"]). Let's fill the missing values in the age column with the mean of that column: we say mean_age = df_missing.select(avg("age")).collect()[0][0], selecting the age column, aggregating it with the average, and collecting the result. Now we need to apply it, so we say df_cleaned = df_missing.na.fill(mean_age, subset=["age"]). When we check df_cleaned.show(), we see that the age column has been filled with its mean value. In this code example we create a DataFrame, df_missing, from the data set that contains missing values, and we use the na.fill method to replace the missing age values with the mean age calculated from the non-missing values; the same approach can be applied to the salary column.
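A minimal sketch of the reading and mean-imputation steps above; the JSON/XML paths are placeholders (and the XML reader assumes the spark-xml package is available), so those lines are left as comments.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg

    spark = SparkSession.builder.master("local").appName("Data cleaning").getOrCreate()

    # Semi-structured data (placeholder paths):
    # df_json = spark.read.json("path/to/data.json")
    # df_xml = (spark.read.format("xml")            # needs the spark-xml package
    #           .option("rowTag", "employee")
    #           .load("path/to/data.xml"))

    # Employee data with missing values (None)
    data_with_missing = [("Alice", 28, 45000), ("Bob", None, 60000), ("Cathy", 23, None)]
    df_missing = spark.createDataFrame(data_with_missing, ["name", "age", "salary"])

    # Mean of the non-missing ages, then fill the gaps in the age column with it
    mean_age = df_missing.select(avg("age")).collect()[0][0]
    df_cleaned = df_missing.na.fill(mean_age, subset=["age"])  # value is cast to the column's type
    df_cleaned.show()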
Feature scaling is another crucial step in data preprocessing, and now we are going to talk about it. It standardizes the range of independent variables, ensuring they have similar scales; common methods include min-max scaling and standardization, which we can also refer to as z-score scaling. We import the necessary modules: from pyspark.ml.feature we import MinMaxScaler, StandardScaler and VectorAssembler. Then we create a DataFrame once again with the original data set, using createDataFrame with the data and the column names name, age and salary.

Let's start the feature scaling. We say assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features"), and we need to do a transformation: data_for_scaling = assembler.transform(df). If we look at the result, we see a new column called features. Let's do min-max scaling: scaler_min_max = MinMaxScaler(inputCol="features", outputCol="scaled_features"), and then scaled_min_max = scaler_min_max.fit(data_for_scaling).transform(data_for_scaling) — fit first, then transform. When we check scaled_min_max.show(), we see the scaled features. Now let me show you standardization, which we can also call z-score scaling: scaler_standardization = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True), and then scaled_standardization = scaler_standardization.fit(data_for_scaling).transform(data_for_scaling). Calling scaled_standardization.show() gives the final version of our scaled features. In this code example we demonstrate both min-max scaling and standardization: we first use a VectorAssembler to combine the age and salary columns into a single feature vector, then we apply MinMaxScaler and StandardScaler to scale the features. That was all for the data cleaning and preprocessing lecture.
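A minimal, self-contained sketch of the scaling pipeline just described; the employee rows are the ones created earlier in the course.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, MinMaxScaler, StandardScaler

    spark = SparkSession.builder.master("local").appName("Feature scaling").getOrCreate()

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]
    df = spark.createDataFrame(data, ["name", "age", "salary"])

    # Combine the numeric columns into a single feature vector
    assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")
    data_for_scaling = assembler.transform(df)

    # Min-max scaling: rescales each feature to the [0, 1] range
    scaler_min_max = MinMaxScaler(inputCol="features", outputCol="scaled_features")
    scaled_min_max = scaler_min_max.fit(data_for_scaling).transform(data_for_scaling)
    scaled_min_max.show(truncate=False)

    # Standardization (z-score scaling): zero mean, unit standard deviation
    scaler_standardization = StandardScaler(inputCol="features", outputCol="scaled_features",
                                            withStd=True, withMean=True)
    scaled_standardization = scaler_standardization.fit(data_for_scaling).transform(data_for_scaling)
    scaled_standardization.show(truncate=False)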
Now we will explore how to perform exploratory data analysis (EDA) using PySpark. Exploratory data analysis is a crucial step in the data science pipeline that allows us to gain insights into our data, identify patterns, and make informed decisions about data preprocessing and modeling. We will be working with the data set we created in the previous lectures to demonstrate various EDA techniques. Before we proceed, let's quickly recap the data set we'll be using: it contains information about employees, including their names, ages and salaries. Here is the data we have; we recreate df with spark.createDataFrame, giving the column names name, age and salary. Exploratory data analysis involves a series of tasks that help us understand the structure and characteristics of our data, so let's perform some common techniques using PySpark.

We start with summary statistics. Summary statistics provide a quick overview of the central tendencies and dispersion of our data. We import the necessary functions, such as mean and stddev from pyspark.sql.functions, and then we say summary_stats = df.describe("age", "salary"), much like in pandas, giving the column names, followed by summary_stats.show(). In the code example we calculate summary statistics for the age and salary columns using the describe method. This provides information such as count, mean, standard deviation, minimum and maximum values for each column; keep in mind that the mean and standard deviation are only meaningful for numeric columns.

Now let's do some data visualization. Data visualization is a powerful technique for understanding data patterns visually; we will use the matplotlib and pandas libraries to visualize our PySpark DataFrame. We start by importing the necessary modules: import matplotlib.pyplot as plt and import pandas as pd. Then we say pd_df = df.toPandas() — it's that easy to convert our PySpark DataFrame to a pandas DataFrame for visualization. Now let's create a scatter plot: plt.scatter(pd_df["age"], pd_df["salary"]), then plt.xlabel("age"), plt.ylabel("salary"), a title with plt.title("Age vs Salary"), and plt.show(). It's not very meaningful with only three entries, but imagine it with a bigger data set; we are just keeping everything simple. Let's also create a histogram to see the distribution: plt.hist(pd_df["age"], bins=10, edgecolor="black"), plt.xlabel("age"), plt.ylabel("frequency"), plt.title("Age distribution"), and display it. In the code example we convert the PySpark DataFrame df to a pandas DataFrame pd_df for visualization, then create a scatter plot to visualize the relationship between age and salary and a histogram to observe the distribution of ages.

Now we are going to talk about correlation analysis. Correlation analysis helps us understand the relationships between different variables. For this we say correlation = df.stat.corr("age", "salary") and print the correlation between age and salary. As a quick reminder, correlation takes values between -1 and +1; if it's close to +1 the variables have a strong positive relationship, which means they move together — if age increases, salary increases with it. In the code example we calculate the correlation between age and salary using the corr method from the DataFrame's stat module.
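A minimal sketch of the EDA steps above; it assumes pandas and matplotlib are installed alongside PySpark, and toPandas is only safe here because the data is tiny.

    from pyspark.sql import SparkSession
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.master("local").appName("EDA").getOrCreate()

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]
    df = spark.createDataFrame(data, ["name", "age", "salary"])

    # Summary statistics (count, mean, stddev, min, max) for the numeric columns
    df.describe("age", "salary").show()

    # Convert to pandas for plotting (fine for small data; avoid on huge DataFrames)
    pd_df = df.toPandas()

    plt.scatter(pd_df["age"], pd_df["salary"])
    plt.xlabel("age"); plt.ylabel("salary"); plt.title("Age vs Salary")
    plt.show()

    plt.hist(pd_df["age"], bins=10, edgecolor="black")
    plt.xlabel("age"); plt.ylabel("frequency"); plt.title("Age distribution")
    plt.show()

    # Correlation between age and salary (values range from -1 to +1)
    correlation = df.stat.corr("age", "salary")
    print("Correlation between age and salary:", correlation)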
Now we will explore the power of PySpark SQL functions to perform various data transformations and manipulations efficiently. PySpark provides a rich set of built-in SQL functions that allow us to process and analyze data seamlessly; they offer a wide array of operations for data transformation and manipulation, similar to those found in SQL. Let's dive into some common transformations and manipulations using PySpark SQL functions.

PySpark SQL functions allow us to transform data in various ways, creating new columns or modifying existing ones. Let's apply an arithmetic operation to create a new column. For this we say from pyspark.sql.functions import col, and then df_transformed = df.withColumn("increased_salary", col("salary") * 1.1) — let's say we are giving a 10 percent raise — and we check it with df_transformed.show(). In this example we can see that the salary is increased by 10 percent: we use the withColumn function to create a new column, increased_salary, in the DataFrame, and we apply an arithmetic operation to increase the salary by 10 percent for each employee.

Now let's use string functions. We say from pyspark.sql.functions import concat, lit, and then df_transformed = df_transformed.withColumn("modified_name", concat(col("name"), lit(" Employee"))). When we run this and show the result, we see that modified_name has been added, like "Alice Employee", "Bob Employee", "Cathy Employee". Here we use the withColumn function again to create a new column, modified_name, by concatenating the name column with the string "Employee".

PySpark SQL functions also enable us to manipulate data based on conditions and perform aggregations. Now we will see a filtering example: df_filtered = df.filter(col("age") > 25), and when we run this we can check it with df_filtered.show(). In this example we use the filter function to retain only the rows where the age column is greater than 25.
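A minimal sketch of the three operations above (arithmetic column, string concatenation, filtering) on the same small employee DataFrame.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat, lit

    spark = SparkSession.builder.master("local").appName("SQL functions").getOrCreate()

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]
    df = spark.createDataFrame(data, ["name", "age", "salary"])

    # Arithmetic transformation: a 10 percent raise in a new column
    df_transformed = df.withColumn("increased_salary", col("salary") * 1.1)

    # String transformation: append " Employee" to each name
    df_transformed = df_transformed.withColumn("modified_name",
                                               concat(col("name"), lit(" Employee")))
    df_transformed.show()

    # Filtering: keep only rows where age > 25
    df_filtered = df.filter(col("age") > 25)
    df_filtered.show()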
Now let's see an example of aggregating data. We say from pyspark.sql.functions import avg, and to get the average salary we write df.agg(avg(col("salary"))).collect()[0][0] and print the average salary. In this example we use the agg function along with avg to calculate the average salary across all employees.

Now we will explore how to aggregate and summarize data using PySpark's groupBy and window operations. These powerful techniques allow us to analyze data at various levels of granularity and perform advanced calculations; we are going to use the same data set we created earlier. Aggregating and summarizing data are essential tasks in data analysis, and PySpark provides two powerful methods to accomplish this: groupBy and window operations. The groupBy operation allows us to group data based on one or more columns and perform aggregations on each group.

Let's start with the first example, grouping and aggregating data. Let's import the necessary modules — we already imported some of them, but I'm going to import the necessary ones in every lecture so you get used to PySpark's syntax — we import avg and max from pyspark.sql.functions. Let's group the data by age and calculate the average and maximum salary for each age group: grouped_data = df.groupBy("age").agg(avg("salary"), max("salary")), followed by grouped_data.show(). In this example we use the groupBy method to group the data by the age column, then we use the agg method along with the avg and max functions to calculate the average and maximum salary for each age group.

The window operation allows us to perform calculations across a window of data, such as moving averages or cumulative sums. For an example of window operations we import Window from pyspark.sql.window and sum from pyspark.sql.functions. We define a window specification: window_spec = Window.orderBy("age"). Let's calculate the cumulative sum of salaries based on the age ordering: df_with_cumulative_sum = df.withColumn("cumulative_salary", sum("salary").over(window_spec)), and then we call show on it. You can see that we get a warning (because the window has no partitioning), but the cumulative sum is there and it works nicely; you can also see that the data is ordered by the age column. Here we use the Window class to define a window specification based on the age ordering, then we use the withColumn method to create a new column, cumulative_salary, containing the cumulative sum of salaries within the specified window.
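A minimal sketch of the aggregation, groupBy and window steps above; the functions module is imported as F here so Spark's sum and max don't shadow Python's built-ins.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local").appName("Aggregations").getOrCreate()

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]
    df = spark.createDataFrame(data, ["name", "age", "salary"])

    # Overall average salary
    average_salary = df.agg(F.avg("salary")).collect()[0][0]
    print("Average salary:", average_salary)

    # Average and maximum salary per age group
    grouped_data = df.groupBy("age").agg(F.avg("salary"), F.max("salary"))
    grouped_data.show()

    # Cumulative sum of salaries ordered by age
    # (Spark warns that the window has no partitioning; fine for this small demo)
    window_spec = Window.orderBy("age")
    df_with_cumulative_sum = df.withColumn("cumulative_salary", F.sum("salary").over(window_spec))
    df_with_cumulative_sum.show()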
Now we will explore how to leverage user-defined functions (UDFs) to perform custom data transformations and computations in PySpark. User-defined functions allow us to extend PySpark's functionality and tailor data processing to specific requirements; we are going to use the same data set for this lecture as well. PySpark allows us to define custom functions using Python and apply them to DataFrame columns to perform specific operations.

For example, let's define a simple user-defined function to add a prefix to the name column: def add_prefix(name): return "Mr. " + name. Now we import udf from pyspark.sql.functions and StringType from pyspark.sql.types, and we say add_prefix_udf = udf(add_prefix, StringType()). We then apply the UDF to the name column: df_with_prefix = df.withColumn("prefixed_name", add_prefix_udf("name")), and we use df_with_prefix.show() to see the new version, with its prefixed_name column. In this example we define a simple UDF, add_prefix, to add the prefix "Mr." to the name column; we register the UDF using the udf function, specifying the return type, then we use the withColumn method to apply it to the name column and create a new column, prefixed_name, with the modified names.

Now let's see another example, using user-defined functions with multiple columns. We define a function, calculate_total_income(age, salary), that returns age + salary — it's kind of meaningless, but let's do it to show how this works. From pyspark.sql.types we import IntegerType and say calculate_total_income_udf = udf(calculate_total_income, IntegerType()). Now we apply the UDF — it's not really a total income, but just treat it like that: df_with_total_income = df.withColumn("total_income", calculate_total_income_udf("age", "salary")), and we check it with df_with_total_income.show(). Here we define a UDF, calculate_total_income, that computes this "total income" from age and salary, we register it with PySpark specifying the return type as IntegerType, and then we use the withColumn method to apply it to the age and salary columns and create a new column, total_income, with the calculated values.
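A minimal sketch of the two UDF examples above, single-column and multi-column, applied to the same employee DataFrame.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, IntegerType

    spark = SparkSession.builder.master("local").appName("UDFs").getOrCreate()

    data = [("Alice", 28, 45000), ("Bob", 36, 60000), ("Cathy", 23, 35000)]
    df = spark.createDataFrame(data, ["name", "age", "salary"])

    # A simple UDF that adds a prefix to a string column
    def add_prefix(name):
        return "Mr. " + name

    add_prefix_udf = udf(add_prefix, StringType())
    df_with_prefix = df.withColumn("prefixed_name", add_prefix_udf("name"))
    df_with_prefix.show()

    # A UDF over multiple columns; the "total income" (age + salary) is meaningless,
    # it only illustrates passing several columns to one UDF
    def calculate_total_income(age, salary):
        return age + salary

    calculate_total_income_udf = udf(calculate_total_income, IntegerType())
    df_with_total_income = df.withColumn("total_income",
                                         calculate_total_income_udf("age", "salary"))
    df_with_total_income.show()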
We are going to start the machine learning section in the next part. I'm not going to give an introduction to the underlying concepts; I'm just going to show you the PySpark syntax for linear and logistic regression and evaluation methods, and that's going to be all for the machine learning part of this course. Machine learning is a type of artificial intelligence that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values — it's not always historical data, but you're basically predicting a feature based on other features.

We are going to start with supervised learning algorithms, a part of machine learning that is widely used for tasks like classification and regression. We will work with a new data set, specifically designed for this lecture, to illustrate the concepts and demonstrate how PySpark's machine learning library and ML packages can be leveraged for these tasks. For this lecture we will work with a data set related to house prices; it contains information about various houses, including features such as the number of bedrooms and bathrooms, the square footage, and the sale price. We'll create the data set programmatically for demonstration purposes. I'm not going to create all the entries one by one; we start with spark = SparkSession.builder.appName("Supervised Learning").getOrCreate(). It reports that an existing SparkSession is being used and only runtime SQL configurations will take effect, which is not important here. I'm going to paste the data we will work with rather than typing it by hand, because this is a crash course and time matters, but we will create the schema together so you understand the structure.

We say schema = StructType([...]) — you already know how to create data in PySpark, we've done it a lot in this course — with a StructField "sqft" of FloatType (one important thing: we didn't import FloatType earlier, so we import it from pyspark.sql.types), then a StructField "bedrooms" of FloatType, a StructField "bathrooms" of FloatType, and the last one, a StructField "price", also FloatType, all nullable (True). By the way, the values are all integers, not floats, but I wanted to show you FloatType as well because it's commonly used in PySpark and in data work generally. Now we create df = spark.createDataFrame(data, schema) — actually this raises errors because the integer values don't match the float types, so I change them to IntegerType, and then it works smoothly. We use df.show() to see our data.

Supervised learning involves training a model on labeled data, where each data point is associated with a target or label. The model learns from the features in the data and the corresponding labels to make predictions on new, unseen data.
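A minimal sketch of the house-price data set creation; the exact rows pasted in the video are not in the captions, so the values below (and the column name "sqft") are made-up placeholders with the same structure.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType

    spark = SparkSession.builder.master("local").appName("Supervised Learning").getOrCreate()

    # Placeholder house rows: (sqft, bedrooms, bathrooms, price)
    data = [(1500, 3, 2, 200000), (2100, 4, 3, 310000), (900, 2, 1, 120000),
            (1750, 3, 2, 240000), (1300, 2, 1, 180000), (2400, 4, 3, 350000),
            (1600, 3, 2, 230000), (1100, 2, 1, 150000)]

    schema = StructType([
        StructField("sqft", IntegerType(), True),
        StructField("bedrooms", IntegerType(), True),
        StructField("bathrooms", IntegerType(), True),
        StructField("price", IntegerType(), True),
    ])

    df = spark.createDataFrame(data, schema)
    df.show()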
Let's start with classification, using logistic regression. In this example the plan is to perform a binary classification task using logistic regression to predict whether a house's price falls within a specific price range, and then we will use linear regression as well. We start by importing the necessary modules: from pyspark.ml.feature we import VectorAssembler, from pyspark.ml.classification we import LogisticRegression, and from pyspark.ml we import Pipeline.

Now we create the feature vector. We say feature_columns = ["sqft", "bedrooms", "bathrooms"], set assembler = VectorAssembler(inputCols=feature_columns, outputCol="features"), and then df_assembled = assembler.transform(df).

Now I'm going to show you how to split your data into training and test sets. This is really important in machine learning, because you need to evaluate your model's performance on data it wasn't trained on: you train on the training data, make predictions, and check whether they are accurate on the test data. On a small data set like this it doesn't matter much, and our model is not going to be very accurate anyway, but on larger data sets you need to evaluate your model's performance this way. For this we say train_data, test_data = df_assembled.randomSplit([0.8, 0.2]), where the first value is the training fraction (0.8 means 80 percent of the data goes into the training set) and 0.2 is the test fraction.

Now we train our logistic regression model: logistic_regression = LogisticRegression(featuresCol="features", labelCol="price"), and model = logistic_regression.fit(train_data). Then we make predictions with predictions = model.transform(test_data) and predictions.select("features", "price", "prediction").

Actually, logistic regression is not the right algorithm for this task; linear regression is better, because we are predicting a continuous variable. For logistic regression the label needs to be binary, like 1 or 0. I just wanted to show you how to train a logistic regression model in PySpark, but logistic regression is a classification algorithm and we are not doing any kind of classification here — because of that, it raises an error, and it is not a good way to forecast or make predictions in this setting. If your labels are binary — for example, 1 if the customer is going to buy and 0 if not — logistic regression works nicely, but in this case it's not a useful method.
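The video fits logistic regression directly on the continuous price column, which fails as described. The sketch below keeps the same syntax but derives a hypothetical binary label (price above an arbitrary 225,000 threshold) so the example runs; the threshold, the seed, and the placeholder rows are all assumptions, not something from the video.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local").appName("Supervised Learning").getOrCreate()

    # Same made-up placeholder rows as in the previous sketch
    data = [(1500, 3, 2, 200000), (2100, 4, 3, 310000), (900, 2, 1, 120000),
            (1750, 3, 2, 240000), (1300, 2, 1, 180000), (2400, 4, 3, 350000),
            (1600, 3, 2, 230000), (1100, 2, 1, 150000)]
    df = spark.createDataFrame(data, ["sqft", "bedrooms", "bathrooms", "price"])

    # Assemble the feature vector
    assembler = VectorAssembler(inputCols=["sqft", "bedrooms", "bathrooms"], outputCol="features")
    df_assembled = assembler.transform(df)

    # Logistic regression needs a categorical label, so derive a binary one:
    # 1 if the price is above an arbitrary threshold, else 0
    df_labeled = df_assembled.withColumn("label", (col("price") > 225000).cast("integer"))

    # 80/20 train/test split so the model can be checked on unseen rows
    train_data, test_data = df_labeled.randomSplit([0.8, 0.2], seed=42)

    logistic_regression = LogisticRegression(featuresCol="features", labelCol="label")
    model = logistic_regression.fit(train_data)

    predictions = model.transform(test_data)
    predictions.select("features", "price", "label", "prediction").show()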
Linear regression is the right tool for this task, so now we are going to apply linear regression, the proper way to predict this value. Let's import it: from pyspark.ml.regression import LinearRegression. We say linear_regression = LinearRegression(featuresCol="features", labelCol="price"), then model = linear_regression.fit(train_data), predictions = model.transform(test_data), and predictions.select("features", "price", "prediction").show(). Here is our result: there are no errors this time, because we are using the right type of algorithm. If you are predicting continuous variables, regression is the best option.

In this supervised learning lecture on classification and regression, we explored supervised learning algorithms using PySpark's machine learning library and ML packages. Don't forget: if you have a binary label, like 1 and 0, use logistic regression, but if you have continuous values in your label, use regression. In this lecture we worked with a data set related to house prices and performed classification using logistic regression — which wasn't the right approach here — and regression using linear regression. Supervised learning is a fundamental concept in machine learning and is widely used for various predictive tasks.
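A minimal sketch of the linear regression step, assuming train_data and test_data (with the assembled features column and the price column) from the previous sketch are still in scope; here the continuous price itself is the label.

    from pyspark.ml.regression import LinearRegression

    # Regression on the continuous price label
    linear_regression = LinearRegression(featuresCol="features", labelCol="price")
    model = linear_regression.fit(train_data)

    predictions = model.transform(test_data)
    predictions.select("features", "price", "prediction").show()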
Now we will explore unsupervised learning algorithms, which are used to identify patterns and relationships in data without labeled target variables. We will work with a new data set, specifically designed for this lecture, to illustrate the concepts and demonstrate how PySpark's machine learning library and ML packages can be leveraged for these tasks. For this lecture we will work with a customer segmentation data set; it contains information about customers, including their spending behavior on different product categories, and we create it programmatically for demonstration purposes. For this data set creation we could also create a new SparkSession, but it's not necessary and it's not going to change anything, so I'm just going to paste the related data here and run it.

Now we create the schema together: schema = StructType([...]) with a StructField "customer_id" of StringType (True), a StructField "spending_electronics" of IntegerType (True), a StructField "spending_fashion" of IntegerType (True), and finally a StructField "spending_grocery" of IntegerType (True). Then we say df = spark.createDataFrame(data, schema) and display the DataFrame with df.show().

Now we can talk about unsupervised learning. Unsupervised learning algorithms aim to discover patterns and structures in data without using any labeled target variable. Remember the label we had before — the price was the label in the house data, because we were trying to predict it — but in this new data set there is no such label; think of this as grouping. Two common unsupervised learning tasks are clustering and dimensionality reduction.

We start with clustering, using the most popular algorithm, k-means. In this example we will perform customer segmentation using the k-means clustering algorithm to group customers based on their spending behavior. It's a distance-based algorithm, but I'm not going to talk about how it works internally; I'm just going to show the syntax. We import KMeans from pyspark.ml.clustering. We create the feature vector using VectorAssembler (I'm not going to import it again): feature_columns = ["spending_electronics", "spending_fashion", "spending_grocery"], assembler = VectorAssembler(inputCols=feature_columns, outputCol="features"), and df_assembled = assembler.transform(df). Let's train the k-means model: kmeans = KMeans(featuresCol="features", k=...), where k is the number of clusters to form, then model = kmeans.fit(df_assembled), and we make predictions with predictions = model.transform(df_assembled). You will remember that we had training and test data in supervised learning, but that's unnecessary here: since there are no labels, we can't evaluate the model's performance against them. Let's display the predictions we made: predictions.select("customer_id", "features", "prediction").show(). You can see that we grouped customers based on their purchasing behavior, into groups like 2, 1 and 0, and we can make decisions based on this information — let's brainstorm about it: if the customers in group 2 purchase more electronics and are more likely to buy in that category, we can show them more ads in that area.

Now we are going to talk about dimensionality reduction with principal component analysis (PCA). We will reduce the dimensionality of the data using PCA to identify the most important features. For this we say from pyspark.ml.feature import PCA. In a small data set like this, with 8 entries, it's really unnecessary, but I'm just showing you the syntax. We say pca = PCA(k=2, inputCol="features", outputCol="pca_features"), then model = pca.fit(df_assembled) and df_pca = model.transform(df_assembled). Let's display the PCA-transformed data: df_pca.select("customer_id", "pca_features").show(truncate=False), and we can see our principal component analysis features.
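A minimal sketch of the clustering and PCA steps above. The customer rows pasted in the video are not in the captions, so the values below are made-up placeholders, and k=3 is an assumption (the value used in the video isn't captured either).

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, PCA
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.master("local").appName("Unsupervised Learning").getOrCreate()

    # Placeholder customer rows: (customer_id, electronics, fashion, grocery spending)
    data = [("C1", 800, 100, 200), ("C2", 50, 700, 300), ("C3", 60, 650, 280),
            ("C4", 900, 120, 150), ("C5", 70, 80, 900), ("C6", 100, 90, 850),
            ("C7", 750, 150, 180), ("C8", 80, 600, 320)]
    df = spark.createDataFrame(
        data, ["customer_id", "spending_electronics", "spending_fashion", "spending_grocery"])

    # Assemble the three spending columns into one feature vector
    feature_columns = ["spending_electronics", "spending_fashion", "spending_grocery"]
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    df_assembled = assembler.transform(df)

    # K-means clustering; k is the number of clusters to form (3 assumed here)
    kmeans = KMeans(featuresCol="features", k=3, seed=42)
    model = kmeans.fit(df_assembled)
    predictions = model.transform(df_assembled)
    predictions.select("customer_id", "features", "prediction").show()

    # PCA: project the three spending features down to two principal components
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(df_assembled)
    df_pca = pca_model.transform(df_assembled)
    df_pca.select("customer_id", "pca_features").show(truncate=False)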
In this unsupervised learning lecture we explored unsupervised learning algorithms using PySpark's machine learning library and ML packages. We performed customer segmentation using k-means clustering and reduced the dimensionality of the data using principal component analysis. Unsupervised learning is a powerful technique for finding hidden structures in data without relying on labeled target variables.

That was all for the PySpark course. Thanks for watching. I create content about data science and programming, so if you are interested in these topics and want to see more content like this, you can subscribe to my channel. I also have courses about Excel, Python and Python libraries, including Polars, which is another great big data handling library written in Rust, as well as data visualization courses such as a Plotly course and a Matplotlib course. If you are interested in data science or data-related topics, I think you'll find good videos on my channel. Thanks for watching this course; I hope you enjoyed it and found it informative. Have a great day.
Info
Channel: Onur Baltacı
Views: 17,659
Keywords: Data Science, Data Analysis, Data Engineering, Machine Learning, Deep Learning, SQL, Python, Data, Spark, Apache Spark, Big Data
Id: jWZ9K1agm5Y
Length: 67min 43sec (4063 seconds)
Published: Fri Aug 04 2023