Machine Learning Tutorial: From Beginner to Advanced

Video Statistics and Information

Captions
Today we'll talk about machine learning. We'll focus on what it is and why you'd want to use it. Machine learning teaches computers to do what comes naturally to humans: learn from experience. It's great for complex problems involving a large amount of data with lots of variables, but no existing formula or equation that describes the system. Some common scenarios where machine learning applies include when a system is too complex for handwritten rules, as in face and speech recognition; when the rules of a task are constantly changing, as in fraud detection; and when the nature of the data itself keeps changing, as in automated trading, energy demand forecasting, and predicting shopping trends. Machine learning uses two types of techniques: unsupervised learning, which finds hidden patterns in input data, and supervised learning, which trains a model on known input and output data so that it can predict future outputs. Unsupervised learning draws inferences from data sets that don't have labeled responses associated with the input data. Clustering is the most common unsupervised learning technique. It puts data into different groups based on shared characteristics in the data. Clustering is used to find hidden groupings in applications such as gene sequence analysis, market research, and object recognition, among many others. On the other hand, supervised learning requires each example of the input data to come with a correctly labeled output. It uses this labeled data, along with classification and regression techniques, to develop predictive models. Classification techniques predict discrete responses, like whether an email is genuine or spam; essentially, these models classify input data into a predetermined set of categories. Regression techniques predict continuous responses, like what temperature a thermostat should be set to, or fluctuations in electricity demand. Again, the big difference between supervised learning and unsupervised learning is that supervised learning requires correctly labeled examples to train the machine learning model, and then uses that model to label new data. Keep in mind, the techniques you use and the algorithms you select depend on the size and type of data you're working with, the insights you want to get from the data, and how those insights will be used. We'll talk more about these techniques in the next few videos. For now, that is a very brief overview of machine learning. Be sure to check out the description for more information.

Unsupervised machine learning looks for patterns in data sets that don't have labeled responses. You'd use this technique when you want to explore your data but don't yet have a specific goal, or when you're not sure what information the data contains. It's also a good way to reduce the dimension of your data. As we've previously discussed, most unsupervised learning techniques are a form of cluster analysis, which separates data into groups based on shared characteristics. Clustering algorithms fall into two broad groups: hard clustering, where each data point belongs to only one cluster, and soft clustering, where each data point can belong to more than one cluster. For context, here's a hard clustering example. Say you're an engineer building cell phone towers. You need to decide where and how many towers to construct, and to make sure you're providing the best signal reception, you need to locate the towers within clusters of people. To start, you need an initial guess at the number of clusters. To do this, compare scenarios with three towers and four towers to see how well each is able to provide service. Because a phone can only talk to one tower at a time, this is a hard clustering problem. For this, you could use k-means clustering. The k-means algorithm treats each observation in the data as an object having a location in space. It finds cluster centers, or means, that reduce the total distance from data points to their cluster centers. So that was hard clustering.
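As a rough illustration of the hard clustering example above (not part of the video; the synthetic user locations and variable names are assumptions, and the Statistics and Machine Learning Toolbox is assumed to be available), a minimal MATLAB sketch comparing a three-tower and a four-tower scenario with kmeans might look like this:

% Minimal k-means sketch for the cell-tower example (synthetic data, illustrative only).
rng(1);                                                   % reproducible random numbers
userLocations = [randn(100,2); randn(100,2) + 5; randn(100,2) + [5 -5]];  % fake user positions

for k = [3 4]                                             % compare three towers vs. four towers
    [idx, towers, sumd] = kmeans(userLocations, k);       % hard assignment: one cluster per user
    fprintf('k = %d, total within-cluster distance = %.1f\n', k, sum(sumd));
end

% Inspect the last result: cluster membership and proposed tower locations.
gscatter(userLocations(:,1), userLocations(:,2), idx);
hold on
plot(towers(:,1), towers(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
hold off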
Let's see how you might use a soft clustering algorithm in the real world. Pretend you're a biologist analyzing the genes involved in normal and abnormal cell division. You have data from two tissue samples, and you want to compare them to determine whether certain patterns of gene features correlate to cancer. Because the same genes can be involved in several biological processes, no single gene is likely to belong to one cluster only. Apply a fuzzy c-means algorithm to the data, and then visualize the clusters to see which groups of genes behave in similar ways. You can then use this model to help see which features correlate with normal or abnormal cell division.

This covers the two main techniques, hard and soft clustering, for exploring data with unlabeled responses. Remember, though, that you can also use unsupervised machine learning to reduce the number of features, or the dimensionality, of your data. You do this to make your data less complex, especially if you're working with data that has hundreds or thousands of variables. By reducing the complexity of your data, you're able to focus on the important features and gain better insights. Let's look at three common dimensionality reduction algorithms. Principal component analysis, or PCA, performs a linear transformation on the data so that most of the variance in your data set is captured by the first few principal components. This could be useful for developing condition indicators for machine health monitoring. Factor analysis identifies underlying correlations between variables in your data set. It provides a representation of unobserved latent, or common, factors. Factor analysis is sometimes used to explain stock price variation. Non-negative matrix factorization is used when model terms must represent non-negative quantities, such as physical quantities. If you need to compare a lot of text on webpages or documents, this would be a good method to start with, as text is either not present or occurs a positive number of times.
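To make the dimensionality reduction idea concrete, here is a small MATLAB sketch (my own illustration, not from the video) that runs PCA on a made-up feature matrix and keeps enough components to explain 95% of the variance; the data and the 95% threshold are assumptions.

% Illustrative PCA sketch on synthetic, correlated data (assumed 95% variance threshold).
rng(2);
X = randn(200, 50) * randn(50, 50);                  % fake data set with 50 correlated variables

[coeff, score, ~, ~, explained] = pca(X);            % principal components and variance explained (%)

numComponents = find(cumsum(explained) >= 95, 1);    % smallest number of components covering 95%
Xreduced = score(:, 1:numComponents);                % lower-dimensional representation of the data

fprintf('Kept %d of %d components.\n', numComponents, size(X, 2));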
In this video, we took a closer look at hard and soft clustering algorithms, and we also showed why you'd want to use unsupervised machine learning to reduce the number of features in your data set. As for your next steps: unsupervised learning might be your end goal. If you're just looking to segment data, a clustering algorithm is an appropriate choice. On the other hand, you might want to use unsupervised learning as a dimensionality reduction step for supervised learning. In our next video, we'll take a closer look at supervised learning. For now, that wraps up this video; don't forget to check out the description below for more resources and links.

A supervised learning algorithm takes in both a known set of input data and corresponding output data. It then trains a model to map inputs to outputs so it can predict the response to any new set of input data. As we've previously discussed, all supervised learning techniques take the form of either classification or regression. Classification techniques predict discrete responses. Use these techniques if the outputs you want to predict can be separated into different groups. Examples of classification problems include medical imaging, speech recognition, and credit scoring. Regression techniques, on the other hand, predict continuous responses. A good example of this is any application where the output you are predicting can take any value in a certain range, like stock prices and acoustic signal processing. Now let's say you have a classification problem you're trying to solve. Let's take a brief look at just a few classification algorithms you could use. The logistic regression algorithm is one of the simplest. It is used with binary classification problems, meaning problems where there are only two possible outputs. It works best when the data can be well separated by a single linear boundary. You can also use it as a baseline for comparison against more complex classification methods. Bagged and boosted decision trees combine individual decision trees, which have less predictive power, into an ensemble of many trees, which has greater predictive power. They are best used when predictors are discrete or behave nonlinearly, and when you have more time to train a model. Keep in mind there are many other classification algorithms; these are just two of the most common.
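As a rough sketch of how you might try these two classifiers in MATLAB (not shown in the video; the data, variable names, and evaluation choices are assumptions), the following compares a logistic regression baseline against a bagged tree ensemble:

% Illustrative comparison of a logistic-regression baseline and bagged trees (synthetic data).
rng(3);
X = randn(500, 6);                                       % fake predictor matrix
y = double(X(:,1) + X(:,2).^2 + 0.5*randn(500,1) > 1);   % fake binary labels (0 or 1)

% Logistic regression baseline (generalized linear model with a binomial response).
logitModel = fitglm(X, y, 'Distribution', 'binomial');

% Bagged decision trees (more flexible ensemble, usually slower to train).
bagModel = fitcensemble(X, y, 'Method', 'Bag');

% 5-fold cross-validated misclassification rate for the ensemble,
% and a simple resubstitution check for the baseline.
cvBag = crossval(bagModel, 'KFold', 5);
fprintf('Bagged trees CV error: %.3f\n', kfoldLoss(cvBag));

probs = predict(logitModel, X);                          % predicted probabilities from the baseline
fprintf('Logistic baseline training error: %.3f\n', mean((probs > 0.5) ~= y));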
There are plenty of algorithms to choose from if you have a regression problem as well. Linear regression is a statistical modeling technique. Use it when you need an algorithm that is easy to interpret and fast to fit, or as a baseline for evaluating other, more complex regression models. Nonlinear regression helps describe more complex relationships in data. Use it when data has strong nonlinear trends and cannot be easily transformed into a linear space. Again, these are just two common regression algorithms you can choose from; there are many more you might want to consider.

Now let's put it all together and see how this process might look in the real world. Say you're an engineer at a plastic production plant. The plant's nine hundred workers operate 24 hours a day, 365 days a year. To make sure you catch machine failures before they happen, you need to develop a health monitoring and predictive maintenance application that uses advanced machine learning algorithms to classify potential issues. After collecting, cleaning, and logging data from the machines in the plant, your team evaluates several classification techniques. For each technique, the team trains a classification model using the machine data and then tests the model's ability to predict if a machine is about to have a problem. The tests show that an ensemble of bagged decision trees is the most accurate, so that's what your team moves forward with when developing the predictive maintenance application.

In addition to trying different types of models, there are many ways to further increase your model's predictive power. Let's briefly talk about just three of these methods. The first is feature selection, where you identify the most relevant inputs from the data that provide the best predictive power. Remember, a model can only be as good as the features you use to train it. Second, feature transformation is a form of dimensionality reduction, which we discussed in the previous video along with the three most commonly used techniques. With feature transformation, you reduce the complexity of your data, which can make it much easier to represent and analyze. Hyperparameter tuning is a third way to increase your model's accuracy. It is an iterative process where your goal is to find the best possible settings for how to train the model. You retrain your model many times using different settings until you discover the combination of settings that results in the most accurate model. So that's a quick look at supervised learning. In our next video, we're going to take a deeper look at an example machine learning workflow. Until then, be sure to check out the description below for more machine learning resources. Thanks for watching.
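To illustrate the feature selection idea just mentioned (and the neighborhood component analysis tool the series brings up again later), here is a minimal MATLAB sketch of my own; the synthetic data, the regularization value, and the weight threshold are all assumptions, not values from the video.

% Illustrative feature selection with neighborhood component analysis (synthetic data).
rng(8);
X = randn(200, 8);                                   % fake data: 8 candidate features
y = double(X(:,1) - X(:,2) + 0.5*X(:,3) > 0);        % labels depend only on the first three features

ncaModel = fscnca(X, y, 'Lambda', 0.1);              % learns one relevance weight per feature
w = ncaModel.FeatureWeights;

selected = find(w > 0.1 * max(w))                    % keep features whose weight is not negligible
% Features with very low weights are candidates to drop before retraining the model.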
With machine learning, there is rarely a straight line from start to finish. You'll find yourself trying different ideas and approaches. Today we'll walk through a machine learning workflow step by step, and we'll focus on a few key decision points along the way. Every machine learning workflow begins with three questions: What kind of data are you working with? What insights do you want to get from it? And how and where will those insights be applied? The example in this video is based on a cell phone health monitoring app. The input consists of sensor data from the phone's accelerometer and gyroscope, and the responses are the activities performed, such as walking, standing, running, climbing stairs, or lying down. We want to use the sensor data to train a classification model to identify these activities.

Now let's step through each part of the workflow to see how we can get our fitness app working. We'll start with data from the sensors in the phone. A flat file format such as text or CSV is easy to work with and makes importing data straightforward. Now we import all that data into MATLAB and plot each labeled set to get a feel for what's in the data. To pre-process the data, we look for missing data or outliers. In this case, we might also look at using signal processing techniques to remove the low-frequency gravitational effects; that would help the algorithm focus on the movement of the subject, not the orientation of the phone. Finally, we divide the data into two sets: we save part of the data for testing and use the rest to build the models.

Feature engineering is one of the most important parts of machine learning. It turns raw data into information that a machine learning algorithm can use. For the activity tracker, we want to extract features that capture the frequency content of the accelerometer data. These features will help the algorithm distinguish between walking, which is low frequency, and running, which is high frequency. We create a new table that includes the selected features. The number of features that you could derive is limited only by your imagination; however, there are a lot of commonly used features for different types of data.

Now it's time to build and train the model. It's a good idea to start with something simple, like a basic decision tree. This will run fast and be easy to interpret. To see how well it performs, we look at the confusion matrix, a table that compares the classifications made by the model with the actual class labels. The confusion matrix shows that our model is having trouble distinguishing between dancing and running. Maybe a decision tree doesn't work well for this type of data. We'll try something else. Let's try a multiclass support vector machine, or SVM. With this method, we now get 99% accuracy, which is a big improvement. We achieved our goal by iterating on the model and trying different algorithms. However, it's rarely this simple. If our classifier still couldn't reliably differentiate between dancing and running, we'd look into other ways to improve the model.

Improving a model can take two different directions: make the model simpler to avoid overfitting, or add complexity in order to improve accuracy. A good model only includes the features with the most predictive power, so to simplify the model, we should first try to reduce the number of features. Sometimes we look at ways to reduce the model itself. We can do this by pruning branches from a decision tree or removing learners from an ensemble. If our model still can't tell the difference between running and dancing, it may be due to overgeneralizing. So to fine-tune our model, we can add additional features. In our example, the gyroscope records the orientation of the cell phone during activity. This data might provide unique signatures for the different activities; for example, there might be a combination of acceleration and rotation that's unique to running. Now that we've adjusted our model, we can validate its performance against the test data we set aside in pre-processing. If the model can reliably classify the activities, we're ready to move it to the phone and start tracking. So that wraps up our machine learning example and our overview video series about machine learning. For more information, check out the links below. In our next series, we're going to look at some advanced topics related to machine learning, such as feature engineering and hyperparameter tuning.
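To sketch the build-and-iterate step of the workflow above in MATLAB (an illustration, not the video's actual code; the feature table, labels, and split are made up), you might compare a simple decision tree with a multiclass SVM and look at their confusion matrices like this:

% Illustrative model iteration: decision tree vs. multiclass SVM (synthetic features and labels).
rng(5);
features = randn(600, 10);                                          % fake extracted features
activity = categorical(randi(5, 600, 1), 1:5, ...
    {'walking', 'standing', 'running', 'climbing', 'lying'});       % fake activity labels

cv = cvpartition(activity, 'HoldOut', 0.3);                         % train/test split
Xtrain = features(training(cv), :);  ytrain = activity(training(cv));
Xtest  = features(test(cv), :);      ytest  = activity(test(cv));

treeModel = fitctree(Xtrain, ytrain);                               % simple, fast, interpretable
svmModel  = fitcecoc(Xtrain, ytrain);                               % multiclass SVM (binary SVMs combined)

treePred = predict(treeModel, Xtest);
svmPred  = predict(svmModel, Xtest);
disp(confusionmat(ytest, treePred));                                % where does each model go wrong?
disp(confusionmat(ytest, svmPred));
fprintf('Tree accuracy: %.2f, SVM accuracy: %.2f\n', ...
    mean(treePred == ytest), mean(svmPred == ytest));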
Machine learning algorithms don't always work so well on raw data. Part of our job as engineers and scientists is to transform the raw data to make the behavior of the system more obvious to the machine learning algorithm. This is called feature engineering. Feature engineering starts with your best guess about what features might influence the thing you're trying to predict. After that, it's an iterative process where you create new features, add them to your model, and see if the results improve.

Let's take a simple example where we want to predict whether a flight is going to be delayed or not. In the raw data, we have information such as the month of the flight, the destination, and the day of the week. If I fit a decision tree just to this data, I'll get an accuracy of 70%. What else could we calculate from this data that might help improve our predictions? Well, how about the number of flights per day? There are more flights on some days than others, which may mean they're more likely to be delayed. I already have this feature from my data set in the app, so let's add it and retrain the model. You can see the model accuracy improved to 74 percent. Not bad for just adding a feature.

Feature engineering is often referred to as a creative process, more of an art than a science. There's no correct way to do it, but if you have domain expertise and a solid understanding of the data, you'll be in a good position to perform feature engineering. As you'll see later, techniques used for feature engineering are things you may already be familiar with, but you might not have thought about them in this context before. Let's see another example that's a bit more interesting. Here we're trying to predict whether a heart is behaving normally or abnormally by classifying the sounds it makes. The sounds come in the form of audio signals. Rather than training on the raw signals, we can engineer features and then use those values to train a model. Recently, deep learning approaches have become popular, as they require less manual feature engineering; instead, the features are learned as part of the training process. While this has often shown promising results, deep learning models require more data, take longer to train, and the resulting model is typically less interpretable than if you were to manually engineer the features. The features we use to classify heart sounds come from the signal processing field. We calculated things such as skewness, kurtosis, and dominant frequencies. These calculations extract characteristics that make it easier for the model to distinguish between an abnormal heart sound and a normal one.

So what other features do people use? Many use traditional statistical techniques like mean, median, and mode, as well as basic things like counting the number of times something happens. Lots of data has a timestamp associated with it, and there are a number of features you can extract from a timestamp that might improve model performance: What was the month, the day of the week, or the hour of the day? Was it a weekend or a holiday? Such features play a big role in determining human behavior, for example if you were trying to predict how much electricity people use. Another class of feature engineering has to do with text data. Counting the number of times certain words occur in a text is one technique, which is often combined with normalization techniques like term frequency-inverse document frequency. Word2vec, in which words are converted to a high-dimensional vector representation, is another popular feature engineering technique for text. The last class of techniques I'll talk about has to do with images. Images contain lots of information, so you often need to extract the important parts. Traditional techniques calculate the histogram of colors or apply transforms such as the Haar wavelet. More recently, researchers have started using convolutional neural networks to extract features from images. Depending on the type of data you're working with, it may make sense to use a variety of the techniques we've discussed. Feature engineering is a trial-and-error process: the only way to know if a feature is any good is to add it to a model and check if it improves the results. To wrap up, that was a brief explanation of feature engineering. We have many more examples on our site, so check them out.
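As a small illustration of the signal-based features mentioned above (my own sketch, not the video's code; the synthetic signal and sample rate are assumptions), here is how you might compute skewness, kurtosis, and a dominant frequency for one segment in MATLAB:

% Illustrative signal features for one audio segment (synthetic signal).
fs = 2000;                                    % assumed sample rate in Hz
t  = (0:1/fs:2)';                             % two seconds of data
signal = sin(2*pi*40*t) + 0.3*randn(size(t)); % fake recording: 40 Hz tone plus noise

featSkewness = skewness(signal);              % asymmetry of the amplitude distribution
featKurtosis = kurtosis(signal);              % tail weight / "peakedness"

% Dominant frequency from the magnitude spectrum.
N = numel(signal);
spectrum = abs(fft(signal));
freqs = (0:N-1)' * fs / N;
[~, idx] = max(spectrum(2:floor(N/2)));       % ignore the DC component
dominantFreq = freqs(idx + 1);

featureRow = table(featSkewness, featKurtosis, dominantFreq)   % one row of the feature table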
ROC curves are an important tool for assessing classification models. They're also a bit abstract, so let's start by reviewing some simpler ways to assess models. Let's use an example that has to do with the sounds a heart makes. Given 71 different features from an audio recording of a heart, we try to classify whether the heart sounds normal or abnormal. One of the easiest metrics to understand is the accuracy of a model, or in other words, how often it is correct. The accuracy is useful because it's a single number, which makes comparisons easy. The classifier I'm looking at right now has an accuracy of 86.3%. What the accuracy doesn't tell you is how the model was right or wrong. For that, there's the confusion matrix, which shows things such as the true positive rate. In this case it is 74 percent, meaning the classifier correctly predicted abnormal heart sounds 74 percent of the time. We also have the false positive rate of 9 percent; this is the rate at which the classifier predicted abnormal when the heart sound was actually normal. The confusion matrix gives results for a single model, but most machine learning models don't just classify things; they actually calculate probabilities. The confusion matrix for this model shows the result of classifying anything with a probability of greater than or equal to 0.5 as abnormal, and anything with a probability of less than 0.5 as normal. But that 0.5 doesn't have to be fixed, and in fact we could threshold anywhere in the range of probabilities between 0 and 1. That's where ROC curves come in. The ROC curve plots the true positive rate versus the false positive rate for different values of this threshold. Let's look at this in more detail.

Here's my model, and I'll run it on my test data to get the probability of an abnormal heart sound. Now let's start by thresholding the probabilities at 0.5. If I do that, I get a true positive rate of 74 percent and a false positive rate of 9 percent. But what if we wanted to be very conservative, so that even if the probability of a heart being abnormal was just 10%, we would classify it as abnormal? If we do that, we get this point. What if we wanted to be really certain and only classify sounds with a 90% probability as being abnormal? Then we'd get this point, which has a much lower false positive rate but also a lower true positive rate. Now, if we were to create a bunch of values for this threshold between 0 and 1, say 1,000 trials evenly spaced, we would get lots of these ROC points, and that's where we get the ROC curve from. The ROC curve shows us the trade-off between the true positive rate and the false positive rate for varying values of that threshold. There will always be a point on the ROC curve at (0, 0), where in our case everything is classified as normal, and there will always be a point at (1, 1), where everything is classified as abnormal. The area under the curve is a metric for how good our classifier is. A perfect classifier would have an AUC of 1; in this example, the AUC is 0.926. In MATLAB, you don't need to do all of this by hand like I've done here; you can get the ROC curve and the AUC from the perfcurve function.

Now that we have that down, let's look at some interesting cases for an ROC curve. If a curve is all the way up and to the left, you have a classifier that, for some threshold, perfectly labeled every point in the test data, and your AUC is one. You either have a really good classifier, or you may want to be concerned that you don't have enough data or that your classifier is overfit. If a curve is a straight line from the bottom left to the top right, you have a classifier that does no better than a random guess; its AUC is 0.5. You may want to try some other types of models, or go back to your training data to see if you can engineer some better features. If a curve looks kind of jagged, that is sometimes due to the behavior of different types of classifiers. For example, a decision tree only has a finite number of decision nodes, and each of those nodes has a specific probability; the jaggedness comes from when the threshold value we talked about earlier crosses the probability at one of the nodes. Jaggedness also commonly comes from gaps in the test data. As you can see from these examples, ROC curves can be a simple yet nuanced tool for assessing classifier performance. If you want to learn more about machine learning model assessment, check out the links in the description below.
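Since the video points to the perfcurve function, here is a minimal MATLAB sketch of using it (illustrative only; the labels and scores are made up rather than coming from the heart-sound model):

% Illustrative ROC curve and AUC with perfcurve (synthetic scores and labels).
rng(6);
labels = repmat({'normal'}, 200, 1);                  % fake true classes
labels(rand(200, 1) > 0.5) = {'abnormal'};
scores = rand(200, 1);                                % fake predicted probability of 'abnormal'

% True positive rate (tpr) vs. false positive rate (fpr) over all thresholds, plus the AUC.
[fpr, tpr, thresholds, auc] = perfcurve(labels, scores, 'abnormal');

plot(fpr, tpr);
xlabel('False positive rate');
ylabel('True positive rate');
title(sprintf('ROC curve, AUC = %.3f', auc));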
Machine learning is all about fitting models to data. The models consist of parameters, and we find the values for those through the fitting process. This process typically involves some type of iterative algorithm that minimizes the model error. That algorithm has parameters that control how it works, and those are what we call hyperparameters. In deep learning, we also call the parameters that determine the layer characteristics hyperparameters. Today we'll be talking about techniques for both. So why do we care about hyperparameters? Well, it turns out that most machine learning problems are non-convex. This means that, depending on the values we select for the hyperparameters, we might get a completely different model. By changing the values of the hyperparameters, we can find different, and hopefully better, models.

OK, so we know we have hyperparameters and we know we want to tweak them, but how do we do that? Some hyperparameters are continuous, some are binary, and others might take on any number of discrete values. This makes for a tough optimization problem. It is almost always impossible to run an exhaustive search of the hyperparameter space, since it takes too long. So, traditionally, engineers and researchers have used techniques for hyperparameter optimization like grid search and random search. In this example, I'm using a grid search method to vary two hyperparameters, box constraint and kernel scale, for an SVM model. As you can see, the error of the resulting model is different for different values of the hyperparameters. After 100 trials, the search has found 12.8 and 2.6 to be the most promising values for the hyperparameters. Recently, random search has become more popular than grid search. How could that be, you might be asking; wouldn't grid search do a better job of evenly exploring the hyperparameter space? Let's imagine you have two hyperparameters, A and B. Your model is very sensitive to A but not sensitive to B. If we did a 3-by-3 grid search, we would only ever evaluate three different values of A. But if we did a random search, we would probably get nine different values of A, even though some of them may be close together. As a result, we have a much better chance of finding a good value for A. In machine learning we often have many hyperparameters; some have a big influence over the results and some don't, so random search is typically a better choice.

Grid search and random search are nice because it's easy to understand what's going on. However, they still require many function evaluations. They also don't take advantage of the fact that, as we evaluate more and more combinations of hyperparameters, we learn how those values affect our results. For that reason, you can use techniques that create a surrogate model, or an approximation of the error as a function of the hyperparameters. Bayesian optimization is one such technique. Here we can see an example of a Bayesian optimization algorithm running, where each dot represents a different combination of hyperparameters. We can also see the algorithm's surrogate model, shown here as the surface, which it is using to pick the next set of hyperparameters. One other really cool thing about Bayesian optimization is that it doesn't just look at how accurate a model is; it can also take into account how long it takes to train. There could be sets of hyperparameters that cause the training time to increase by factors of 100 or more, and that might not be so great if we're trying to hit a deadline. You can configure Bayesian optimization in a number of ways, including expected improvement per second, which penalizes hyperparameter values that are expected to take a very long time to train.

Now, the main reason to do hyperparameter optimization is to improve the model, and although there are other things we could do to improve it, I like to think of hyperparameter optimization as a low-effort, high-compute type of approach. This is in contrast to something like feature engineering, where you have higher effort to create the new features but you need less computational time. It's not always obvious which activity is going to have the biggest impact, but the nice thing about hyperparameter optimization is that it lends itself well to overnight runs, so you can sleep while your computer works. That was a quick explanation of hyperparameter optimization; for more information, check out the links in the description.
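As a rough MATLAB sketch of tuning the two SVM hyperparameters mentioned above (box constraint and kernel scale) with Bayesian optimization — my own illustration on made-up data, with an assumed evaluation budget, not the video's setup:

% Illustrative Bayesian hyperparameter optimization for an SVM (synthetic data).
rng(7);
X = randn(300, 4);
y = double(sum(X.^2, 2) > 4);                          % fake binary labels

svmTuned = fitcsvm(X, y, ...
    'KernelFunction', 'rbf', ...
    'OptimizeHyperparameters', {'BoxConstraint', 'KernelScale'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'Optimizer', 'bayesopt', ...                   % surrogate-model-based search
        'AcquisitionFunctionName', 'expected-improvement-per-second-plus', ...
        'MaxObjectiveEvaluations', 30, ...             % assumed evaluation budget
        'ShowPlots', false));

disp(svmTuned.HyperparameterOptimizationResults)       % best BoxConstraint and KernelScale found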
The phrase "machine learning" brings to mind complex algorithms that use lots of computations to train a model, but computations on embedded devices are limited in the amount of memory and compute available. Now, when I say embedded devices, I'm referring to objects with a special-purpose computing system, so think of things like a household appliance or sensors in an autonomous vehicle. Today we'll discuss the different factors to keep in mind when preparing your machine learning model for an embedded device. Different types of models require different amounts of memory and time in order to make a prediction. For example, single decision trees are fast and require a small amount of memory. Nearest neighbor methods are slower and require more memory, so you might not want to use them for embedded applications.

Another thing to keep in mind when determining which models to use on an embedded device is how you will get your model to the device. Most embedded systems are programmed in low-level languages such as C, but machine learning is typically done in high-level interpreted languages such as MATLAB, Python, or R. If you have to maintain code bases in two different languages, it is going to be very painful to keep them in sync. MATLAB provides tools that automatically convert a machine learning model to C code, so you don't need to manually implement the model in C separately. So what if, after converting a model to C, you find out that it isn't going to meet the requirements of your system? Maybe the memory footprint is too big, or the model takes too long to make predictions. You could try other types of models and see if the code meets the requirements; maybe start with a simple model such as a decision tree. Alternatively, you could go back earlier in the process and see if you can reduce the number of features in the model. You can use tools such as neighborhood component analysis, which is useful for determining the impact that the features have on the results. If you see that some features are weighted low, you could drop them from your model, making the model more concise. Certain types of models have different reduction techniques associated with them; for decision trees, you can use pruning techniques, where you drop the nodes that provide the smallest accuracy improvement. Depending on your use case, any of these tactics may be appropriate. Hardware considerations, network connections, and budget are all key factors that will influence design decisions. This was just a quick overview of embedding machine learning models. For more information on preparing models for embedded devices, see the links below.
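To make the MATLAB-to-C workflow above a bit more concrete, here is a rough sketch of my own (assuming MATLAB Coder and the Statistics and Machine Learning Toolbox are available; the training data, file names, and function name are made up) showing how a trained model might be saved and turned into a C prediction function:

% Illustrative MATLAB-to-C workflow (a sketch; names and data are assumptions).
rng(9);
Xtrain = randn(200, 10);                               % fake training features
ytrain = randi([0 1], 200, 1);                         % fake labels
treeModel = compact(fitctree(Xtrain, ytrain));         % compact model: smaller memory footprint

saveLearnerForCoder(treeModel, 'treeModelFile');       % save in a code-generation-compatible format

% Entry-point function to generate C code from (would live in its own file, predictActivity.m):
%     function label = predictActivity(features) %#codegen
%         model = loadLearnerForCoder('treeModelFile');
%         label = predict(model, features);
%     end
%
% Then, from the MATLAB prompt, generate a C library for it:
%     codegen predictActivity -args {zeros(1, 10)} -config:lib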
Info
Channel: MATLAB
Views: 24,265
Rating: 4.9488115 out of 5
Keywords: MATLAB, Simulink, MathWorks, machine learning tutorial, what is machine learning, machine learning algorithms, deep learning, machine learning, machine learning basics
Id: HNKb4Q72KpA
Length: 31min 55sec (1915 seconds)
Published: Wed Mar 04 2020