Deep Learning vs Machine Learning in R

Video Statistics and Information

Captions
So now you see my title: today we're going to talk about deep learning and machine learning in R. By now we've all seen this Venn diagram, right? You have to have this. It's by our very own Drew Conway, the original progenitor of the R meetup, and it's like a rule that you need to show it at a data science meetup. There's even data science bingo, and this is one of the slots you need to fill. But there are rumblings of a new Venn diagram. This one comes from the Deep Learning book, and what it says is that AI is the encompassing field, machine learning is a subset of that, and within machine learning you have deep learning. Now, some people say that deep learning is its own thing, not comparable to the rest of machine learning; other people say it's just one technique of machine learning. We're going to dive into both of those for a little bit.

But first, when you're getting into these fields, you need to learn about them, so we need some good books, and there are plenty of good books in both fields. On the theoretical side you have The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, and then you have Deep Learning by Goodfellow, Bengio and Courville. These are both math books: heavy on math, heavy on theory. If you want to learn how to write the code, we have Applied Predictive Modeling by Max Kuhn (are you here, Max? right there) and Deep Learning with R by JJ Allaire (where's JJ? he's somewhere around here). Those two books are just code, so if you want to learn the theory first, read the first two; if you want to learn code, read the other two. I highly recommend all four of them; they're really, really good reads. Take one with you when you go to the beach, excellent for that.

Now, the two disciplines do the same things but use different words. For some reason there's been a massive split in the terminology used between machine learning and deep learning, and it can be infuriating. Let's see some examples of that. First up we have linear regression, because that solves the majority of your problems. And it looks like we don't have internet for my slides, so my LaTeX math is gone... oh no, wait, the LaTeX math is there, I just need to look at my screen. Wonderful. y = a + bx, simple linear regression. We've all seen this, I hope. If you haven't seen it in statistics terms, it's slope-intercept form from middle school. We have a and b, but those terms have very different names depending on what field you're in: machine learning calls a an intercept, deep learning calls it a bias. Then the b term, the part that comes after: machine learning calls those coefficients, whereas deep learning calls them weights. That's a problem, because in statistics a bias is something completely different, as in the bias-variance tradeoff, a hugely important concept, and weights in statistics are how much you value each row. So they're not just inventing new terms, they're reusing existing terms, and that can be very difficult to get your head around.

Then we have this curve, the generic S-shaped squashing curve. We all know the math behind it: 1 / (1 + e^(-x)). Depending on your field it's either an inverse logit or a sigmoid; apparently we needed a new word because "inverse logit" just didn't sound cool. Then we have L2 penalization, which is incredibly important and shows up in a lot of algorithms, and beyond machine learning and statistics it shows up everywhere. It's essentially the sum of the squares of the terms; the penalty is the L2 norm, the squared L2 norm actually. In machine learning that's ridge regression, and in deep learning it's weight decay.
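To make the shared math concrete, here is a minimal R sketch; the function names inv_logit and l2_penalty are mine, purely for illustration:

inv_logit <- function(x) {
  # "inverse logit" in statistics, "sigmoid" in deep learning
  1 / (1 + exp(-x))
}

l2_penalty <- function(beta) {
  # squared L2 norm of the coefficients:
  # the "ridge" penalty in machine learning, "weight decay" in deep learning
  sum(beta^2)
}

curve(inv_logit, from = -6, to = 6)  # the familiar S-shaped curve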
Then, when it comes time to score new data, in mathematical terms that's ŷ = f̂(x̃). In machine learning that's prediction, and in deep learning that's inference, and that word is really problematic: deep learning uses it for prediction, but in statistics inference means explanation, actually explaining things. The first time I heard people using "inference" for prediction I thought, that's just wrong. But throughout history there have been lots of terms for the same thing; you just have to learn which one to use depending on your crowd.

So let's get into some R code, which is what we're all here for today. The examples are going to be supervised learning, y as a function of x, which is the majority of what people do. In particular we're going to look at binary classification, because let's be honest, 90% of the time when someone says AI they're doing a logistic regression.

Let's first look at traditional machine learning, and today we're going to focus on just two algorithms, glmnet and xgboost, probably two of the most important and powerful algorithms in machine learning. For either one, the first thing we have to do is set up our matrices. When you're using lm or glm you can just pass in your formula and your data frame and it all works nicely; you can't do that with these algorithms, you need to provide the matrices in numeric form ahead of time. Fortunately the useful package has functions called build.x and build.y that will build them for you. You give them the formula like you would for a regular model, plus your data frame, and in this case I'm setting contrasts = FALSE so we don't drop the baseline level of categorical variables, and sparse = TRUE to get a sparse matrix, because sparse matrices take up less space and compute faster. I'm using the credit dataset, a nice small dataset, tiny data, that everyone's familiar with. And while I used build.x for this, it's theoretically possible to use the recipes package, but the function I wanted wasn't ready by the time I wrote this. Max has promised me it's on GitHub already; thank you, Max, just a little too late for me.

So let's talk about the elastic net, which is a combination of L1 and L2 penalization, and it's implemented in glmnet by Hastie, Tibshirani and Friedman. The really cool thing about that package is that it's actually written in 73 lines of Fortran. That's it. Anyone who's ever written Fortran knows it's a verbose language, and they somehow did that; it's amazing what they did. So let's fit our first model with one line of code. We not only fit a penalized regression, we did cross-validation over lambda: we fed it the x, we fed it the y, told it it's a binomial regression, and said do ten folds, and in one line of code we got all of that done. Nice, simple, easy. When we're done with that, we can visualize the coefficient path using coefpath. If you're used to the glmnet package, you can plot the model, but that plot is static and hard to read, so I used dygraphs to build coefpath, which shows the same thing interactively: you can zoom in, you can hover and see what's happening. As you read it you can see that as lambda, your penalty, increases, the coefficients shrink toward zero and eventually hit zero; that's your minimization. A nice, quick, easy visualization.
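A sketch of that workflow, assuming the credit data sits in a data frame called credit with a binary Status column; the data frame, column, and object names here are illustrative:

library(useful)
library(glmnet)
library(coefplot)

# build numeric matrices from a formula:
# keep all factor levels, use sparse storage
credX <- build.x(Status ~ . - 1, data = credit,
                 contrasts = FALSE, sparse = TRUE)
credY <- build.y(Status ~ . - 1, data = credit)

# one line: elastic net regression with 10-fold cross-validation over lambda
mod_glmnet <- cv.glmnet(x = credX, y = credY,
                        family = "binomial", nfolds = 10)

# interactive coefficient path, built on dygraphs
coefpath(mod_glmnet)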
Now let's say you have one value of lambda and you want to see the point estimates for your coefficients. You can use coefplot, tell it you want lambda.min, and it very quickly shows you the point estimate for each coefficient, much better than a table of numbers out to five decimal places; you can very quickly and easily see the impact of each individual coefficient. Of course, if you were using a regular linear model you would have confidence intervals, but penalized regression doesn't give you confidence intervals. Who cares about that anyway these days? No one seems to care about confidence intervals, right? It's machine learning, it's a prediction that's going to last five seconds, so you don't care anymore. But maybe you should. So that is penalized regression.

The next algorithm I want to talk about is decision trees, and despite the name, decision trees actually do both regression and classification. They're called decision trees not because of the outcome variable, but because of the series of decisions you make as you split up the input variables. There are lots of ways to do this, but we're going to skip straight to xgboost, because it's just about the most awesome thing machine learning has put out recently. Before we actually fit the model we take one more data step: we create a special object, an xgb.DMatrix, which stores the x and y variables together in one object. We do that for both our training data and our validation data. But notice this: they call the x matrix "data" and the y matrix "label". Regression was clearly an afterthought when they were building this, right? Otherwise they would have called them x and y, or predictor and response, or input and output. They were clearly thinking about classification.

Now that our data is ready, let's fit the model with one line of code. Sure, I broke it up over multiple lines so it would fit on the screen, but it's just one line. You give it that special dataset; you tell it you want to use gbtree, and that's cool because if you switch it to gblinear it fits an elastic net instead, so xgboost can fit penalized regression models, but we're going to stick with the tree; you tell it it's a logistic model; I do 500 rounds; I set early stopping; I give it validation data; and I get a nice model. What's really cool is this works on a GPU. Boosted trees are sequential, but the search for the split within each tree can be done in parallel, and you can do that on a GPU, and it can be incredibly fast, even faster than xgboost already is. There's even an experimental feature where you can do a boosted random forest: you fit multiple trees in parallel and then boost those groups of parallel trees. That's really awesome. When you're done with all of this you get a variable importance plot, and you can very quickly see which variables were important in the fitting, not the impact of the variables, but the importance of them. And if you want to try to visualize all 500 trees at once, you can: it does its best to compress 500 trees into a single tree and show you representative splits. Is it understandable in the slightest? No. But it's a good first effort, and we'll applaud them for that.
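Here is a sketch of that fit, assuming training and validation matrices train_x/train_y and valid_x/valid_y built the same way as before; the early-stopping value is illustrative:

library(xgboost)

# bundle x and y together; note the "data"/"label" naming
train_xgb <- xgb.DMatrix(data = train_x, label = train_y)
valid_xgb <- xgb.DMatrix(data = valid_x, label = valid_y)

mod_xgb <- xgb.train(
  data = train_xgb,
  booster = "gbtree",             # "gblinear" would fit an elastic net instead
  objective = "binary:logistic",  # binary classification
  nrounds = 500,
  early_stopping_rounds = 70,     # illustrative; stop when validation stalls
  watchlist = list(train = train_xgb, validate = valid_xgb)
)

# which variables mattered in the fitting
xgb.plot.importance(xgb.importance(model = mod_xgb))

# best-effort compression of all the trees into one representative tree
xgb.plot.multi.trees(mod_xgb)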
But in case you want it even simpler, we can use the caret package, which provides a uniform interface in R for all your machine learning needs. It's incredibly simple and it's been around for over a decade; this is the original AutoML, and you couldn't ask for anything better. The two models I just fit can each be done in a single line of code if you have a formula and a data frame, and the only difference between the two snippets of code is the method: one is glmnet, one is xgbTree. So thank you, Max, for making it so easy. And that's machine learning, very quickly.

Next we have deep learning, and everyone's excited about deep learning now, right? It's all the rage. But I think a better name for it is extreme nonlinear modeling, because that's where you're getting the power from: it's the nonlinearities, it's the activation functions that give you the nonlinearities, and that's where the real strength is. That's also why xgboost works so well; it's capturing nonlinear relationships. The two primary ways of doing deep learning in R are MXNet and keras. MXNet is championed by Amazon, keras and TensorFlow by Google. There's a slight issue in that you can't have both of them loaded in the same R session, but JJ Allaire has promised to fix that for me; well, I'm taking it as a promise. We can do other neural networks in R too: nnet by Brian Ripley has been around for decades, though it can only do a single layer, and you also have CNTK, deepnet, rnn, darch, and RcppDL. There's no shortage of deep learning in R, despite what people say; there are many packages for it.

So first let's rebuild our matrices, same as before, except we can't use sparse matrices, because neither MXNet nor keras/TensorFlow can handle them. That's okay though, because when we have large data we can use iterators and generators to load it into memory a little bit at a time; since these methods use stochastic gradient descent, they don't need all the data at once.

First let's look at MXNet. For a while this was the primary way to do deep learning in R. It's portable to Windows, Linux, Android, and the Raspberry Pi, on a CPU or a GPU. Here's a simple MXNet model. We first do a dropout layer to prevent overfitting. Then we have a fully connected layer, a dense layer, where all the inputs are connected to all the temporary in-between nodes, and we use a ReLU activation function. Then we do dropout again, then another fully connected layer, another round of dropout, and another ReLU, until we get to our last layer, which has just one node because the outcome is binary, 1 or 0, and we attach a logistic regression output, which is the sigmoid output, or the inverse logit. But we're not ready yet; we need helper functions. We first have to write our own custom log loss metric because there isn't one built in, and we use log loss because it's better than 0/1 loss. And if you want to keep track of your results, you need to create a logger to record your metrics as you go.

Now that we've built all of that, we can fit the model. You feed it your x and y matrices; you tell it the optimizer you want, and I chose Adam, adaptive moment estimation; you give it your validation data to see how you're doing; I'm doing this on a CPU; I'm choosing log loss as my metric; I'm doing 50 passes through the data; I set my learning rate; I set my batch size; I tell it how the data is laid out; I add a callback to record my metrics; and then I set verbose and run it. That's a lot of options you need to set just right. When we get the result we can plot it with dygraphs and see how we're doing over time. Clearly this dataset isn't appropriate for deep learning, it's a very messy picture, our loss function is all over the place, but you get the idea: you can see how you're doing, and our model definitely needs to train for much longer if we want to keep at this.
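A sketch of that MXNet network and fit; the layer sizes, dropout rates, learning rate, and batch size here are illustrative, not the talk's exact values:

library(mxnet)

# symbol graph: dropout -> dense -> ReLU, twice, then a one-node logistic output
data  <- mx.symbol.Variable("data")
drop1 <- mx.symbol.Dropout(data = data, p = 0.2)
fc1   <- mx.symbol.FullyConnected(data = drop1, num_hidden = 256)
act1  <- mx.symbol.Activation(data = fc1, act_type = "relu")
drop2 <- mx.symbol.Dropout(data = act1, p = 0.5)
fc2   <- mx.symbol.FullyConnected(data = drop2, num_hidden = 128)
act2  <- mx.symbol.Activation(data = fc2, act_type = "relu")
fc3   <- mx.symbol.FullyConnected(data = act2, num_hidden = 1)
net_mx <- mx.symbol.LogisticRegressionOutput(data = fc3)

# custom log loss metric, since one isn't built in
logloss <- mx.metric.custom("logloss", function(label, pred) {
  pred <- pmin(pmax(pred, 1e-15), 1 - 1e-15)  # clip to avoid log(0)
  -mean(label * log(pred) + (1 - label) * log(1 - pred))
})

# logger to record train and validation metrics as training goes
logger <- mx.metric.logger$new()

mod_mx <- mx.model.FeedForward.create(
  symbol = net_mx,
  X = train_x, y = train_y,
  eval.data = list(data = valid_x, label = valid_y),
  ctx = mx.cpu(),             # running on the CPU
  num.round = 50,             # 50 passes through the data
  array.batch.size = 128,
  learning.rate = 0.001,
  optimizer = "adam",
  eval.metric = logloss,
  array.layout = "rowmajor",  # rows are observations
  epoch.end.callback = mx.callback.log.train.metric(1, logger)
)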
Now let's see the new kid in town and check out keras. The keras package is a port of Keras to R, thanks to JJ Allaire and François Chollet, and by default it uses TensorFlow as the backend. It's primarily meant for Linux, but it does work on Mac and on Windows, and you can use a GPU, mostly on Linux; I've tried a few ways to hack GPU support into Windows, but it's mostly a Linux thing.

So let's design our network. This is a similar network to before, but in keras notation. Our first layer is a layer called layer_dense. We tell it we want 512 nodes, and instead of specifying the activation separately, we specify it right in the layer_dense call. For your first layer you also need to give the input shape: if you take the dimensions of your data and drop the one that's the number of samples, that's the number of features. If you have multi-dimensional tensors you have to give several of these, but for our dataset we just give the number of columns. We then do batch normalization to help with the fit, then dropout, then another dense layer, another batch normalization, another dropout, and finally a sigmoid output.

After we've set up the network we have to compile it. We set the optimizer to Adam again, and this time our loss function is binary cross-entropy, which is the same thing as log loss, but we needed a new word. We use accuracy as our metric. And notice here that you're not saving the result back to an object; this isn't like traditional R, you're modifying the variable in place. Our keras network has now been modified by calling compile on it; it's a different object now, it's been changed, it's been mutated. It's very important to note that this is a little different from standard R.

Now that we've compiled it we can go ahead and train the model. We tell it the number of epochs and the batch size, we give it our validation data again, and then we give it callbacks, because you can't just say "early stopping": you need a callback that at the end of each epoch asks, hey, should I stop? You need a callback for showing your metrics, a callback for changing your learning rate, a callback for TensorBoard. After you've trained the model, this time for only 10 epochs, you can once again see how you're doing over time and compare your training and validation sets. Clearly our model needs a lot of work; we need to train it more and maybe use a different dataset. But that's the point behind this: not every dataset is suited for deep learning.
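A sketch of that keras workflow; the node counts, dropout rates, and callback settings are illustrative, and train_x/train_y and valid_x/valid_y are the dense matrices assumed earlier:

library(keras)

# design the network; the pipe adds the layers in order
net <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu",
              input_shape = ncol(train_x)) %>%  # number of features
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")  # binary output

# compile modifies net in place -- no reassignment
net %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",  # the same thing as log loss
  metrics = "accuracy"
)

history <- net %>% fit(
  x = train_x, y = train_y,
  epochs = 10, batch_size = 128,
  validation_data = list(valid_x, valid_y),
  callbacks = list(callback_early_stopping(patience = 2))
)

plot(history)  # training vs validation over the epochs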
So, wrapping up, what's the big picture? Well, you can use centuries-old terms or you can use brand-new terms. For machine learning there are many, many packages; caret alone has 238 algorithms you can choose from. For deep learning there are about seven, but deep learning has only been around for a few years, so that's a very good start. Machine learning you can mostly do in one line of code, but deep learning takes many, many lines, so you have to know what you're doing; it's more verbose. For machine learning we have caret, thank goodness; for deep learning caret exists, but it can only do simple networks, it can't really do extreme deep networks. With machine learning you might get some explanation out of your models if you're using a linear model or maybe a single decision tree; deep learning is just for predictions. Machine learning already gives you strong predictions, and deep learning might give you better ones, and when it does better, it really does better. And where they're both the same: they both have some really, really great books, like the ones I showed earlier. I highly recommend all of them, and both fields still have a ton of excitement around them; everyone is very deeply excited about all of this. So thank you very much.
Info
Channel: Lander Analytics
Views: 7,079
Rating: 4.9823008 out of 5
Keywords:
Id: TX2qgtGt3r8
Length: 19min 54sec (1194 seconds)
Published: Wed Aug 15 2018