Guide to AI algorithms - MFML Part 4

Captions
my brave survivors you don't want to play outside we talked all about how you don't need to know anything about what's under the hood of these things it's raining i am ever so slightly more entertaining than being stuck in the rain that's good as long as i can maintain that everything all is well so let's open this black box then and have a look at how some of these things go so we've already looked at linear regression i moved it earlier in our menu and now we'll go through in this order and we'll just get a taste we'll just get a flavor so as with regression we don't go into the nitty gritty because i'll leave that for homework for you enjoy the code enjoy the textbook we're just gonna get like a feel for it together very quickly so first things first clustering in a nutshell this is about birds of a feather flocking together our example algorithm that we will look at is k-means and we've already seen this idea earlier in the course here are some photographs and then we said put like things with like and then a clustering algorithm made two clusters based on cat 1 and cat 2 or maybe based on is the tongue sticking out or is the tongue not sticking out sitting standing cat selfie not cat selfie or what have you that's clustering algorithms so for our example do you see any clusters here i see some clusters by eye sure something like that yeah okay could we have done this without having labels that is a trick question to see if you are paying attention there are no labels here so of course this is unsupervised learning all we have all we know about is the ingredients and we are finding those clusters ourselves could we have done this without using our eyes sure otherwise what are we doing in this section so welcome to k-means the k in k-means stands for the number of clusters that you are requesting so this is up to you when you are using the algorithm you are ordering as if from a menu i would like three clusters please or i would like two clusters please that's the k what about means the means are something like centers of mass or gravity so don't worry about what the math is behind centroids just think of that kind of as a center of gravity and your intuition for it will be pretty good actually i'll be computing them for you but i'll be having you guess first where it's going to be and you'll notice that you'll get the hang very quickly of what this object means so the technique begins by sprinkling some labels throughout our data entirely at random boom we have a blue cluster and a red cluster what do you think of the clustering looks good nice clusters yeah i don't like them either let's compute the centroids shall we first do you think that they will be close together or far apart the blue centroid and the red centroid close together sure because there's quite a lot of points and they were assigned at random there's no particular reason why all the reds have to be in one place with a very different center of gravity from all the blue ones fine where do you think it's going to be slap bang in the middle here or somewhere else maybe pulled a little bit towards where there's more points a little bit down and left here there something like that you'll get closer don't worry as you see more of this okay what's next we will forget the old cluster ids boom forgotten and then we are gonna do something of such astounding brilliance you will never see it coming we are simply going to give each point the color of the centroid closest to it so red or blue for this one red red blue for this one blue how about this one that is why we let the machine do it it is better at this than you are turns out that's right how do you feel about the clustering now we're getting somewhere yeah looking looking better let's keep going let's recompute the centroids and take a moment to guess where they're going to be they're going to be far apart close together far apart and how are you doing yeah okay you kind of get what these things are now and again we forget the labels and again we do the same thing id of the nearest centroid and then we go round and round doing the same thing until the clusters stop changing spoiler alert there will be no further changes this is the last solution so round and round nothing changes these are our clusters job done so that is k-means
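as a rough sketch of the loop just described, here it is in python with made-up two-dimensional points (this code isn't from the course; in practice you'd more likely call a library such as sklearn.cluster.KMeans):

```python
# a minimal k-means sketch with invented data: sprinkle random cluster ids,
# compute centroids, reassign each point to its nearest centroid, repeat
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(loc=[[0, 0]] * 20 + [[5, 5]] * 20, scale=1.0)  # two loose blobs
k = 2

labels = rng.integers(0, k, size=len(points))            # random labels, boom
while True:
    # the centroids are the "centers of gravity" of the current clusters
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # forget the old ids, give each point the id of the closest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = distances.argmin(axis=1)
    if np.array_equal(new_labels, labels):                # nothing changed: job done
        break
    labels = new_labels

print(labels)   # real implementations also guard against empty clusters and track a loss
```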
now standard burning questions from the audience i want to see if i can anticipate them two questions that are burning how do you pick k that's not on my next slide so i'll address that right off the bat it depends if we're doing analytics or if we're doing machine learning so again you can use the same algorithm for different applications if you're doing analytics you don't know what you're hunting for you're just trying to get inspired so what might you do in a loop k equals two anything interesting yet no how about k equals three anything interesting how about k equals four and you look at the results every time and you see if anything feels compelling to you it's a pretty nice approach to data mining if this is a machine learning application you know exactly why you are trying to do this maybe you have some kind of nice app where the users put pictures of themselves and their friends and then based on the photographs for everyone's amusement in some way it splits that into two groups puts party hats of different colors on the two groups and says these are your sports teams for beer pong or something then you know what the purpose of that app is and you've just said it's two teams that you're trying to make so then fit for purpose two will be the split that depends what you're using it for other questions some typical ones i tend to have yelled at me google engineers are like will it stop tell me will it stop will it just keep going round and round because you know what if i get a point that's equidistant between these two clusters it could just like bounce back and forth between the two because the coin keeps flipping oh take a deep breath yes it converges because the actual implementation's a little smarter than this and involves of course a loss function and the loss function has to do with the distance of the points to the nearest centroid and so even if the point is equidistant well the distance computed is still the same so the scoring for the loss function is still the same so even if the point flip-flops from blue to red it still considers itself to be finished and will stop and give you the answer how about same result every time what do you think oh fight fight the answer is no the way that you do the random sprinkling of the labels does matter imagine if you have 50 data points and you ask for 49 clusters try that out you will notice that you don't get the same result for which one has two points in it every time so i have a try it first slide for all our methods and here you will try it first if you have no labels this tends to be the first unsupervised learning thing you go for if you have no particular reason to suspect you should be using a different one and this is where you want to
split your instances into groups another kind of unsupervised learning approach is one where you are looking for anomalies what is unusual that's a different thing we won't talk about it here this one is where you're going for separating things into groups now strangely i can never understand this students get k-nn confused with k-means and i don't know why they do this because these techniques have nothing in common except that maybe if you squint just right the two n's together look like an m that is about all these techniques have in common so let's see what k nearest neighbors is it's part of a class of algorithms called lazy learning oh so cute and what is lazier than having your recipe be the whole data set i'm not going to bother to summarize it i'm just going to use the whole thing so let me show you an example here is a desk area we've got engineers and managers sitting in this desk area we have the north south coordinates and the east-west coordinates and we'll see this thing in action with k equals five so now the technique reads five nearest neighbors you have no idea what's coming that k by the way it is a hyperparameter and so we'll need to tune it and we'll see that shortly but for now meet heather and sophie that's where they sit and we will use five nearest neighbors to diagnose them as engineer or manager first sophie we will find her five nearest neighbors in space there they are and all we do is we say what's the most common label let's give her that label so we say she is an engineer then we do the same thing for heather and that manager is just closer than this engineer and so heather gets the manager label now of course k matters a hang of a lot which is why you will want to do hyperparameter tuning and the way you can tune it is simple start at one go to two go to three if you wish hyperparameter tuning is easy for this case it gets hard when you've got lots of different hyperparameters that you have to deal with and it takes a long time to re-run your model and now you can't just brute force find the right answer now let's look at the effect of k if we set k to 15 there is nowhere that poor heather can hide she will get the engineer label just simply because there's more engineers if we set k equals one and we have my desk nearby and i'm a decision scientist then a statistician whatever i am then poor heather gets that same title decision whatnot even though there's only one of me and so k equals one is where you get judged by your one weirdo best friend rather than the diversity of the friendships that you hold so you will want to tune and see what's actually working for you so this has nothing to do with k-means this is a supervised learning technique all the following techniques will be supervised learning it was only k-means that was unsupervised in this list so when do we try it first if we've got labels they can be any labels they don't have to be categories they can also be numbers we can find my five nearest neighbors and use them to predict salary by finding the median salary among those nearest neighbors right same kind of reasoning and of course we don't only have to do geographical stuff with it we can say who are my nearest neighbors in the audience in terms of shoulder width maybe and torso length and we can use that for sizing a t-shirt use that to predict t-shirt size and it doesn't just have to be two features we can do this for more features
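here is a minimal sketch of that nearest-neighbors vote in python; the desk coordinates, labels, and the query point standing in for heather are all invented for illustration:

```python
# lazy learning in a nutshell: the "model" is just the stored data set plus a vote
import numpy as np
from collections import Counter

desks = np.array([[1, 1], [2, 1], [2, 3], [3, 2],        # north-south, east-west
                  [6, 5], [7, 6], [8, 5], [7, 4]])
labels = ["engineer", "engineer", "engineer", "engineer",
          "manager", "manager", "manager", "manager"]

def knn_predict(query, k=5):
    # find the k closest desks and hand back the most common label among them
    distances = np.linalg.norm(desks - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

print(knn_predict(np.array([5, 4]), k=5))   # k is a hyperparameter: try 1, 5, 15 and tune
```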
but careful this really requires a tall skinny data set it doesn't like situations where you've got too many features and not enough instances it really suffers from what is called the curse of dimensionality so what is that thing such a machine learning term it's great i like to think of it as some methods really don't like it when data points are lonely in space so let me explain if i shine a bright light from that wall onto there and it projects all your shadows against that wall so now your shadows are in one dimension you are all on top of one another and not lonely at all add another dimension along here and now you're a little more spread out now i'm going to add another dimension which floor of this building are you sitting on it's still the same east-west but now the height varies all of a sudden you look around you where are my friends it's lonelier now i add another dimension which hour of the day are you doing your vigil now it's really lonely and so every time i add a dimension i don't require only the same amount of data to compensate for it i need exponentially more people for you to have as many friends around you as you do now and some methods are really sensitive to loneliness and so they are really sensitive to the addition of one more dimension things look very empty around the data point that is what the curse of dimensionality is about okay this is great for fast model updates what can be faster than finding out what heather is and just putting that row underneath the rest of the data set there there's your updated model this is good when local structure is key when you think that maybe if there's a few statisticians around other ones just pop up like mushrooms rather than more statisticians as we head towards the north pole if you're dealing kind of with the first one this is a good bet and in order to pull it off you have to be able to actually store and query the whole data set so that shouldn't take forever next support vector machines that sounds crazy cool and difficult and you'll be so disappointed to know that all it is is about building walls in your data now for this section i'm going to do a fun kind of thing we're going to develop the method step by step so we'll go through actually multiple methods in this class of things maybe four of them each time we get an upgrade and it's sort of a historical note also how they were upgraded through history and the rules for this section are that we are only allowed to use one straight thing to separate our data so maybe it's a line maybe it's a plane maybe it's a hyperplane if we want to do curvy stuff we have to go to a different method that's allowed it's just not in this section so our application we have a map and we have only some of the buildings on that map highlighted here the blue ones are shiny modern skyscrapers the pink ones are gorgeous old historical buildings and you are a city planner your job is to make a neighborhood boundary where is the historical neighborhood where is the modern neighborhood and what shall we do to separate them how about put a line not that line not that one how about that one okay you'll accept this one will you what we've just done is tried on various lines until we got the first one that works and we accept that that method has a name stop when you get one that perfectly separates them and its name is perceptron for that you get this name seriously machine learning people
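a sketch of the perceptron idea under the hood, with invented building coordinates: keep nudging a straight boundary until nothing is misclassified and accept the first line that works (this loop only terminates if the data really are linearly separable):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 3.0],        # historical buildings
              [6.0, 5.0], [7.0, 6.5], [6.5, 4.5]])        # modern buildings
y = np.array([-1, -1, -1, 1, 1, 1])

w, b = np.zeros(2), 0.0                                    # the line's direction and offset
changed = True
while changed:                                             # stop at the first perfect separator
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:                         # this building is on the wrong side
            w, b = w + yi * xi, b + yi                     # nudge the line toward fixing it
            changed = True

print(w, b)   # any separating line is accepted, not necessarily a sensible one
```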
now you could have picked that solution you could have picked this solution you could pick that solution there's lots of options how do you choose maybe you have a preference maybe you want to have a principled selection between these well i submit to you powerpoint skills why that thickness of line and not that thickness of line or maybe if we stick to the city planner example you know that one day your successor is going to put some kind of path where you have drawn the boundary and you don't know if they're going to put a bicycle path or a highway or what but you notice that for some thicknesses of roads here it smashes buildings and you still have a heart when you think maybe you don't want to smash buildings unnecessarily and so you say to yourself some of these options are unacceptable i'm not going to consider them and looking at this you're like actually there's a way i can squeeze a slightly larger road in here there it is the biggest one i can put and the line in the middle of that that's the one i'm going to go for and that is named very creatively i'm guessing it was statisticians maximal margin classifier because that is exactly what it does it attempts to have the maximum margin around the boundary and so that's our next method upgrade what are support vectors i would have called them points that matter see if i take this point and i smash it it doesn't change my solution in the least this one i can move it off the map as long as i don't move it too close to the road it doesn't matter to my solution but these ones over here if i move them i get a different road those are my support vectors and this method is going to think about putting boundaries taking them into account now r this is a plot in r r is known for its beautiful plotting and this is not it i don't know what part of the 80s coughed this up like what colors are these what's cyan and magenta and why is this straight line pixelated in this decade the point though is that r is really good for prototyping and this is what you can get with about two lines of code and with a few more you can get the beautiful plot but two lines of code and you fit and plot a basic support vector solution it's pretty nifty it doesn't take a lot of effort whereas if you try to do it in some more serious production worthy thing then it would take you far far longer than these two quick lines so you try it out in this kind of environment or some python package if you prefer that and then you go to the more serious thing when you're like okay this kind of approach is working for me what if your historical buildings have no respect for anything look at them there they're in the middle of our modern neighborhood how dare they we statisticians have a very creative name for this situation wait for it not linearly separable as in you cannot use a line to separate the red from the blue and the previous two methods require you to have linear separability so they cannot work so we must upgrade and our upgrade is to the support vector classifier the support vector classifier is the kind of solution that most city planners would roll around to eventually okay we'll let them violate the line but we'll tax them for it and so now we simply add a tax to the loss function things can go on the wrong side but they do get penalized and how much you tax them is a hyperparameter that's up to you that you're going to have to tune so here we go in r we can see that we have not got perfect accuracy there but i much prefer this solution to a horribly overfitted one that tried to do some contorting around these two so that it did have perfect performance i would look at that and be like hmm is that really going to work on the other buildings that we've left off the map this is a pretty sensible and reasonable boundary so that's our support vector classifier
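in python the soft-margin version looks something like this sketch (the slide's version is the two lines of R; the coordinates here are invented, and C plays the role of the tax you tune):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.5],   # historical
              [6.0, 5.0], [7.0, 6.0], [6.5, 4.5], [2.5, 2.8]])  # modern, one rebel downtown
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# C is the tax on points that end up on the wrong side of the margin: a hyperparameter
model = SVC(kernel="linear", C=1.0).fit(X, y)
print(model.support_vectors_)      # the points that matter: moving these changes the road
print(model.predict([[3.0, 3.0]]))
```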
the worst possible situation is where the modern buildings grow up around the historical buildings this is terrible because for this whole section we are only allowed to use one straight thing to separate the red from the blue where are we going to put it okay let's try out our previous support vector classifier and it suggests because we have more modern buildings than historical buildings why don't we put the line up there in the top right hand corner at plus infinity plus infinity call everything modern and be done with it it's a little better than calling everything historical and being done with it can we do better we want one more upgrade to support vector machines and what we will rely on is called the kernel trick and to explain the kernel trick to you i have a prop my trusty cloud it's a cloud towel oh yeah my towel of the kernel trick now i need you to use imagination do your best and imagine this same thing on my towel red ones in the center blue ones around them and you need to use a straight thing a line a plane a hyperplane whatever to separate the red from the blue and you look at this and you look at this and you get angrier and angrier and greener and greener until eventually you have become the incredible hulk and the only solution you can think of is to punch the thing because it angers you and what you do is you punch the space adding an extra dimension and converting your other axes and now i can use a plane to decapitate this thing because the red is now above the blue so that is what the kernel trick is and let me show you another plot of that so here i had some data points i then make a punch and project them into more dimensions and in this new space i can use a plane to cut so i'm still using a straight thing and i'm still using support vectors it's all the same business from the previous step it's just i have transformed the underlying space first and then once i've done the cutting i can project my solution back down to the original coordinates it looks like i used a circle i really didn't i used a straight thing i just punched the towel first so when frustrated punch the towel and at this point my google engineers are like this is the most technical thing you have talked about all day tell me more about kernels give me all the equations i have to be like no read a textbook that's not in the scope of this thing but then i think okay i can't just leave them hanging i should maybe point them to an educational resource and i thought to myself the internet is a magical place a marvelous place this is called the kernel trick yes i wonder i really wonder is there a website called oneweirdkerneltrick.com and it turns out that there is such a website and it is in fact run by someone with a sense of humor who has put real learning resources on there that for example is vladimir vapnik who's one of the co-authors of the original support vector machines paper and the other co-author is our head of research in new york corinna cortes and her face is also on there along with a lot of other stuff you can go read if you're interested so oneweirdkerneltrick.com and when we do an svm plot in r we can see that we have much better training performance only one mistake so that's our final method we're like the incredible hulk we punch the space because we get so frustrated we add a dimension we transform the original dimensions also as part of doing that and then we chop it still with a straight thing project it back down call it done
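and the kernel-trick upgrade is roughly a one-word change in that kind of code; make_circles below stands in for the towel, with red in the middle and blue around it (again a sketch, not the course's own code):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# the rbf kernel quietly does the "punch": it cuts with a straight thing in a
# transformed space, which looks like a curved boundary back in the original one
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(model.score(X, y))   # training accuracy only; still check it on held-out buildings
```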
so when are you going to try this thing if you've got binary labels if you want a flexible boundary shape not just linear and this is a solution that lends itself best to moderate scale if you're working with really large scale there's another method we'll look at later logistic regression where at google we have done more engineering effort to make that one scale friendly than this one so you'll prefer logistic regression at larger scale and now we'll have trees tree-based methods decision trees random forests let's have a look at what's going on here these are in a nutshell a bunch of if this then thats for your data and so i thought why don't i manufacture an if this then that rule the traditional programming way about my movie preferences so you can all recommend movies to me so i thought about what i like and i came up with this rule if it is a mystery a thriller or a sci-fi movie hand it to me i'll like it and then i thought but am i not a data scientist come come now should i not maybe add data and see what the performance is of this rule this algorithm or recipe that i have because it doesn't matter whether it came from a machine learning system or i dreamed it out of thin air i can still assess the performance the same way see what it predicts see what the truth is and look at the results so what i did was i grabbed the imdb database i shuffled it and then until i couldn't be bothered anymore and my attention span is about 200 or so i rated it like it don't like it like it don't like it like it don't like it i gotta watch that one like it like it don't like it anyway and i have for you the results from the training set which is something like 100 movies let's find out how this rule does turns out precision is pretty good and i'm aiming for precision here you know like with the books i'm pretty busy i'll never be able to watch all the movies i like so let's have a don't waste my time kind of approach here i can leave out some good ones but if you suggest it hopefully it's likable and if you follow that rule three out of four are gonna go down nicely it's a relief to me to see that recall is low that suggests that maybe this doesn't cover the full diversity of my movie preferences if this were 100 percent recall it means i have a very one track preference just that nothing else then i thought all right why don't i let machine learning have a go and i'm going to ask for a really simple model based on any of the features in imdb but a decision tree with just one node i'm going to show you the rule on the next slide i want you to look at it in this order first watch the precision go up then watch the recall go up then read what the rule is while i stand here cringing here we go if it's long i'll like it above 127 minutes yes otherwise no now what does this mean well from a data science standpoint the only thing it means is that in these data this rule outperforms my other rule there is no other thing that it means if i'm tempted to start spinning stories all about how yeah maybe this means something about directors investing more effort in longer movies and so the quality of the movie is whatever or some story about how cassie is so pretentious then if i'm tempted to start making such stories i need to be aware that i'm in danger of overfitting and i shouldn't take myself seriously unless i go and carefully articulate that story as a hypothesis and then test it in another data set that i didn't use to form this impression and if it holds there then it holds
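scoring a hand-made rule works the same way in code as scoring a learned one; here is a sketch with a tiny invented movie table rather than the real imdb pull:

```python
from sklearn.metrics import precision_score, recall_score

movies = [  # (genre, runtime_minutes, liked)
    ("thriller", 130, 1), ("comedy", 95, 0), ("sci-fi", 142, 1), ("drama", 170, 1),
    ("mystery", 100, 0), ("thriller", 88, 1), ("romance", 105, 0), ("sci-fi", 150, 1),
]

def hand_rule(genre, runtime):
    # the rule dreamed up out of thin air: mystery, thriller or sci-fi -> predict "like it"
    return 1 if genre in {"mystery", "thriller", "sci-fi"} else 0

truth = [liked for _, _, liked in movies]
preds = [hand_rule(genre, runtime) for genre, runtime, _ in movies]
print("precision:", precision_score(truth, preds))   # of what it recommends, how much is good
print("recall:", recall_score(truth, preds))         # of what i like, how much it catches
```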
now i didn't have to have such a simple rule i could have gone for a slightly more complicated one so let's have two nodes ah dear i'll like it if it's long but if it isn't let's have a thriller slightly lower precision but unfortunately this now covers two-thirds of my movie preferences in that set so simple creature simple creature but i could have an even longer rule a lot of decision tree algorithms if you don't tell them to stop they will just keep going and going and they'll make you quite the tree let's read this one it says is the runtime more than 132.5 minutes if yes then is the runtime more than 143.5 minutes if yes is the runtime less than 165.5 minutes if yes is the runtime less than 156.5 minutes then she'll like it and your spidey senses are tingling i hope what word is bubbling to the top of your consciousness overfitting yeah it fits these data really nicely but it is probably utter nonsense for the rest of reality this rule is not going to work if i try to predict movie preferences outside this data set even though it works really well in training like that 84 percent precision 88 percent recall now because they tend to be implemented so they make you these big convoluted rules you will need to run a pruning algorithm to snip snip snip back their enthusiasm and you'll say something like no more than this many nodes how do you pick how many nodes well it is a hyperparameter that you're going to need to tune so that's why you do that tuning step decision trees are if this then that rules and the resulting recipes are easy to describe to humans notice how i don't say interpret i'm saying easy to describe i mean it's hard to beat if someone asks you how does this thing make its decisions i say well it's as simple as is it above 127 minutes or below very easy to describe but please don't interpret them without going and following up in other data actually checking your interpretation and a little note about interpretability this is a hot topic for machine learning and ai should models be interpretable and then there's all this fighting and it's such an angsty debate and i watch this and i just feel like can i please translate for both sides why you're miscommunicating all the ones who are anti-interpretation are doing the machine learning side of data science make it work and they're saying come on you don't need to know how it works your whole goal here is make it work so why would you give up performance to be able to ogle that model that has no use to you anyway and the other side is like but how will we get inspired if we don't look in the model and if we can't read it and interpret it well that side is not doing the same work they are doing data mining descriptive analytics they want to have an opportunity to get inspired by data it's an entirely different process that they're after and so both of them are right they're just doing two different jobs if your job is performance don't force yourself to only go for these easy to read recipes get performance and test it properly if your job is to be inspired by data then of course you need to be able to see what it's saying to you and so if your job is data mining descriptive analytics this is a lovely lovely tool i like to do this when i want to get inspired by data even if this has nothing to do with what my final model is going to look like it's still nice to see how it wants to split the space and suggest to me which parts are worth playing with
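fitting and reading a small tree looks something like this sketch, with the same invented movie table and max_depth standing in for the pruning knob you'd tune:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[130, 1], [95, 0], [142, 0], [170, 0],      # runtime, is_thriller
              [100, 1], [88, 1], [105, 0], [150, 0]])
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])                    # like it or not

# max_depth limits how many questions the tree may ask before it starts memorizing
tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(tree, feature_names=["runtime", "is_thriller"]))  # the if-this-then-that rule
```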
now behind the scenes what's going on so here i have linearly separable yellow and green here i have non-linearly separable yellow and green here i have the tree here i have the support vector machine and what we see is what we expect over there linearly separable can be done with a line here it can't what is a tree doing a tree starts with one of the features x1 and says where shall i put my vertical line and the way it makes that decision there's all these equations that you can look at that sound awfully intimidating don't worry about them think of it in terms of mixing cocktails let's say concentrations of stuff it is trying to get most of those metrics have some notion of the purity of the result so look at this glass no matter how fat i make this glass i still have the same concentration of yellow and green until i hit this point and then i have a little bit more green so my yellow concentration has gone down so that's not an ideal place to put the line if i wanted a pure mix of yellow this would be the best place and over there i have the purest mix of green once i've done that split i go alright down this branch anything to do no it's all green so i'm finished it stops short how about over here anything to do well where can i split this one so i get the purest concentration of green on one side and yellow on the other side yeah that one looks good anything else to do nope then i stop and that's kind of how they work and when you have this sort of space it's like trying to tile your poor bathroom when there's a diagonal bit smaller smaller smaller smaller tiles too many branches if you actually plotted this in your training data you might be inspired to do a little rotation of the axis and then it's hole in one so then you can actually solve this with a decision tree because you do a bit of feature engineering you have a lot of situations where lovers of some other method will say hey you don't need that other fancy method look we can do the same thing with our method they just tamper a little bit with the features first to make the features friendly to their method and then their method works so you can do a lot by altering the features or you can approach it by altering the algorithm so when do you try this first when you've got labels when you want an easy to read model or when if this then that seems like a promising structure for your problem what is bagging that stands for bootstrap aggregation and that is not as intimidating as it sounds it also isn't only for trees let's look at it for trees the length of my data set is the number of instances the width is the features what bagging does is instead of trying to swallow the entire data set in one go it at random takes some instances it makes a tree with those it stores the tree then it puts back those instances and it at random takes some other ones so it is sampling with replacement and fitting a tree every time remembering all the trees and then at the end it lets all the trees vote to give you an answer so this one says she'll like it this one says she'll like it this one says she won't now this has some fairly nice properties under certain circumstances it actually has better performance than not doing it but even if it didn't have a better performance a very compelling reason for this is that you might not want to swallow your data all at once you can parallelize all that and handle little bits of your data all separately fit all those trees separately and maybe that's just a lot faster and since the performance may even be better that's some of the black magic behind this the bootstrap aggregation stuff it's worth trying
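a sketch of that sampling-with-replacement loop, reusing the invented movie table; in practice sklearn's BaggingClassifier wraps this pattern up for you:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.array([[130, 1], [95, 0], [142, 0], [170, 0],
              [100, 1], [88, 1], [105, 0], [150, 0]])
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

trees = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))           # sample rows with replacement
    trees.append(DecisionTreeClassifier(max_depth=2).fit(X[rows], y[rows]))

new_movie = np.array([[128, 0]])
votes = [t.predict(new_movie)[0] for t in trees]          # every tree gets a say
print("like it" if sum(votes) > len(votes) / 2 else "won't like it")
```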
now what is random forests now that sounds like some science fiction thing like something from lord of the rings or the hobbit like the forest rises up and marches around random forests is actually kind of like bagging but on the width not the length and usually it's implemented with both of them so you take some sample of the instances and then within them you take some sample of the features so you end up with kind of like a little square of your data and then you fit the tree on that little one and then you do this again and again and again this is a really great technique for data mining if you run your data set through this if all of your features are equally useless because you're doing this random sampling you should expect that you will see your features crop up in the trees with about the same frequency so if you just count how many times the feature was selected by your tree and some of them are more popular than others then that might tip you off to pursuing that feature so you're inspired to do something with that now if we're going to allow things to vote together why stop at just trees why not allow a neural network to vote next to a tree to vote next to a perceptron to vote next to cassie's opinion to vote next to your opinion all together and have all these opinions vote i can change the weights on how much each vote counts of course cassie's vote gets 95 percent of the weight and so forth that is called an ensemble what ensembles are really great at is covering one another's faults if you will so think about it like this if you wanted to do your decision making by one awesome benevolent tyrant who is really really good and on the whole better than everyone else in your organization at decision making that would be fine unless there's an achilles heel a fatal flaw on the whole really good but there's a blind spot and if that blind spot is hit that is critical failure well then maybe you want to have a backup decision maker who is not that much worse at all the other parts but covers this critical failure pretty good at that part and has a different flaw well then we can have them balance one another out and hopefully where they are both decent there's not that much of a performance degradation so on the whole this feels like a better system the flip side though is when you start composing all these various methods together the engineering and maintenance and upkeep and reliability issues become a nightmare so a famous example is the netflix prize it was a million dollar prize that netflix gave for making a cool recommendation algorithm and the solution there was a crazy big ensemble model and of course netflix nobly paid up because that did win the competition gave the best performance but they didn't implement that thing because that was not going to work sanely in production just too much going on all right so what is naive bayes what are bayesian methods in general let's summarize it to this when in doubt use bayes rule now i don't know if you know bayes rule it is a piece of math which if you're not careful you may fall in love with and then it finds its way into all kinds of places in your life maybe you've tattooed it on yourself or maybe you're making art light fixtures like this is an actual light fixture in the form of bayes rule and so not in the spirit of equation but in the spirit of art appreciation let's make sure
we know how to pronounce this formula now there's nothing controversial about this formula if you've heard of classical statistics versus bayesian statistics as a controversial thing that's for the statistical testing that's very philosophically heavy for machine learning though there's no philosophy here this is borrow whatever math you can find lying wherever and if it works it works and doesn't matter and actually if all we're using is this formula there's no controversy at all this is a mathematical formula with a proof it works so how do we read it well whenever there's p with brackets let's pronounce probability of and a and b here are just events events like the coin comes up heads or falls on the floor whatever uh events like your roommate forgot to wash the dishes you know events and this vertical bar that you see over here and here is pronounced conditional on or given that or in a world where so let's practice conditional probabilities real quick we'll pronounce things like the probability that we have a cat given this evidence is equal to blah blah so what does this boil down to computationally it's all about counting things in fact we're going through this quick little demo just to show you the simplicity of the operations involved super simple look at this probability of cat how do we get that how many pictures eight how many of them are cats four fifty percent how about probability of lots of green pixels in a world where cat well very simple we said in a world where cats so why are we bothering with all the other ones we only care about the cat ones and then the formula is simple again it's the the straightforward thing how many total four how many will have lots of green two and there's our answer very easy to make these manipulations so what are we doing dividing ignoring and counting those are very easy for a computer to do as well and that's going to be the point now naive bayes tends to be baby's first spam classifier that's the favorite example uh in machine learning courses when you make the first spam classifier and it's naive in the sense that it says my features have nothing to do with any of my other features the presence of the word free has nothing to do with the presence of the word shipping now that's a little naive we know that there might be some relationship between those words and we know that if we were trying to model the universe carefully if that were our goal this would probably be a bad fit however we're trying to get the thing to work and it might just work even if it's naive so maybe we'll just try it anyway and see how it performs where it's making these clearly ludicrous assumptions fine so what's great about this thing is that it is going to use these elementary operations and the way that these are all composed in a formula i have a little link down there for you in the slides really an undergraduate on their first day of a probability course learns all the components that you can do to put this formula together so if an undergraduate student can do that on the first day no background at all it's not anything too deep it's just composing what we had before and using some simple straightforward manipulations which you can find in the link same principle though counting multiplying dividing ignoring that's it nothing else and finally when it's done you interpret the results as given all this evidence that you've put in that treats all the evidence is completely independent from all the bits of evidence completely independent from one another 
what's the probability that i have a cat versus a not cat and then you can either output the label that is most likely or you can jump for joy that you're dealing with probability outputs because now instead of just simply getting the flat label you can say hey thing how sure are you if you're saying some number that's too close to 50 maybe you don't know what you're talking about and maybe i prefer not to listen to you maybe i prefer to send this to some other process maybe have a human look at it and make the judgment call whereas if the number is very close to 100 or zero then i can go with what the system is saying so naive bayes when are you going to try it first if you have category labels if you have category features like text for example now having category features is no sweat it's very easy to turn your continuous features into categorical ones you have measured all kinds of things about my torso length and width and whatever or you can turn that into small medium large with no problem going from the category in the opposite direction that's the hard one you can always turn your features into categories when there are many features this is a great idea this handles a lot of features and with text the way that data actually tends to be encoded is you have all the possible words that one could meet in an email and you are asking present or absent or how many times present how many times absent so you have a bunch of zeros we saw in this email no times the word hippopotamus and no time's the word tiger one time free and one time shipping and so we have in that row counts of that sort there's a lot of features this can handle that situation and it's also great when you need your code to be simple the operations are super simple now there's a naive version this counts thing there's also a less naive bayes and that one involves crazy cool things like likelihood function priors warning here this is harder but it is so cool and if you're not careful you will really fall in love with things and then like me you might sink several years of your life into this so watch out maybe don't flirt with it too much if you if you don't want to fall for it what's next regression we've already seen regression it's putting lines through stuff it's about finding lines that are as close as possible to your data like our smoothies example and in statistics regression refers to fitting models of this form what is this saying to us if we imagine a spreadsheet all the features or the columns of the spreadsheet so what are we doing we're finding something to multiply the column by every column gets its own thing to be multiplied by and then once we've done those multiplications we add it up and that's the prediction of that kind of form and this has been around for a really really long time why did it take 200 years for us to roll around to this machine learning business if this has been known for a long time well the way you use the algorithm is different what they were going for is fitting the algorithm and describing their universe and there were notions of statistical testing around that algorithm what we are doing is taking that piece of math after we have split our data trying it out to see if it makes a recipe that looks worthwhile when we apply that recipe to a second data set this entire machine learning thing is the result of enough data so that we can afford to split it into multiple pieces and then apply a lot of different stuff to the first piece check if it works in the next piece and then have 
a final safety check in the last piece machine learning is an attitude that is around splitting data and when gauss was starting out with this there wasn't a lot of data to go around when are you going to try this linear regression approach if the output is a number a dosage a dollar amount something like that a calorie count and if the value of the feature is more meaningful than just a threshold and to explain this let me tell you a little story and a little story about developing this course in the early days i didn't have these slides the try it if slides and i remember presenting this in zurich and we have some lovely googlers in zurich who really kind and want to make sure that my flight home is as entertaining as possible so i got some emails from them that went as follows hey cassie we see you like long movies so have you heard of this six hour movie and have you heard of this five hour long movie a very kind but it's a fundamental misunderstanding of what the decision tree model before i was saying that model doesn't say the longer the movie the more i'm going to like it that model says if the movie is above 127 minutes recommend it otherwise don't there's nothing about the value of the feature being more meaningful than the threshold so if they'd really understood that message i should have gotten some people recommending me movies that were 128 minutes also or you know two and a half hours not straight to the five hours that kind of reasoning of every additional minute adds an additional bit of goodness that is reasoning that is consistent with this sort of model this kind of model is the one that says longer is better for every additional gram of carbohydrates we add to the mix we get more calories it's not just like there's a threshold if we're above 40 grams of carbohydrates it's infinity calories already and we should all panic so this one it has the structure of adding a little more gives you a higher prediction now a simple linear regression thing does not transform any data for you it just takes those columns find something to multiply them by and off we go but there are other flavors of regression more interesting complicated ones here are some popular ones you may run into kernel regression you already have a sense of what that might involve now that you know what kernels are the one we'll look at is logistic regression logistic regression is one of the most popular approaches to binary classification and we really love that here at google because it is very nicely scalable some binary classification here at enormous scale we like this kind of approach and we like neural networks also so let's have a look we're going to apply this to an utterly made-up example involving can knowing the hours that a student has studied help us guess whether they are going to pass or fail an exam let's see our data on the x-axis we have the number of hours they've studied from 0 to 40. 
in blue the ones are passes the reds are fails and as you're looking at this you're like of course hours of study has nothing to do with passing and failing of course or are you looking at this and thinking yeah there's something to it i'm seeing that it does look like there's a little more failing on the low end of the study spectrum and a little more passing on the high end so let's see if we can make a model we're supposed to be in a putting lines through stuff section aren't we so let's put a line through it now we would love to interpret this as a probability we've seen already in that previous section that it's nice to get probabilities as output but let's read it if i study for two hours i have what a negative probability of passing that doesn't even make any sense probabilities can't be negative or more than 100 percent so if i insist on getting a result between 0 and 100 i'd better try something else so here's what we're going to do logistic regression and i'm not going to show you what's happening there behind the scenes details details i'm going to show you essentially how you should think about this and especially at scale what you're going to do is you're going to do something like a predicted probability the in-sample probability of passing at each hour and then you're going to do a little bit of converting those individual things to a different scale entirely so you push them through a different function and i've actually grayed out that function it's the logit function you don't have to even worry about what that function is the point is simply this when you have done that transformation does it look like a shape that we are familiar with it looks an awful lot like a line now doesn't it and so now in this converted space hours studied versus this logit whatever on earth it is we can fit a line model to it great so we can use this model to predict this logit thing was that what we asked for no we didn't want a logit so let's undo the damage let's take this model and convert it back to our original coordinate space so it's like flip it fit and reverse it and when we pass this thing this line through another function it looks a little more like it right this lovely s-shaped curve these s-shaped curves they are called what is a psychoanalyst's favorite mathematical function sigmoid freud it's too late in the day for this isn't it so these are called sigmoid or logistic functions and when you see these things in the context of data science pursuits typically that sigmoid serves the function of forcing your output to live between where you have said it must live probabilities have to stay between zero and one okay this is a way to force it to stay there but another lovely thing that you get from the sigmoid curve is something that you might remember if you have ever taken an economics class i have an undergraduate degree in economics so i still remember the nightmare so have you taken who's taken an economics class hey anyone i see a few hands so do you remember really early on there was a law of something diminishing marginal returns exactly this is a way of algorithmically mathematically putting diminishing returns into your model so let's read what it says here if i study for 20 hours pass fail for me is touch and go right close to 50 percent but let's say that i study an extra 10 hours over that look my probability is pretty close to 100 percent of passing i'm pretty sure i'm going to pass so if i put another 10 hours in over that am i going to be again that much more likely to pass no i'm already passing or if you want to put it a different way at 20 hours it's touch and go if i reduce by 10 hours i'm failing so i may as well not study standard undergraduate logic so diminishing returns is encoded nicely in this model
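in code the whole flip-it-and-reverse-it dance is hidden inside the fitting routine; here is a sketch with invented hours-studied numbers rather than the data on the slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[2], [5], [8], [12], [15], [18], [22], [25], [30], [35]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# the sigmoid pegs every output between 0 and 1, and the gains flatten out:
# the jump from 20 to 30 hours buys more probability than the jump from 30 to 40
for h in [10, 20, 30, 40]:
    print(h, "hours ->", round(model.predict_proba([[h]])[0, 1], 2))
```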
so you get to peg your output where you want it you get diminishing returns from your sigmoid functions and this is more like a model of probabilities and again to classify new instances either we can simply assign whichever label is more likely so if it outputs for me point five five then i will call it a pass or i can say 0.55 means you don't know what you're talking about model let's call that the i'm not sure zone and actually ask a human for their opinion so logistic regression is a great tool for binary classification and it works well at scale too including at google scale we use logistic regression a lot around here now it can also work not only for binary labels there's a version that works for multi-class classification and that is multinomial logistic regression so it's not only pass fail or cat not cat you can also use it for cat dog bear elephant giraffe so this one the binary one works for binary labels you can also do multi-class and it's a good choice when you have that sense that diminishing marginal returns is part of what you're dealing with and it gives you probabilities out which is very nice and probabilities are pretty useful for ranking stuff like if you want to serve the user the song that they are most likely to listen to next so for each song you predict the probability that they're going to interact with it and then you can order it by the probability results that's one simple way to do some ranking the last section is neural networks or deep learning i make it sound like there's a choose your own adventure element one or the other they are actually pretty much synonyms deep learning refers to neural networks with more than one layer or deep neural networks and deep neural networks were named when there wasn't much computing power to go around and so we could barely dream of having maybe two layers we stick with one but like maybe one day we can achieve two we shouldn't even hope about three today they've all got many they've all got more than two the real ones the production ones they might have thousands so they're all deep all neural network stuff is deep learning feel free to use these as synonyms all right so we're going to dive into this and a small disclaimer at the end of our less than half an hour section on this topic your understanding of them will not have improved at all hopefully if i do my job though you will understand why that is and you'll be okay with it now feedback i hear from many semester-long courses on neural networks is pretty much the same the student spends an entire semester and also doesn't understand them any better so for the low low price of less than 30 minutes you will get the same improvement in understanding as you might get from an entire semester so shall we dive in what are neural networks nothing but layers upon layers of mathematical transformations not anything magical just a lot of operations stacked together and let's see an example and i don't want you to take this example too
seriously this is just an analogy that's going to get our thinking started so let's have a look in case you didn't realize that it's a really bad idea to let me into the kitchen we are back with a cooking analogy so we're going to have some features and they will be of plant animal or artificial origin and they're going to vary for each individual instance so think of this as the column headings in your spreadsheet so your spreadsheet is sort of turned this always in my example and our first instance has for its plant some broccoli you know how much i like those sardines that's our animal ingredient and for artificial we have mint flavored candy canes because i submit to you that there is nothing natural in these things then what do we have we have an output layer that is going to output something like tasty or not tasty as our label coming out and inside we have some hidden layers which are thus named because they are hidden for your own good because if you look into the hidden layers the abyss stares into you so good advice to leave them hidden but we're not going to take that advice today we're actually going to look inside them and see some of the units in these hidden layers also known if you prefer the more futuristic sounding jargon as neurons and the role of the neuron will be played by a blender so what are we going to do we're going to take our individual ingredients here but we're not going to take the whole ingredient all in one go can of sardines is much too precious we're instead going to take some weight some amount of each ingredient and we will put them all into our blender and we will blend them together and i know what you're thinking at this point you are dying to taste it indeed and think of the taste test as the point where you're going to decide whether uh to allow this unholy mix to move forward or to not use what's there at all this taste test is kind of like the activation function in neural networks we take the thing that we've blended up and we do some kind of transformation to it in this case using our taste buds and asking ourselves is it good enough to move forward or not so think of that little taste test as a as a little twist on the plot you're adding a little bit of non-linearity and i'll tell you why we do that shortly but the taste test we taste it apparently it's yummy and it gets to move forward in its entirety now i'm tired of writing the word weight so i'm going to indicate the weight with the thickness of the line so far so good everyone has blended stuff here in different weights of ingredients here good next we have another unit or blender and we're going to take other weights of these same ingredients blend them up taste it that doesn't go forward and then we'll have another blender we taste that it's delicious and there's more in the next row we blend up these various mixes in various amounts we taste all of them some of these go forward we blend those up in various amounts we taste that and there's our result now all of you have been in the presence of a blender maybe even seven blenders all at once maybe in a store you've all done blending so no part of this is complicated intellectually speaking but if i ask you to express to me what on earth is in this vat especially since now you have to express the taste test procedure as part of that recipe you realize that your head is spinning a little bit see if we didn't have the taste test then you could just say well actually how much of each of these ingredients did we use and you could just 
collapse the recipe down to like why have all these different blenders you just collapse that down to effectively one blender and then it's not a very interesting recipe but with the taste testing now it's something more complicated and convoluted you have to write the whole thing out and if i ask you to explain it to me that is why you don't understand neural networks not because the individual bits are difficult but because there is so much going on they're building out these complicated recipes out of simple components let's see what it looks like in a more realistic setting we have some binary features light switches on and off and we are going to combine them by taking a weighted sum now this input doesn't have to be binary just for our example we'll use some binary data to keep it simple so here are column headings and our first row just says one zero zero we take a weighted sum so our weights are numbers and zero times anything is zero so ignore that our weighted sum result is two and then we are going to send it through an activation function to add some non-linearity so that we won't just be able to collapse all the weighted sums down into a single simple weighted sum so let's use one that we are comfortable and familiar with how about sigmoid freud sorry sigmoid freud so we'll use a sigmoid activation function so all we do is we find the 2 we put it in we get 0.9 you don't have to use that activation function you can go and find your favorite on a long list of them the most popular one these days that the machine learning hipsters will tell you to use is relu which sounds a little like an alien i guess but actually it's the one that i was describing with the taste test it's just asking if the output the weighted sum is above a particular number then keep that whole thing and if it's below that number then drop the whole thing actually the hipsters probably wouldn't even go for relu there's a whole variety of various activation functions i think maybe a more hipstery thing would be the hyperbolic tangent or something else so there's lots to choose from what are you going to use in the end whatever is available in the implementation of the algorithm that you are working with and if there are several options what are you going to do try a few and see how it's going but wait there's more we are going to take more weighted sums and because we're tired of doing these calculations we get the point already we're going to pretend we did them there you go and we send them through our activation function as well and then we do that again in the next layer more weighted sums more activation functions and finally our output which we then turn into a zero because it's closer to zero than to one again you can all take weighted sums you can all put a number through a simple function that was not hard but when you do this enough times and you stack enough of these operations together if i ask you to write out the recipe that got you this final thing that can make you dizzy and if you don't believe me to have real respect for neural networks i would like you to play a game for homework called do i hate cassie yet so here's how it works for our simple neural network here and i've even left a few things out like bias terms and cool stuff like that just for this simple setting write out the recipe in symbols or you can fill in the values if you want write out the recipe that takes you from your benign binary inputs to the final output what is the actual recipe over here not in terms of the operations but written out mathematically with the logs and all that good stuff
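for reference while you play, here is that forward pass as a sketch in python; every weight below is made up, bias terms are left out just like on the slide, and the first unit's weighted sum comes out to 2 so you can see the 0.9 appear:

```python
import numpy as np

def sigmoid(z):                                   # the activation function / taste test
    return 1 / (1 + np.exp(-z))

x = np.array([1, 0, 0])                           # the first row of the spreadsheet

W1 = np.array([[2.0, -1.0, 0.5],                  # 3 inputs -> 3 hidden units (9 weights)
               [1.0,  0.3, -0.7],
               [-0.5, 0.8,  1.2]])
W2 = np.array([[0.6, -1.1, 0.4],                  # 3 hidden -> 3 hidden (9 weights)
               [1.5,  0.2, -0.3],
               [-0.9, 0.7,  1.0]])
w3 = np.array([1.2, -0.8, 0.5])                   # 3 hidden -> 1 output (3 weights)

h1 = sigmoid(W1 @ x)                              # first unit: sigmoid(2) is about 0.9
h2 = sigmoid(W2 @ h1)
output = sigmoid(w3 @ h2)
print(output, "->", int(output > 0.5))            # squash to a 0 or 1 label at the end
```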
and if you don't believe me, to have real respect for neural networks i would like you to play a game for homework called do i hate cassie yet. here's how it works. for our simple neural network here, and i've even left a few things out like bias terms and cool stuff like that just for this simple setting, write out the recipe in symbols, or you can fill in the values if you want, write out the recipe that takes you from your benign binary inputs to the final output. what is the actual recipe over here, not in terms of the operations but mathematically, with the logs and all that good stuff? once you've written that out, add another layer, again with three units in it, and see how much more writing it takes to go from this size with two hidden layers to three hidden layers, and you'll notice that your writing has more than doubled. but do we hate cassie yet? if not, add another layer, and if you don't hate me yet after that, another layer, and another one, and if you get to seven i'm super impressed, because at some point you will be like, i have better things to do with my life, this is an annoyingly long recipe that i have no desire to read or interpret or look at. it consists of simple components, but it ends up being transformations of transformations of transformations, very hard to read, very hard to write down. so you prefer to think simply in these individual components, but then when you're like, but what is actually going on, it's very hard to see, even though each individual little piece is simple. that is why you don't understand neural networks, or rather why the recipe involved in neural networks is so difficult to think about and read and understand, but the operations that get you there, they are not so difficult. and tensorflow makes you program in terms of these graphs and these operations, rather than that recipe all together.
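if you'd rather make the computer do the "do i hate cassie yet" homework, here is a hedged sympy sketch (the weight symbol names are hypothetical, and it measures recipe size in characters rather than writing it by hand) showing how fast the written-out recipe grows as hidden layers are added:

```python
# write out the full recipe symbolically and measure how long it gets
import sympy as sp

x = sp.Matrix(sp.symbols('x1 x2 x3'))          # the three binary inputs

def add_layer(inp, n_out, name):
    # one fresh symbol per connection weight, e.g. w1_0 ... w1_8
    syms = [sp.Symbol(f'{name}_{i}') for i in range(n_out * inp.rows)]
    W = sp.Matrix(n_out, inp.rows, syms)
    # sigmoid applied to each weighted sum
    return (W * inp).applyfunc(lambda z: 1 / (1 + sp.exp(-z)))

h = x
for depth in range(1, 4):                      # 1, 2, then 3 hidden layers
    h = add_layer(h, 3, f'w{depth}')           # another hidden layer of 3 units
    out = add_layer(h, 1, f'v{depth}')[0]      # a single output unit on top
    print(depth, 'hidden layer(s): the written-out recipe is',
          len(str(out)), 'characters long')
```

the printed lengths blow up with every extra layer, which is the whole point of the homework.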
now why all these transformations, what do they get for you? well, let's go back to our city planning example with our historical buildings and our modern buildings, remember that example with our support vector machines? there it was bad enough that we had all our red things clustered in the center and the blue things around them, but we could do that whole kernel trick where we punch through the sheet and we transform it by making the dimensions higher and then cutting it with the plane. that worked. but how are you going to take that approach to the spiral? where are you going to punch it for this to work? or maybe you want to try k nearest neighbors: how many nearest neighbors, and at what point will it break down and start finding neighbors that are from the wrong spiral? or where do you want to put the decision tree here? all of our methods aren't doing very well for this example. now what those transformations do is they torture, torture, torture the underlying space so much that hopefully they pull it apart, so that things get separated under all these crazy transformations and boom, you can cut it again. and what that means is that you can solve problems like that one where your data set is a bunch of pixel values: where are you going to put a line through those pixel values to output cat, not cat? now you could go and maybe do some feature engineering by hand, to go and first ask, is there something like ovals for eyes, and two of them, and look for ears, and it's going to be a bit wobbly if you have this cat burrito with no ears. instead you could feed in a lot of data, and because this thing is so flexible and can do all these transformations itself, it will let you take advantage of those features without you having to explicitly go and find them yourself.

this isn't free, though, there is a price here, because this is such a complex recipe. we saw with the ideal body shape of the data set that the more there is going on in the recipe, the more complex it is, the more features it uses or the more transformations, the more length you're going to need, the more examples it will take to make it work. so this is very expensive, both in terms of how big your data sets need to be for this to be viable and how much computing power it takes, because you see there's a lot of stuff going on. if you don't have a lot of data, you just simply might not be able to take advantage of this really powerful algorithm. now another thing is, when it says it exploits complex structure and it finds those eyes and noses and so forth, this makes it sound like they're gonna do that and show you what they're doing. they're not. behind the scenes it might take advantage of stuff like edges, shapes, triangles, ovals, but you won't know what it's doing. so it'll be up to you to make sure that you really pick your test conditions carefully, really think about the data set, which images are you gonna use, to make sure that this thing actually does work the way that you expect it to work. check your data set to make sure that there isn't something in there that is found instead of the label that you are looking for, like in our earlier example where all of cat 1 had a radiator in the background and all of cat 2 didn't. make sure there isn't something special about your data, especially when you work with methods like this that don't lend themselves nicely to interpretability. it's really, really important to do profiling of your input data, profiling of your outputs, and think about what's going on there. please do check, don't trust blindly.

now, what is there before we begin with this method? from the architecture, the settings that you set up: you set up how many layers we are going to use, is it one, two, three, 3000, whatever. you also choose how many units are in each layer, and it doesn't have to be the same number of units in each one, and you are going to choose the activation function that you'll use. so that's all there before you begin. from each data point you get the actual values, so if this is a row in your spreadsheet, these are the column headings and the first row has 1, 0 and 0 in it, and then as this flows through your network, the values inside the hidden layers are also determined by that particular example. and as your network evolves and changes and learns, because it's learning, it will learn from each individual data point in the simplest implementation, so you might have the same input values for two data points, but because the network has changed in between them, you might have different hidden layer activations. so that's all from the individual instances. so what is the part that we're learning? this is supposed to be machine learning, deep learning, what's the bit that's being learned? what do i hear from you? the weights, that mess. and i haven't even shown you everything here, i've left off the notion of bias units, which are going to add another three to these, i just want to keep it simple. but even with this oversimplified setting, what have we got? we've got nine, nine, and three (i can't count) weights to deal with. for our calories example we also had three features, carbs, proteins and fats, and that was bad enough, we had four parameters to tinker with and we were like, no way, just get on with it, optimization algorithm, we don't want to have to do this by hand. look at this, 21 of them.
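a quick sanity check on that count, in python: for 3 inputs, two hidden layers of 3 units and 1 output unit, the weights come to 21, and the bias terms the talk deliberately left out would add a few more.

```python
# count the knobs in the toy network: 3 inputs -> 3 -> 3 -> 1 output
layer_sizes = [3, 3, 3, 1]

weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases  = sum(layer_sizes[1:])      # one bias per non-input unit

print(weights)            # 3*3 + 3*3 + 3*1 = 21, the number from the talk
print(weights + biases)   # 28 once you add the left-out bias terms back in
```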
now imagine doing that by hand, just take a moment. in comes a data point, and then which one are you gonna change, in what order? in comes a data point, you'll be comparing the output that you want with the output that you got, you wanted a one, it gave you a zero, okay, which weight are we going to change, in what order? turns out that optimizing those weights, you know, before we die of boredom, is actually a challenge, and doing this by hand is just not going to work. so we're stuck, unless we have an optimization algorithm that can do it quickly, and that is what the backpropagation algorithm is all about. for your purposes, backpropagation is the reason that you get solutions here, you know, in this lifetime rather than in the old age of the universe. you don't need to know much about it as practitioners, because it's already implemented in those packages for you, you don't need to go reinvent the wheel. so even though it is part of day one of every neural networks course ever, if you're finding it tedious you can just ignore that, because it's already implemented, just think of it as the reason that you don't have to wait forever for the results. i remember a fun moment where my friend, who has a phd in machine learning, who is a researcher, was saying, oh yeah, i have to go teach backpropagation tomorrow so i can't go drinking, because i have to go relearn it. turns out you could be a career machine learning researcher and have entirely forgotten it, but somehow it's part of the hazing every time. so if you take a course and you're like, i'm not getting it, it seems annoying, do i have to fail out of the course or can i just ignore this one and hope it goes away? if you're going for how to be an applied person, you can probably get away without worrying about it too much.

let's see what's happening. we take an instance and we send it through the network, just as we sent those numbers through earlier, to get a label out on the other side, and we call that forward propagation. then we look at the label, we compare it with what the true answer was, and we see a mismatch, and of course our first response is, whom do we blame? well, to figure out where we should start making adjustments, we essentially propagate the effect, the echo, the influence, of that error back through the network, to find where and by how much we should make the adjustments. that's what that backpropagation algorithm is doing for you. and if you do end up looking into it, if you really want to know what it is mathematically: remember calculus? remember the chain rule? do you like the chain rule? i hope you do, after this you'll like it out of stockholm syndrome, because backpropagation is just the chain rule, a lot. and then the adjustment is made, and now you take your next instance, forward propagate it through, get the label, get the mismatch, back and forth, back and forth, back and forth, like sewing with a needle, and as those weights adjust, the solution becomes more and more of a good, sensible recipe for you. that's the simplest implementation; there are more interesting ones where you can train on many instances at once.
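for the curious, here is a minimal sketch of that chain-rule bookkeeping for a single sigmoid unit with a squared-error loss, in numpy; the input, weights, target and learning rate are invented for illustration, and real backpropagation just repeats this link-by-link multiplication through every layer (the packages do it for you).

```python
# one forward pass, one backward pass, one weight adjustment
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 0.0])       # one instance
w = np.array([0.5, -0.3, 0.8])      # current weights
target = 1.0                        # the label we wanted
lr = 0.1                            # learning rate

# forward propagation
z = w @ x                           # weighted sum
y = sigmoid(z)                      # activation
loss = 0.5 * (y - target) ** 2      # how wrong were we

# backward propagation: the chain rule, link by link
dloss_dy = y - target               # d(loss)/d(y)
dy_dz = y * (1.0 - y)               # d(sigmoid)/d(z)
dz_dw = x                           # d(z)/d(w)
grad_w = dloss_dy * dy_dz * dz_dw   # multiply the links together

w = w - lr * grad_w                 # nudge the weights to shrink the error
print(loss, grad_w, w)
```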
now you'll of course need to start with an initial selection of weights, so here's what not to do: setting them all to zero. anything times zero, you already know the answer, the punchline will be that everything is zero and no learning will happen. the next worst thing that you can do is set them all to the same number, because fundamentally what this is doing is asking, well, whom do i blame? you're looking for the most heinous looking culprit, and if they all look the same, then there's no one to blame, and then all of a sudden no more learning happens and you have a problem. and that's a kind of noob gotcha that happens, either setting it yourself that way or having it just accidentally end up learning itself into symmetry, and then there's no one to blame, and then all of a sudden learning stops and you're like, what happened? and you go to your senior machine learning sensei and they're like, oh, don't worry about it, just whack the tv one time, just restart it and see if it's okay, and you're like, what is this black magic, this makes no sense, but i just whack it and it works. well, whack it with another selection of random weights, and hopefully you don't end up with that bad symmetric situation, and off we go. so symmetry is not your friend here, and if you end up in a situation with too much symmetry, then you might see that learning got stuck and isn't happening. another reason it may stop behaving itself, stop learning, is that maybe somewhere in there you're doing a lot of multiplication over a lot of layers, you're multiplying fractions with fractions, and fractions multiplied with fractions multiplied with fractions multiplied with fractions become very small numbers, and eventually your computer can't tell those tiny differences apart anymore, and you have problems. now when it comes to this weights thing, the way to approach it is: pick random starting weights, yes they are all wrong, but one of them is more wrong than the others, and then off we go, and i've got a footnote for you with some advice on how to set the weights.

okay, so the pros and cons of neural networks. the pros are that these complex transformations have so much flexibility that if everything else has failed, these still might succeed, they are best at fitting. so what's the con? what's best at fitting is also best at overfitting, exactly. and not just that, remember the limbo, the overfitting limbo that you can get stuck in? well, to work with these things takes more engineering effort and more computing resources, and so now you are riding this limbo merry-go-round, but you are riding it slowly, very slowly, and that is a painful existence. so you don't want to start with these if you have no reason to suspect that everything else is useless, you only want to be playing this game if you're pretty sure all the simpler stuff isn't going to work out for you.

now, how do we pick the architecture in practice? there's all kinds of advice out there on the internet, things like: for input units it's going to be your number of features, for output units the number of label classes, and for the hidden layers, start small. why start small with the hidden layers? overfitting, even here, even within this class of algorithms, fewer hidden layers is a simpler algorithm than more hidden layers, and if you don't believe me, do that homework exercise of do i hate cassie yet and see just how much more complex the recipe gets as you add each layer. and for hidden units there's all kinds of crazy cool different architecture options, some simple ones would be the same number of units in every layer, so like a pipe, or like a triangle, a funnel, where you start big and you get fewer and fewer as you get to the solution, but there's many others. why even listen to me when you can play with it yourself? you're like, i don't want to play with it myself, because you just told me it's so much effort to try coding it up. we have a fun thing called tensorflow playground, you can click on that link and then you go be the architect, and without writing a single line of code you can go and see how different architectures affect the solution. so you can say, i want (plus, plus) four hidden layers, and i want to put eight neurons in the first one, and i want to put two in the second one, and three in the third one, and so forth, and you're like, what activation function do i want, the hipster one or the standard one or whatever else, you take that from the drop down, and you see how that affects your results on your output. and you'll see that there are going to be different scores for training and for testing, and you'll notice that you tend to do better, less loss, in the training than in the testing.
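if you do want to be the architect in code rather than in the playground, here is a hedged tf.keras sketch, not from the talk: a small funnel-shaped network trained on made-up binary data, with the training loss and test loss compared at the end. the data, layer sizes, activations and training settings are all arbitrary choices for illustration.

```python
# be the architect: a tiny funnel network on fake data, train vs test loss
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3)).astype("float32")   # fake binary features
y = (X[:, 0] * (1 - X[:, 1])).astype("float32")           # a made-up rule to learn
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),                    # input units = number of features
    tf.keras.layers.Dense(8, activation="relu"),   # funnel: wide ...
    tf.keras.layers.Dense(4, activation="relu"),   # ... narrower ...
    tf.keras.layers.Dense(1, activation="sigmoid") # ... one output unit for the label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, verbose=0)         # random init is the default

train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(train_loss, test_loss)   # training loss is usually the (slightly) better one
```

changing the layer sizes, the number of layers or the activation strings is the code equivalent of clicking around the playground.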
now, at the end of this section, neural networks might feel like a black box to you, and i hope that now you're okay with that, because now you understand why: there's a lot going on, lots and lots and lots of operations, and so it's okay, don't worry about it, don't try to read the recipe that comes out, especially if you have a thousand layers and in each one there's a thousand units. i mean, that's the sort of scale we're talking about with the big applications, reading that recipe is going to make no sense at all to you, and that's okay. remember, at the end of the day, the proof of the pudding is in the eating. so take the outputs, the output is the same as any other output, it doesn't matter what model made it, compare it against the value you wanted and see how you're performing.

so, neural networks, try it first if... yeah, there is a try it first section here, there are good reasons for this to be the first thing you try. those good reasons are that you know everything else is not going to work, because other people have already tried similar applications to yours and they figured out that it's useless to try to attack it with a line or a tree or whatever else. i wouldn't try something else for images, if i'm doing image classification i just go right to neural networks, i learn from the mistakes of others, and speech recognition and language translation are the same. or if you have a lot of domain expertise and you know that the relationships you're dealing with are really complicated, well, then you might want to go straight to here. so it's a good idea, if your application is something similar to what other people do, to ask, hey, for data kind of like this, which algorithms tend to work out, and then you start with the simplest among those. then, and i'm kind of loath to say this, if you have a lot, a lot of data and you don't care how much computing costs, maybe in that situation you might also go right for it, but it is an expensive way to be lazy, usually a little bit of analytics goes quite a long way to saving money and getting you decent solutions that are simple.

all right, so thank you, that's it, you have survived the day, and with 30 minutes to spare, so 30 more minutes of alibi and no one knows that you're goofing off, so enjoy. thank you so much for being here, and i'm here for questions.
Info
Channel: Cassie Kozyrkov
Views: 19,665
Keywords: DataScience, Data, DecisionIntelligence, MachineLearning, Statistics, AI, Analytics, Google, GoogleCloud, Education, ArtificialIntelligence, Decisions, Leadership, Technology, Cloud, Cassie, Cassie Kozyrkov, Tech, Google Cloud
Id: 9PBqqx38WeI
Length: 91min 1sec (5461 seconds)
Published: Thu Nov 25 2021