Jeremy Howard on Platform.ai and Fast.ai (Full Stack Deep Learning - March 2019)

Captions
I want to introduce our first guest speaker today, Jeremy Howard. Jeremy comes from fast.ai, is also involved with the University of San Francisco, and is the founder of many startups. So thanks so much.

It's a great pleasure to be here. I'm also involved with doc.ai and platform.ai, and today I'll be particularly focusing on research and results from fast.ai and platform.ai. How many people here have used the fast.ai courses, or library, or research? Most of you, OK, so I won't dwell on this. Basically, fast.ai has various online courses, and we do research, particularly into how to use fewer resources to get cutting-edge results. We got state-of-the-art results in text classification and in training ImageNet networks, and most of the results we show actually use a single GPU, generally not running for more than overnight. The ImageNet results were on the AWS cloud, running for 18 minutes at a cost of forty dollars, for instance. Everything we provide is totally free, no ads; it's not a great business model, but so be it. Our software is designed to make cutting-edge results easier to use, to get people started with deep learning, and also to support researchers. So if you're interested in any of those things, you can grab them from our website.

I want to talk about something which is about humans a little bit. Let me ask you: how many people here have been involved in some way with studying how machines learn, that is, studying machine learning? Everybody, OK. How many people here have spent some time studying the theory around how humans learn, that is, studying human learning? Very few, OK. That's what I expected; it's the normal answer to that question. But I think this represents a problem, because humans are an important part of the loop when we're building predictive models or creating things that are going to take actions. So I want to try to convince you that you should be thinking about humans, and specifically that humans do some things better than computers. In fact, humans have about 100 trillion synaptic connections, which is a lot bigger than any machines we have today, even when you network them all together. Brains run very, very slowly, but we have a massive amount of interconnect and really great software. As Pieter mentioned, machines do some things better: machines run very, very quickly and remember things really well. So right now there are lots of things that humans do better than computers and lots of things that computers do better than humans. Why would you focus on solving a problem with just one of those? How do we use them together?

For example, if you're interested in computer vision, then maybe you should be studying human vision; there are decades of research around what humans are good at seeing. For example, are we better at seeing color? Can you see which thing is different in this picture? Or are we better at seeing shape? Can you see which thing is different in this picture? Obviously we're better at seeing color; it's much harder to see that there's a red circle here. Are we better at seeing orientation, or length, or closure, or size? If you're studying vision, you need to know the answers to these questions about how human vision works, so that you can incorporate human vision into your solutions. If you're interested in this topic, and I hope you are, then check out things like "39 studies about human perception in 30 minutes", which would be a great way to get started on the massive research literature on this topic, or this fantastic book, Data Visualization: A Practical Introduction.
There are many, but the important thing to recognize is that the question of what humans are good at is very widely studied, in every field around things that humans do, such as looking at things. So for all of the work that has been done on how we get computers to look at things, there's at least an equal amount of work on how humans look at things.

This is interesting to me because I think the implication of "there are some things humans are better at and some things computers are better at" is that this whole field of AutoML, which Pieter was briefly referring to, and which is indeed rapidly moving along, has got it totally wrong. Why would we try to automate ML rather than figure out how to do augmented ML? In other words, why are we trying to figure out how to get the best results we can with no humans at all? We have humans! Why don't we figure out how to get the best results we can with humans and computers working together? It turns out that when you combine humans and computers you can get much better results than you can with AutoML, and even as AutoML improves, AutoML plus humans, combined optimally, is always going to beat just using computers, up until the point where computers are better at everything. That is going to be a long, long way away; there's no particular sign it's going to happen in my lifetime. I don't know if and when it will happen, but not for a long time. So this is why I think this is something we should be studying.

So we built a thing. We built it because we wanted to see what would be possible if we took one little narrow area and spent some serious time combining what humans are good at and what computers are good at, for one specific task, which is image classification. Here's the thing we built; it's called platform.ai. Let's see what it can do. We're going to start with some pictures of cars, tens of thousands of cars, which we've uploaded to the platform.ai system, and it basically lays them out like so. What we're going to do is try to create a model of cars based on which direction the cars are pointing. You can see that in the system we can zip through various different projections to try to find some where cars pointing in one direction or the other appear more towards one side or the other. A human is very good at rapidly figuring that out, so the human in this case has found a projection where these cars, on the whole, are generally pointing to the left. We can go through and find those, and now we've got 34 cars already labeled as pointing to the left. Oh, and there are a few that seem to be pointing to the right, so we'll just double-check those. One of the things humans are good at is rapidly identifying similarities and rapidly identifying differences. Because we're good at seeing things that are similar, we can look at a projection and quickly say, oh, there's an area of the projection where things look pretty similar; then we can zoom in to the things we selected in the projection and quickly find the ones which are different.
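The talk doesn't describe platform.ai's internals, but the general idea of laying images out by projections of pretrained-network embeddings can be sketched roughly like this; the image folder, the choice of ResNet-34, and the use of PCA for the 2D projection are all illustrative assumptions, not the product's actual implementation.

```python
# Rough sketch only: embed images with a pretrained ImageNet network, then
# project the embeddings to 2D so a human can scan them for regions to label.
from pathlib import Path

import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import models, transforms

tfm = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

backbone = models.resnet34(pretrained=True)
backbone.fc = torch.nn.Identity()   # keep the penultimate-layer activations
backbone.eval()

paths = sorted(Path("cars").glob("*.jpg"))   # hypothetical folder of unlabeled images
batch = torch.stack([tfm(Image.open(p).convert("RGB")) for p in paths[:256]])

with torch.no_grad():
    embeddings = backbone(batch)             # (n_images, 512) feature vectors

# One 2D projection the human can pan through; platform.ai cycles through many
# projections, here we just take the top two principal components.
coords = PCA(n_components=2).fit_transform(embeddings.numpy())
```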
So using this insight about human perception, that we're good at finding similarities and differences, is step one in letting us augment the computer and do some rapid labeling. We've now quickly labeled some left and right. We might also realize that it seems to be very difficult to find backs, so now we use an extra feature in the system called "find similar": we click three backs manually, click find similar, and it pops up a whole bunch of images that hopefully are similar to those in some way. Here we're leveraging another insight, which is that humans have judgment: in this case the human realized, oh, there's a certain type of car I'm not finding very well, so I will specifically ask the computer to do a particular kind of model, which is to find things similar to the ones I've clicked on. Once we've done that, we can zip through other projections and again try to find more. Now we're having trouble finding fronts, so we can click on a few fronts and click find similar. It makes sense, right? Fronts and backs do look a bit the same. This was based on a pretrained ImageNet model, and pretrained ImageNet models are specifically trained to be position invariant, so it's not surprising it would have trouble. But we're giving it a few examples, and a few hints, to get started. You can see the human and the computer are kind of having a conversation here, but because this is a vision model we're trying to build, the conversation is a vision conversation, where we're moving things around and selecting things, and the computer is trying to say, oh, I see what you're doing there.

So we've now clicked train, and we've already got a model that's 92% accurate. We started with something with no labels at all, trying to classify a hundred thousand images, and very quickly we've managed to label just a hundred or so, a hundred and twenty, using this kind of semi-automated system, this dialogue between the computer and the human. And, going back to show a step we quickly flicked over: this system has allowed us to go more quickly than we could otherwise, because after training, kind of pre-training in a sense, we actually get better embeddings, so one of the nice things is that the next set of labels can be done more quickly. But here's a cool trick. We're still having trouble with fronts and backs, so we can click front, click back, and then say: find a projection that maximizes the difference between those two. Again, our part of the dialogue here is to say, oh, we're having a lot of trouble finding interesting fronts and backs, so please, computer, find a projection that maximizes that difference. So this is how the system works.

Here's another example, working with faces. We were trying to separate female faces and male faces; again, ImageNet is not particularly designed to do that, but when we put it into the system, it turned out the first projection it came up with was already pretty good at separating men's faces and women's faces. So we could select a whole lot at once, go through and say, OK, they're all women, after double-checking that they were right, and then grab a whole bunch more and say these all look like they're men.
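The talk doesn't say how platform.ai computes a projection that "maximizes the difference" between two labeled groups; one classic way to sketch the idea is linear discriminant analysis on the same embeddings, with the labeled indices below being purely hypothetical.

```python
# Sketch of "find a projection that maximizes the difference between two groups".
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

embeddings = np.random.randn(100, 512)      # stand-in for real penultimate-layer activations
labeled_idx = np.array([3, 17, 42, 51, 60, 77])   # hypothetical images the human has labeled
labels = np.array([0, 0, 0, 1, 1, 1])             # 0 = front, 1 = back

lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(embeddings[labeled_idx], labels)

# Project *all* images onto the discriminating direction: sorting along this
# axis pushes likely fronts to one end and likely backs to the other, so the
# human can label whole regions at once.
axis = lda.transform(embeddings).ravel()
order = np.argsort(axis)
```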
So you can see, again, we're using our "these look similar" perceptual competency, and then we can zip through and use our "which ones look different" perceptual competency to find the ones that actually aren't men. We can very quickly label this dataset by men versus women. Once we've done that, we can look at some other projections to see what else is going on in this dataset. One of the nice things about this visual approach is that it lets you find things that might be wrong, or things that stand out. In this case this bunch popped up and we thought, oh, there are actually other ways we could categorize this. So this kind of conversation lets us identify: here are people with sunglasses. I wasn't originally going to label sunglasses, but now that I think about it, that's an interesting category to create as well. With this kind of dialogue we can do that. Then when we train the model again, it gives us projections that are a lot stronger than the initial ones, and we can do the next set of labels much more quickly. And look at that: it's really separated out the two groups incredibly well now. So if we now say, try to find a projection that separates men and women as much as possible, it's going to do a really fantastic job. We can quickly label a few more, and since it's already such a good model, it has nearly everything nicely separated for us, so we can very quickly add a whole bunch more labels, grab the ones on the other side, where hopefully they're largely women, and quickly check: yep, they're pretty much all women. Now suddenly we've labeled four hundred of each, and then we can say, separate men and women, and you can see it's pulled them totally apart.

The idea of this is to ask: what happens when you spend a bit of time seriously studying what humans are good at and what computers are good at, and then try to create a system that explicitly brings those two things together? It's a different way of thinking about things.

I want to talk more about this idea of augmented machine learning, but instead of talking about training one particular kind of model, computer vision classification models, I want to talk about ways in which data scientists can work with computers to train models more generally. Most of the research I'm showing here actually came out of something called fellowship.ai, so thank you to the fellowship.ai fellows. This is a program where folks who generally have PhDs, or strong numeric or technical competencies from other fields, and who want to get into deep learning, can do a fellowship, generally built around some kind of research project. Here is an example of research from some of the fellows: optimizing hyperparameters for image datasets in fastai. The starting point is to say: instead of using 100x compute, or 1000x compute, or whatever that Quoc Le slide that Pieter showed was, what if we used 1x compute and 1.1x human? Maybe I don't want to pay Google for a thousand times more compute; instead I'd like to spend an extra three minutes of my own time trying to figure out a good set of hyperparameters.

Anybody who's done the fast.ai course already knows this picture very well, which is: how do you set a learning rate? Rather than trying a thousand different learning rates, training a model each time, and giving Google five thousand dollars for the compute you just spent, why not run 100 mini-batches, which takes about 45 seconds, gradually increasing the learning rate, and see what happens to the loss? This generally takes less than a minute, and you can immediately find the part where the slope is as steep as possible. This is a really simple method from a researcher called Leslie Smith at the Naval Research Lab: the learning rate finder is a very simple but effective approach to finding learning rates that takes an extra minute of your time but avoids the 1000x compute, so you can do it on your own GPU at home.
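fastai ships this as `learn.lr_find()`; as a minimal sketch of the bare mechanism in plain PyTorch (not fastai's implementation), it might look like this:

```python
# Run ~100 mini-batches, multiplying the learning rate each step, and record
# the loss; pick a rate a bit below where the loss drops most steeply or
# starts to blow up.
import math

import torch

def lr_find(model, loss_fn, data_iter, lr_start=1e-7, lr_end=10.0, num_steps=100):
    opt = torch.optim.SGD(model.parameters(), lr=lr_start, momentum=0.9)
    gamma = (lr_end / lr_start) ** (1 / num_steps)   # multiplicative LR step per batch
    lrs, losses = [], []
    for _, (x, y) in zip(range(num_steps), data_iter):
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        lrs.append(opt.param_groups[0]["lr"])
        losses.append(loss.item())
        if not math.isfinite(loss.item()) or loss.item() > 4 * min(losses):
            break                                    # loss has blown up: stop early
        for g in opt.param_groups:                   # raise the LR for the next batch
            g["lr"] *= gamma
    return lrs, losses                               # plot loss vs. LR on a log x-axis
```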
So in this research the fellowship.ai fellows basically asked: how many of these things can we just jump straight to? How much massive hyperparameter search do we actually need to do? They tried various values for the different important hyperparameters, and they tried them on a variety of vision datasets. These datasets varied a lot, and we specifically measured that, because we wanted to do transfer learning starting from ImageNet: we created a metric, the cosine proximity between the penultimate-layer activations of a pretrained ImageNet network, computed on each of these datasets, so we can actually say how similar each dataset is to the one the pretrained model was trained on. You can see, for example, that the Draw dataset was not very similar at all, but the Dota dataset was pretty similar; this is not DotA as in the game, this is, I think, a dataset of home interior photos or something like that.
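The exact formulation of this similarity metric isn't given in the talk; one plausible sketch, assuming it compares average penultimate-layer activations, might look like this (the batches below are random stand-ins for real images):

```python
# Rough sketch: average the penultimate-layer activations of a pretrained
# ImageNet model over each dataset, then compare with cosine similarity.
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet34(pretrained=True)
backbone.fc = torch.nn.Identity()        # expose penultimate-layer activations
backbone.eval()

def mean_embedding(batches):
    """batches: iterable of image tensors shaped (n, 3, 224, 224)."""
    feats = []
    with torch.no_grad():
        for x in batches:
            feats.append(backbone(x))
    return torch.cat(feats).mean(dim=0)

reference_batches = [torch.randn(8, 3, 224, 224)]   # stand-in for images like the pretraining data
target_batches = [torch.randn(8, 3, 224, 224)]      # stand-in for the new dataset

similarity = F.cosine_similarity(
    mean_embedding(reference_batches), mean_embedding(target_batches), dim=0
)
print(f"dataset similarity to pretraining data: {similarity:.3f}")
```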
Here's the result. When you look at lots of different hyperparameters you could tune, and lots of different values you could set them to, across all of these datasets: this row is basically all of the defaults you get out of fastai, plus the learning rate finder set at the place where the slope is steepest. Red represents things being much worse, so this one, for example, is a totally different learning rate for the different datasets, and green represents things being much better. You can see there are almost no spots which are significantly better than the defaults you get out of the box with the fastai software. You can get an extra 1.7 percent on one particular dataset if you use one particular learning rate, but other than that, there's not even an opportunity to get an extra 1% out of this. So all of this discussion about massive grid searches and so on is ignoring the fact that we now know a really good set of defaults that just work nearly all of the time for transfer learning. And generally speaking you want to use transfer learning; it's very rare that you would want to start with random weights, because somebody must have trained a model that's at least vaguely similar to what you want. In this case "vaguely similar" is not that similar at all: these are medical images from the cancer dataset, or pictures of the Amazon rainforest, and they're not at all similar to the pretrained network we started with.

So you can actually train really good models, really reliably, basically without doing any special grid search or hyperparameter tuning. To give you a sense of where that takes you: we show this basic approach in week one of a class where the people who take it have at least a year of coding background, but not necessarily any ML background at all. After week one we say, hey, here's a forum thread, tell us what you did this week, and so far that thread has 1,067 replies. People keep popping up and saying: I live in Trinidad and Tobago and I built a classifier for masqueraders versus Islanders; I built one for cucumbers versus zucchinis; I built one for figuring out which city a satellite image is of; I built one for figuring out which kinds of buses there are in Panama City; I built one for figuring out different types of batik cloth. And nearly all the time the response is: I got approximately 100 percent accuracy on a held-out validation set. Generally speaking, nearly every student who does this says after week one that the thing they tried worked basically perfectly, with a hundred or two hundred images. Researchers like to focus on what doesn't work yet, which is very sensible, because they're researchers: they want to say, here's something that doesn't work, let's figure out how to make it work. But our research is very much about what does work, and actually a lot of things just work without much trouble.

More examples: we built something to find, from satellite images, houses in a state of disrepair. And to give a sense of how good these models are, it's not at all unusual for people after week one of this course to say: I discovered that the thing I was looking at was the subject of academic research in the last couple of years, and I have a new state of the art. One student found a new state of the art on accuracy of Devanagari text classification. Another looked at spectrograms for environmental sound classification and found they easily beat the state of the art on that. This person improved the baseline of the software their lab used for tumor-normal sequencing by 600%, and this was software their lab was paying for. So just using some defaults, doing some basic transfer learning, and using some simple human-augmented techniques like the learning rate finder will very often give you extremely good results.

There are other things you can tune. If you've got a particularly small dataset you might want to do some data augmentation. Does that mean you need to search over all the different data augmentation parameters like Google did with their AutoAugment paper, where I think they again spent millions of dollars of compute time? It turns out that with a bit of simple common sense you can do it extremely quickly, with something we call TTA search. TTA stands for test-time augmentation, and it refers to the idea that at inference time you can try a few different augmentations: you randomly augment your image, say five times, and average the predictions. It turns out that if you pre-train a model just a little bit and then compare different augmentation approaches using only TTA, not retraining the model at all, the thing you get back tells you the very nearly optimal training-time augmentation as well.
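fastai has TTA built in; as a minimal sketch of the underlying idea in plain PyTorch, assuming a torchvision version whose transforms accept tensor images:

```python
# Test-time augmentation: run a few randomly augmented copies of an image
# through the trained model and average the predictions.
import torch
from torchvision import transforms

tta_tfm = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])

@torch.no_grad()
def tta_predict(model, image, n_aug=5):
    """image: a (3, H, W) tensor; returns class probabilities averaged over augmentations."""
    model.eval()
    batch = torch.stack([tta_tfm(image) for _ in range(n_aug)])
    probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0)
```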
So if you use this TTA search approach, then across the six datasets we looked at, one of them was basically no better, neither better nor worse, and one of them was a bit worse. The reason CIFAR-10 was worse was that the researchers in this case forgot to add the particular kinds of augmentation you actually need for CIFAR-10, which is to add a bit of padding and do some random cropping. But on all the other datasets they got quite a significant improvement from this approach. So again, it's not a lot of compute at all; it's just taking some simple little shortcuts.

When you put these simple little shortcuts together: another group of the fellowship.ai fellows tried this on the Food-101 dataset, and they got a new state-of-the-art result. This is interesting because this dataset is pretty widely studied, and the previous state-of-the-art result for top-1 accuracy was an ensemble network that was specifically designed to be good at food. In this case the approach they used took half the time to train, didn't have any dataset-specific architecture or design or training, and it was a little bit better than the state of the art on top-1 and a little bit worse on top-5. One of the things they did here was that at inference time they used TTA, which can really help a lot; it's one of these underutilized little tricks. Another thing that made this much faster, compared to the previous state of the art, is that not only did they do half as many epochs, each epoch was also much faster, because they used something called progressive resizing. Those of you who followed our speed records on ImageNet training might have seen that we used that as well. It's another really simple, really obvious idea: if you're training for 90 epochs and you're looking at 224 by 224 pixel images, why always look at 224 by 224 pixel images? Why not start at 64 by 64 pixel images, since those are so much faster? So they trained most of it at 64 by 64, then a little bit at 128, a little bit at 192, and just did the last few epochs at 224 by 224. When you do it that way, you'll find it can be four or five times faster, and you get better results. The reason you get better results is that you've introduced a new kind of data augmentation that is very hard to overfit to, because you're literally changing the size of the image you put in. So progressive resizing is another of these simple little tricks you can use to get state-of-the-art results with a single GPU training for an hour or so. They describe the training procedure in this blog post; these are all posts from the platform.ai blog. Basically they're saying: we just used the same approach that fast.ai lesson one shows, and the reason for that is that everybody takes the fast.ai course before they do the fellowship program, so they just used that simple technique.
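As a rough sketch of the progressive-resizing idea in plain PyTorch: the size schedule, the `images` input (a list of PIL image/label pairs), and the bare training loop below are illustrative assumptions, not the fellows' actual code; fastai does this by rebuilding the data pipeline at each size.

```python
# Spend most epochs at small image sizes, and only the last few at full size.
import torch
from torchvision import transforms

def loader_at(size, images, batch_size=64):
    tfm = transforms.Compose([transforms.RandomResizedCrop(size), transforms.ToTensor()])
    data = [(tfm(img), label) for img, label in images]
    return torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

def fit_progressive(model, images, opt, loss_fn,
                    schedule=((64, 20), (128, 10), (192, 5), (224, 3))):
    for size, epochs in schedule:            # small images first: much cheaper epochs
        dl = loader_at(size, images)
        for _ in range(epochs):
            for x, y in dl:
                loss = loss_fn(model(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model
```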
There was another interesting example: some of the fellows looked at fashion classification. Fashion classification is another area that is massively studied; it's a very lucrative area, and pretty much every fashion retailer has fashion-specific convolutional neural network models. Again, in this case the fellows outperformed the previous state-of-the-art approaches, and they even outperformed humans; in fact, on some classes they even outperformed what they call "savvy humans", that is, fashion specialists, because they compared the results to people who specialize in fashion classification.

Again, in terms of things that do work: you can go a long way with simple heat maps. When they looked at the heat maps for their fashion model, asking which bits of the input it was actually looking at, they found the model was looking at exactly the appropriate areas, and we find this in medical imaging as well, all the time. I mention this partly because heat maps are such an incredibly easy thing to create. There have been a lot of people in the research community recently saying they're not all that, and again, there are some specific cases, if you look really carefully, where they don't work so well, but most of the time, in the real world on practical problems, I've never had a problem with heat maps; they've always worked great.

Also in the category of how to train things quickly and cheaply: everybody needs to know about one-cycle. One-cycle is another great technique from Leslie Smith. On the left here is a plot, over batches, of the learning rate, and on the right is a plot, over batches, of momentum. The basic idea is that you should warm up the learning rate slowly, spending about 30% of your epochs on the warm-up and about 70% on the cooldown, and have your momentum do the exact opposite: when the learning rate is really high, the momentum should be really low, and vice versa. This makes sense because at the start you've got all these random weights, and it's really hard to get started; if you've ever tried to train neural networks really fast, you'll notice that at the start it's really easy for the weights and the gradients to zip off to infinity or to zero. So the trick is to start your learning rate really low but your momentum really high. For those parts of the weight space where it has found a reasonable region with consistent gradients, it'll get faster and faster, which is what you want, and those activations that haven't found a nice part of the weight space will just slowly bump around trying to find something better. This works really, really well, and you can train things three or four times faster if you use this approach.
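A minimal sketch of the one-cycle schedule using PyTorch's built-in scheduler (fastai exposes the same idea as `fit_one_cycle`); the stand-in model, step count, peak learning rate, and momentum range below are illustrative assumptions:

```python
# Learning rate ramps up for ~30% of the steps then anneals down, while
# momentum does the opposite.
import torch

model = torch.nn.Linear(10, 2)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

steps = 1000
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt,
    max_lr=0.1,           # peak learning rate, e.g. chosen with the LR finder
    total_steps=steps,
    pct_start=0.3,        # 30% of steps warming up, 70% cooling down
    cycle_momentum=True,  # momentum mirrors the LR: low when LR is high
    base_momentum=0.85,
    max_momentum=0.95,
)

for _ in range(steps):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))   # dummy batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()          # advance both the LR and the momentum schedule
```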
Interestingly, before Leslie Smith created this approach, he wrote a paper on something he called the super-convergence phenomenon. He realized that there were certain kinds of models and certain situations where he could make the learning rate ten times higher than anybody had trained with before, and as a result train up to ten times faster, but he had no idea why it was happening. So he published a paper saying: here's an experiment I did, here's this weird result, I don't know what's going on, can anybody help me? I want to point out that in most sciences this is how things work. An experimentalist says: I did this thing in the Large Hadron Collider and this is the result that came out, and I don't know what's going on. Then the theoreticians study the experimentalists' results and say: here's what I think is going on, and here are some experiments you could run. Unfortunately, in the deep learning community that doesn't happen. It's really, really rare to do what Leslie did and publish a paper saying, here's an experiment for which we don't have a theoretical answer, and it's unheard of, to my knowledge, for theoreticians to look at experimental results and try to figure out what's going on. So nobody has yet properly studied super-convergence, as far as I know; luckily Leslie Smith studied it a bit himself and figured out how you can reliably get it. For those of you interested in research, I think one of the big opportunities is to follow experimental results closely and become a top experimentalist, with the goal of then studying theory to figure out why surprising experiments happen. If you're interested in this, I have a long list of experimental results which are entirely unexplained at this point, and if we could explain them, we could train neural networks a lot better.

Another thing for getting neural networks to train reliably and quickly: when you start a neural network, particularly if you're doing transfer learning, some of your layers have random weights and some of your layers have good weights, so you should always train the random weights more. There are two ways to do that. One is that when you start fine-tuning a network, you freeze all of the layers that are pre-trained and only train the random weights. The other is to use a different learning rate for different layers, so that the layers earlier in the network have lower learning rates and the ones later in the network have higher learning rates. Both of these result in the later layers getting trained more. These are some of the tricks that get you reliably good, really really good, results quickly.
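A rough sketch of those two tricks in plain PyTorch, using a torchvision ResNet as a stand-in pretrained model (fastai wraps these as freezing and discriminative learning rates); the layer groupings and learning rates below are illustrative assumptions:

```python
import torch
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new random head for 10 classes

# Trick 1: freeze the pretrained body and train only the random head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")
head_opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)

# Trick 2: unfreeze, but give earlier layers smaller learning rates than later ones
# (the stem layers are omitted from the groups here for brevity).
for p in model.parameters():
    p.requires_grad = True
param_groups = [
    {"params": list(model.layer1.parameters()) + list(model.layer2.parameters()), "lr": 1e-5},
    {"params": list(model.layer3.parameters()) + list(model.layer4.parameters()), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},      # the new head trains fastest
]
full_opt = torch.optim.Adam(param_groups)
```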
The other thing I'll mention is that in the academic research community, vanishingly few researchers know how to train neural nets properly like this. So in general, if you come across a paper that says "X doesn't work", particularly if X is anything in the field of transfer learning, take that with a very strong pinch of salt, because it's very likely that the researcher just didn't know how to make it work. On the other hand, if you see something that says "X does work", that's reliable; they've definitely gotten it to work. There have been a lot of papers in the last few years where people say "here's a thing that doesn't work", and we've checked it, and it just turns out they weren't training their networks properly. So be cautious of that.

You may think RNNs are different, but actually the basic tricks that work for computer vision and convnets also work for RNNs. We did a big research study on this, done by Sylvain Gugger, who's our research scientist, and he found that basically the same approach works for every dataset we tried it on, and that is to use AdamW. Another thing you might have read is that Adam doesn't work very well; it actually does work very well, the problem is that most people did weight decay wrong. As you may be aware, there are two ways you can do weight decay. One is true weight decay, where when you do the weight update you actually subtract out a little bit of the weights as well. The other is to add L2 regularization to the loss function. In theory those two things are the same; in practice they're not, because you also have things like momentum and Adam's gradient-squared divisor in there. So Frank Hutter and his collaborators put out a paper a year or two ago pointing this out, and pointing out that if you do weight decay correctly you get fantastically good results; this is called AdamW, or, in the newer version of the paper, decoupled weight decay.

There is one trick to be aware of, and this paper just came out in the last three or four days: if you're training for a lot of epochs with Adam, the last few epochs tend to go pretty badly, and so there's a little fix you need. The trick the paper suggests is to clip the gradients so they don't get too high. Basically, what can happen with Adam is that because you're dividing by the moving average of the gradient squared, the thing you're dividing by can be a really small number, so you can get some really big updates, and they discovered there are datasets where that happens. So you can either clip the gradients, or, something I've been talking about for a while, you can take that epsilon parameter in Adam and gradually increase it, anneal it upwards. I don't think anybody has published this yet, but it is actually mentioned in the TensorFlow documentation: when they were training the Inception network at Google, they noticed that they could use Adam if they set epsilon to some reasonably big number. If you haven't come across epsilon before, there's a good reason for that: in the original Adam paper they said epsilon is a number you should set really small, like 1e-8, basically to avoid numerical instability. But when you look at the Adam equation, it can do more than that: if you set epsilon to something like 1, that basically means a really small gradient-squared value can never cause the updates to blow out. So these are two ways you can avoid this problem, and with these two tricks, AdamW plus either clipping gradients or increasing epsilon, you can get networks to train very, very quickly, and as accurately as with non-Adam approaches.
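A brief sketch of those safeguards in plain PyTorch; the weight-decay, epsilon, and clipping values below are illustrative assumptions, not recommendations from the talk:

```python
# Decoupled (true) weight decay via AdamW rather than L2 added to the loss,
# plus gradient clipping; a larger eps is the alternative safeguard mentioned above.
import torch

model = torch.nn.Linear(10, 2)                           # stand-in model

# torch.optim.AdamW applies decoupled weight decay, as in Loshchilov & Hutter.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2,
                        eps=1e-8)                        # try a larger eps if late epochs misbehave

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))   # dummy batch
loss = torch.nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep the update from blowing up
opt.step()
```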
OK, so if any of those things sound interesting, you should come along to our next course at the University of San Francisco, where we're going to be doing a deep dive into research-level questions, such as the kinds of research questions that come out of this, as well as a deep dive into some of the engineering approaches you can use to try to solve these problems. And that's it; I think I've got five minutes for questions, if anybody has any.

[Audience question] Yeah, so the question is: can we handle image sizes other than the original image size the model was trained on, ImageNet's or whatever? The answer is yes, and it's actually very easy to do. The layer before the penultimate linear layer of a convolutional net is an average pooling layer in pretty much all modern networks, and if you have a 224 by 224 input, the grid size before that is generally seven by seven. Even the default PyTorch models right now, and I don't know if it's still true of TensorFlow but it was for a long time, basically have a seven by seven average pooling layer there, and that means the model can't work on different-sized inputs. But both Keras and PyTorch have something called a global average pooling layer, or an adaptive average pooling layer, which says: don't do seven by seven, do whatever pooling is necessary to come out with a one by one result. If you use the fastai library we do that automatically: we convert every model, using this approach, into something that works on any size of input. So the answer is yes, we already have a solution to that, and this is why progressive resizing works. Actually, if you do progressive resizing when you train the original model, you end up with a model that is much more resilient to different sizes at inference time, so that works as well. Another thing we've done is to also handle rectangular inputs; again, if you use adaptive average pooling you can automatically handle rectangular inputs, and that obviously makes a lot more sense than squishing everything into a square.
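A small sketch of that point in plain PyTorch; note that recent torchvision ResNets already ship with an adaptive pooling layer, so the explicit swap below is only for illustration:

```python
# Swap a fixed 7x7 average pool for an adaptive pool so the same network
# accepts other (even rectangular) input sizes; fastai does a conversion
# like this automatically.
import torch
from torchvision import models

model = models.resnet34(pretrained=True)
model.avgpool = torch.nn.AdaptiveAvgPool2d(1)      # always pool down to 1x1, whatever the grid size
model.eval()

with torch.no_grad():
    out_224 = model(torch.randn(1, 3, 224, 224))   # the size it was trained at
    out_288 = model(torch.randn(1, 3, 288, 288))   # a larger square input
    out_rect = model(torch.randn(1, 3, 224, 320))  # a rectangular input
print(out_224.shape, out_288.shape, out_rect.shape)  # all (1, 1000)
```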
Any other questions?

[Audience question about TensorFlow versus PyTorch] We switched from TensorFlow to PyTorch a year or two ago, because at that time PyTorch was just so much easier to work with, particularly for research-level work. With TensorFlow 2 things are changing a lot; TensorFlow now has eager mode, which was one of the big differences for me. I still find TensorFlow far more awkward to use, and they tend to answer the same question in seven different ways in different parts of the code base, so particularly for research I find it pretty hard to work with. But TensorFlow is definitely going in the right direction. For me, I still prefer PyTorch because I can get more done, more quickly, more reliably, but they're catching up.

One more question? Yeah, sure. So the question is about putting fastai and PyTorch code into production. We use Jupyter notebooks a lot, because as an interactive prototyping tool they're great for doing research and development and for learning, and you can do a lot of your development there. But at some point you want to take that model further, and there are a couple of reasons you might move out of the Jupyter notebook. One is that you might want to train a much bigger version, make sure it reliably handles machines disappearing and so on, or do some kind of distributed computation. The other is that you eventually want to take a trained model and put it in production. I actually recently created a new project called fastec2 which handles the former piece: it basically lets you train things on AWS instances really easily. So I tend to do most of my work in Jupyter, and then when I want to train the big version of the model, I export it as a Python script and run it through fastec2. For production, the vast majority of people I know use CPU compute, not GPU compute, because if you want to do inference in production on GPUs, you need to batch up all of the incoming requests into batches large enough for the GPU to handle well, and for most situations it's unlikely you're at a scale where that's going to give you the latency your users are looking for. So generally, for CPU inference, we see people just creating a normal Flask, or nowadays a Starlette async REST endpoint, or whatever, using standard web stuff, and you can horizontally scale that really easily. If you're Facebook or Google or something, then you definitely do have the scale to do GPU-based inference, in which case you might want to use some of the Caffe2 / ONNX stuff, but I don't personally know of anybody outside those giant companies getting value out of that at the moment.

OK, thanks everybody. [Applause]
Info
Channel: Full Stack Deep Learning
Views: 8,020
Rating: 4.9448276 out of 5
Keywords: ml, machine learning, software development, artificial intelligence, deep learning, ai, fast.ai
Id: hZd3X_nGdew
Length: 44min 43sec (2683 seconds)
Published: Tue May 07 2019