How far can we scale up? Deep Learning's Diminishing Returns (Article Review)

Captions
Hi there. I saw this article in IEEE Spectrum called "Deep Learning's Diminishing Returns: The Cost of Improvement Is Becoming Unsustainable", by Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso, and I thought it was an interesting read because it talks about the computational limits we're reaching with deep learning today. I have it here in annotatable form, though it might not look as pretty. The article leads up to the point where it shows just how much compute will be needed to make further improvements in deep learning, what the consequences of that might be, and some of the ways people are trying to get around it. Now, I don't agree with everything the article says, but I think it's a pretty neat read, and it's pretty short, so I thought we could talk about it a little bit.

The article starts out by essentially praising deep learning for achieving so many things: translating between languages, predicting how proteins fold, playing games as complex as Go, and many other things. They say it has risen relatively recently but has a long history. They mention that in 1958 Frank Rosenblatt at Cornell designed the first artificial neural network, and that Rosenblatt's ambitions outpaced the capabilities of his era, and he knew it. Apparently he said that as the number of connections in the network increases, the burden on a conventional digital computer soon becomes excessive.

So why are deep neural networks working now? Because, of course, computers have increased in power massively. Just in raw computing power there has been something like a ten-million-fold increase according to Moore's law, and that's usually measured in something like CPU instructions. And we have gone even beyond that, building special-purpose hardware such as GPUs, which aren't actually special-purpose for this, but also TPUs. They say these more powerful computers have made it possible to construct networks with vastly more connections and neurons, and hence greater ability to model complex phenomena, and of course these are the deep neural networks that power most of today's advances in AI.

They draw a comparison here: like Rosenblatt before them, today's deep learning researchers are nearing the frontier of what their tools can achieve. Essentially they claim we are in a similar situation today. We have models that can achieve things, and we know pretty much that scaling them up increases performance, but we're at the limits of how much we can scale. For example, I reported on this: Sam Altman apparently said GPT-4 will not be much bigger than GPT-3. It will be trained more efficiently, it will have some smartness in how it's processed, it will use more compute, but it will not necessarily be that much bigger in scale.

The first thing the article touches on is the fact that deep networks are overparameterized. For example, the Noisy Student model has some 480 million parameters, yet is trained on only 1.2 million labeled images, which is the ImageNet dataset. Now, the Noisy Student model, if I understand correctly, also leverages unlabeled data, but granted, today's neural networks are massively overparameterized: they have more parameters than data points available, so classically they should horribly overfit, but they don't. They say that classically this would lead to overfitting, where the model not only learns general trends but also the random vagaries of the data it was trained on, and that deep learning avoids this trap by initializing the parameters randomly and then iteratively adjusting sets of them to better fit the data, using a method called stochastic gradient descent. Surprisingly, this procedure has been proven to ensure that the learned model generalizes well.
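Just to make concrete what that recipe looks like, here is a minimal toy sketch of random initialization plus stochastic gradient descent, fitting a one-dimensional linear model with NumPy. This is obviously not the article's setup and not a deep network, just the bare mechanics of the procedure they describe.

```python
import numpy as np

# Toy illustration of the recipe above: random initialization followed by
# stochastic gradient descent. We fit y ≈ w*x + b; deep networks do the same
# thing, just with millions or billions of parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=1000)   # synthetic data

w, b = rng.normal(), rng.normal()                  # random initialization
lr = 0.05
for step in range(2000):
    i = rng.integers(len(x))                       # pick one example at random
    err = (w * x[i] + b) - y[i]                    # prediction error
    w -= lr * err * x[i]                           # gradient step on w
    b -= lr * err                                  # gradient step on b

print(w, b)   # should end up close to the true values 3.0 and 1.0
```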
Now, I'm pretty sure we don't yet fully understand why deep networks don't overfit, or why they generalize as they get more overparameterized. I know there are some proofs around SGD and so on, but those proofs usually require assumptions that make them lose touch with reality. Still, the core message is true: deep networks are overparameterized, and that is probably one of the reasons they work so well. And being overparameterized, they are quite flexible.

They say the good news is that deep learning provides enormous flexibility; the bad news is that this flexibility comes at an enormous computational cost, and this unfortunate reality has two parts. The first part, they say, is true of all statistical models: to improve performance by a factor of k, at least k² more data points must be used to train the model. Does this really hold for all statistical models? Is this from the same theory that says statistical models should overfit when they're overparameterized? I'm not sure. The second part of the computational cost, they say, comes explicitly from overparameterization; once accounted for, this yields a total computational cost for improvement of at least k⁴, meaning that for a ten-fold improvement you would need to increase the computation by a factor of 10,000.

Now, regardless of whether you think this theoretical analysis is accurate (again, it comes from the same area that says these models should overfit horribly), it doesn't matter too much, because these people have actually collected data. They say: theory tells us that computing needs to scale with at least the fourth power of the improvement in performance; in practice, the actual requirements have scaled with at least the ninth power. So when you actually measure how much people have to scale computation to achieve a given performance, it's much worse than the theory predicts.

In fact, they have these neat graphs. On the left you can see the percent error, I believe on the ImageNet classification dataset, and on the other axis you can see time. Over time, as new state-of-the-art models were proposed, the error has come down and down, ever since the 2012 success of AlexNet, and if you extrapolate that, you can pretty clearly see that around 2025 we should be at approximately 5% error. See, I thought you actually had to do something to reach a new state of the art on ImageNet, but as it turns out, we just need to sit here and wait until 2025.
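By the way, just to spell out what those exponents mean in practice, here is the trivial arithmetic behind the article's claim; nothing more than raising a ten-fold improvement to the fourth and ninth powers.

```python
# The article's scaling claim in numbers: improving performance by a factor k
# costs at least k**4 compute in theory, and roughly k**9 empirically.
k = 10  # a ten-fold improvement
print(f"theoretical compute multiplier (k^4): {k**4:,}")   # 10,000x
print(f"empirical compute multiplier  (k^9): {k**9:,}")    # 1,000,000,000x
```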
Okay, jokes aside, they overlay this graph with another one: again percent error on the y-axis, but now, instead of the year in which the achievement was made, the x-axis is the number of computations in billions of FLOPs, and notice the log scale. Now, I have to say, this graph makes it look like there might be something like a relationship, maybe even a linear relationship, that you can extrapolate. I'm not so sure: these models are up here, then it goes down here, then over here, and then way out to the 2020 point, and without that point you would probably fit a line with a rather different slope. In any case, if you do draw the line they're drawing and extrapolate it to the 5% error rate, you end up at something like 10^18 FLOPs.

They also compare this to equivalent CO2 emissions. Right now, to train a current model once, we are somewhere between the CO2 generated by the average U.S. resident in one year and the CO2 generated by the average U.S. resident in a lifetime. If you extrapolate to the 5% error rate, to the 10^18 FLOPs, it suddenly becomes the CO2 generated by New York City in one month. So the entire city of New York for one month, just so the GPUs can go brrr to train an ImageNet model. That is pretty shocking, I have to say. And you know, it checks out: they have done the research, they extrapolated correctly here, and I'm sure the CO2 equivalents are measured correctly and so on.

I do have several problems with this, though. The first one I already mentioned: the zigzag in this graph doesn't really suggest that you can simply extrapolate over these advances. Also, the 2020 point seems to be quite an outlier. If there was any architecture search involved, or any giant pre-training, or anything like that, I'm sure it adds to the CO2 emissions, but it doesn't mean you cannot achieve the same thing with something else. So whether the slope of the line is really the black one they draw or more like the blue one I drew makes quite a bit of a difference; in fact, it makes an exponential difference. So I'm a bit doubtful that you can really pinpoint this 5% error point five years in advance. Okay, it's 2022 now, so three years, but still.

And speaking of CO2 equivalents: not all energy is equal. For example, Google prides itself on being zero-emission, so presumably if Google trains a model there is no CO2 equivalent. Now, I think carbon neutrality and zero emissions and words like these are sometimes a bit of a scam, but still, not all energy is equal, and especially these large companies can distribute their workloads across the planet to wherever the energy is used most efficiently.

And lastly, and I think this should really be the main point: we have made advances. None of the achievements of the past years came from scaling up alone; the scaling up always came with some sort of invention that made it more efficient or more viable to scale up. Residual networks, for example, could all of a sudden scale to many, many more layers because of the invention of the residual connection, or the addition, depending on who you ask. So residual networks became bigger and deeper without having to waste more computation; in fact, they had fewer parameters than many equivalent models of the time.
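As a concrete example of that kind of invention, here is roughly what a residual block looks like in PyTorch. This is a minimal sketch in the spirit of He et al., not the exact ResNet architecture (no downsampling, no projection shortcut).

```python
import torch
import torch.nn as nn

# A minimal residual block: the block computes relu(F(x) + x), so gradients
# can always flow through the identity path, which is what lets very deep
# stacks of these blocks remain trainable.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # the residual ("skip") connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```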
So I don't think we should neglect the inventions we make along the way in order to scale up. Of course, people are always going to put in whatever FLOPs they have to achieve the best possible number, but I think for most of these advances it was really new inventions that triggered the usage of those FLOPs, rather than the other way around. And the authors of the article actually agree a little bit. They ask: is it really reasonable to extrapolate like this? And they answer that extrapolating this way would be unreasonable if we assumed researchers would follow this trajectory all the way to such an extreme outcome; we don't. Faced with skyrocketing costs, researchers will either have to come up with more efficient ways to solve these problems, or they will abandon working on them and progress will languish. Which is true. So rather than being a warning cry that we're going to waste an entire city's worth of CO2 emissions for a month on one model, it's more of a warning that we're going to have to come up with new methods and different ways of training these models, and that we can't rely on scale alone to bring us advances.

They also give some money numbers. They say, for example, that when DeepMind trained its system to play Go, it cost about $35 million. When they trained AlphaStar, they purposefully didn't try multiple ways of architecting an important component because the training cost would have been too high. In GPT-3, a mistake was made but not fixed, because due to the cost of training it wasn't feasible to retrain the model, and so on. They also mention that GPT-3 cost about $4 million to train.

Now, yes, of course training these giant models comes with substantial costs, so you have to think twice about whether you really want to do your grid search and whatnot, and the experimentation methodology has become a bit different. But you also have to keep these big numbers in perspective: $35 million, $4 million, and so on. First of all, this isn't really that much compared to what the people who worked on the model cost. And second of all, this is almost necessary: all of the models we see today cost substantially more to train in the past, but someone had to do it first. I can only train BERT today because Google invested enormous amounts of resources figuring out how to train it, training the first one at considerable cost, and only after that did other people jump on; prices have come down, training has gotten more efficient, and now I can do it from the comfort of my home, essentially on a Colab or on my home GPU. And isn't this the case with all inventions? At first it's just a few people, it's really expensive because it's custom, because we haven't figured it all out yet, and then over time costs come down, efficiency goes up, and everything gets much easier. So rather than saying "oh wow, DeepMind spent $35 million, oh no", I'm like: cool, you know, since they're doing this now, in two, three, four years I will be able to do so for simply $2 million.
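And just for some intuition about where numbers like "GPT-3 cost about $4 million" come from, here is a back-of-the-envelope sketch. Every input below is my own assumption for illustration, not a figure from the article, and the result is only meant to land in the right order of magnitude.

```python
# Rough, purely illustrative training-cost estimate. All inputs are assumptions.
total_flops = 3.1e23          # assumed total training compute for a GPT-3-scale model
gpu_flops = 100e12            # assumed sustained throughput per GPU (100 TFLOP/s)
utilization = 0.3             # assumed fraction of that throughput actually achieved
price_per_gpu_hour = 2.50     # assumed cloud price in dollars

gpu_seconds = total_flops / (gpu_flops * utilization)
gpu_hours = gpu_seconds / 3600
print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"estimated cost: ${gpu_hours * price_per_gpu_hour:,.0f}")
# A few million dollars: the same order of magnitude as the article's figures.
```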
So the article gives some possible solutions, different avenues, though they are mostly a little pessimistic about most of them. First, they say you can use processors designed specifically for deep learning. The newest generations of GPUs are actually somewhat tuned to deep learning, but there are also tensor processing units, and a number of other hardware vendors are trying to get into the space of building chips specifically for deep learning. What they criticize here is that this hardware has to make trade-offs: it trades generality for specialization, and with specialization you face diminishing returns. And of course, the more specialized the hardware, the less you can invent new things, because you're essentially locked into what the hardware can do.

They also discuss training networks that are smaller, but they criticize that this often increases the training cost, because you essentially train a big network and then train again to make it smaller, to distill it, so it's not really a solution to reducing training cost. But it might be a good solution if a model needs to be trained once and then largely runs in inference mode, such as GPT-3.

They also discuss meta-learning, where you essentially train a good initialization for a lot of problems and then transfer that initial solution to new problems. If you have a good meta-learner, you'll be at an excellent starting point for solving new problems, thereby reducing the training cost on each new problem. But they also mention, and I agree, that meta-learning is still at a stage where it doesn't really work: the training you put into the initial meta-learner often doesn't pay off on new problems. Yes, it works in papers, but in papers you already know which other problems you're going to measure it on. They say that even small differences between the original data and where you want to use it can severely degrade performance.

They also mention this paper: Benjamin Recht of the University of California, Berkeley, and others have made this point even more starkly, showing that even with novel datasets purposely constructed to mimic the original training data, performance drops by more than 10 percent. I want to highlight this a little bit, because it refers to a paper called "Do ImageNet Classifiers Generalize to ImageNet?", also usually called ImageNet v2, because what these authors did is try to follow the protocol of the original ImageNet data collection as closely as possible and come up with a new test set, the so-called ImageNet v2. It's not a training set, just a test set, and they show pretty convincingly that for any classifier, whatever its performance on ImageNet v1, its performance on ImageNet v2 will be something like 10 points lower; it's a fairly straight line. So this is what the article refers to.

However, the article doesn't mention another paper, "Identifying Statistical Bias in Dataset Replication" by MIT and UC Berkeley, which shows pretty convincingly that there is in fact a difference between the data collection mechanisms of ImageNet v1 and v2. It is a subtle difference, but a difference nonetheless, and it leads to a significant difference in what kinds of images get chosen for the two datasets. When you correct for that difference, the drop in accuracy on ImageNet v2 almost entirely vanishes. Now, okay, the article is right in the first instance: there is a small difference between the original data and the new data, and that severely degrades performance. But this particular drop is due to the new dataset having a different collection methodology, which directly makes the samples harder. It's not that the samples are different kinds of images; it's that, directly because of how they were collected, they are more difficult to classify. It's the same kind of data, just more difficult, so we shouldn't be surprised that performance drops by 10 points in this particular instance. I just thought this was interesting to mention, since the article specifically focuses on this paper, and I don't think it is a good example of the point they're trying to make.
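If you wanted to run the kind of comparison Recht et al. did yourself, it would look roughly like this: take one pretrained classifier and measure its top-1 accuracy on the original validation set and on the ImageNet v2 test set. This is only a sketch; the paths are placeholders, and it assumes the image folders are arranged so that class indices match the model's output ordering.

```python
import torch
import torchvision
from torchvision import transforms

# Standard ImageNet preprocessing for a pretrained torchvision model.
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def top1_accuracy(folder: str) -> float:
    # Assumes class subfolders are named so ImageFolder's indices match the
    # model's 1000-way output ordering.
    dataset = torchvision.datasets.ImageFolder(folder, transform=preprocess)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64)
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# v1_acc = top1_accuracy("path/to/imagenet_val")   # hypothetical path
# v2_acc = top1_accuracy("path/to/imagenetv2")     # hypothetical path
# print(f"accuracy gap: {v1_acc - v2_acc:.3f}")    # Recht et al. report roughly 10 points
```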
Okay, so what's the conclusion to all of this? Here is the final recommendation the article makes: to evade the computational limits of deep learning, we would have to move to other, perhaps as yet undiscovered or under-appreciated types of machine learning. And of course what they mean is bringing in the insights of experts, which can be much more computationally efficient, and that we should maybe look at things like neuro-symbolic methods and other techniques that combine the power of expert knowledge and reasoning with the flexibility often found in neural networks.

Now, why does every discussion about the scaling of deep learning always end with "well, we should use more expert systems and reasoning and logic, and the neural networks don't understand anything"? Granted, it is okay to suggest this, and it's probably a good way forward, but as of now the neuro-symbolic systems, and really the expert systems as well, are just not that good. Of course that's the case with any young research topic, but just because something is computationally efficient doesn't mean we should switch to it for that reason alone. I'd be super duper happy if symbolicism made a comeback, if we could somehow combine algorithms and deep learning, if we could combine reasoning and knowledge bases and input from domain experts and all of this. But as of today, that is not really a benefit; it's more like a substitute. You can make machine learning more efficient by inputting lots and lots of priors from domain experts, and that's completely cool, but what we've seen over and over again is that as soon as you give the ML system enough data, it starts to outperform those experts. What I'd like to see from a neuro-symbolic system, or anything like it, is that it in fact outperforms even the most data-hungry machine learning methods: that the symbolicism is not just a substitute for more data, but an actual improvement over any data I could find. And that's just something I personally haven't seen. You might disagree, but I haven't seen a convincing argument yet that this is the case for any of the symbolic systems we have today. Computational efficiency alone is simply not enough.

But hey, tell me what you think. Do you agree with the article? Do you not? I'll link the full article in the description; give it a read if you want. And subscribe! I'll see you next time. Bye.
Info
Channel: Yannic Kilcher
Views: 19,326
Rating: 4.9513974 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, scale, co2, gpt-3, bert, language models, environment, large scale, large language models, deep neural networks, transformers, imagenet, datasets, language modeling, training cost, openai, microsoft, google, google ai, facebook research, transfer learning, meta learning, exponential scale, overparameterization
Id: wTzvKB6D_34
Length: 20min 26sec (1226 seconds)
Published: Sat Oct 02 2021