A Hybrid GA-PSO Method for Evolving Architecture and Short Connections of Deep Convolutional Neural Networks

So, welcome everybody. My name is Maya, I am the CTO of diamond V; we are a tech startup and we build biometric recognition devices. Today we're going to be talking about a hybrid GA-PSO method for evolving the architecture and shortcut connections of deep neural networks. Basically, it can automatically evolve a network instead of you having to build it manually. This is research that was done by researchers at Victoria University in Wellington, New Zealand. They propose that both the architecture of a network and the shortcut connections in it can be evolved; we're going to go into what that means a little later in the talk. They use a hybrid two-level evolutionary computation method combining two other algorithms, particle swarm optimization and genetic algorithms. If you're not too familiar with them, don't worry, we're going to get into it during the talk.

So, our plan for today: I'm going to talk a little bit about motivations, why the authors did this research to begin with. Then we're going to look at what shortcut connections are, with some examples. DynamicNet is the network that was evolved with this algorithm in the paper. Then we're going to look at genetic algorithms and particle swarm optimization and the experiments, take a break and have the discussion, and finally cover the results, key takeaways, and discussion points.

So why do we want to evolve neural networks to begin with? For me, when I chose this paper, it was partly a philosophical idea: it intuitively feels right to use something like evolution, or something inspired by evolution. If, some time in the far distant future, we want to approach general AI, then we should look at how our brains work and where our brains came from. And if our brains have been through some kind of evolution, then maybe our neural networks should also go through an evolution.
The authors are also making the point that deep learning has come a long way, but it's very application specific. If you come up with a new use case, then you might need to actually build and train a new model, and sometimes that's not practical; sometimes you don't have the resources to do that either. The second point they're making is that shortcut connections seem to be a good idea: there are good results with shortcut connections, but do we really know how to design the best ones? Do we know which configuration is the best one? It's a difficult problem.

I also have an example from real life of why it could be useful to evolve a network. In our work at M&T, one of the things we're doing is biometric recognition using data from thermal sensors, and we're doing that on-device, on our custom hardware. One of the issues we're facing is that we have extremely constrained resources and we need fast inference, and it's not a common use case; not a lot of people do that, so there's no real out-of-the-box solution for us. We could take a model that's used for, say, image recognition and make adjustments to fit our use case, or we could design our own architecture. But the authors of this paper are saying that maybe there's a third option: maybe you could actually automatically evolve an architecture very suitable for your specific problem using HGAPSO. One example of on-device biometric recognition is a device that we're using for access control, so door locks, where we're using the thermal signature of your face as the biometric marker.

So let's look at shortcut connections, and I'm not going to go very deep into ResNet here. If you want more information on it, I know there was a talk about it earlier in this series, so you can go into the archive and find a very nice introduction there.
To give you the basic idea: ResNet came about because when you have a convolutional neural network with many, many layers, at some point you stop gaining an advantage. It's the vanishing gradient problem: basically, you stop learning anything more, so it doesn't help to just add layers. What they proposed is that you add shortcut connections between every two layers. These are the arcs that you see here, where you basically add the input of the first layer to the output of the second layer. What this does is let the network kind of ignore layers that don't really make a contribution, and this mitigates the vanishing gradient problem.

Then there's another research group that figured, OK, shortcut connections are a great idea, so let's put them from every layer to every following layer. But if you do that, you're going to have an insane number of feature maps in the end, because what they're doing here is concatenating the feature maps from the first layer onto every layer after it. That's why, in their architecture, they divided the network into blocks called dense blocks, and within each block every layer is connected to every other layer; we're going to have a look at what that looks like now. Between the blocks there are transition layers that reduce the number of feature maps, so you have these shortcut connections but you also keep it constrained: you don't have an explosion of connections, feature maps, and resource requirements.

So this is a look at one of the blocks in DenseNet. Here you see five layers; the feature maps from the first, red block are connected to all the following layers, and so on: the green one is connected to the yellow and the orange. Now, every layer is always connected to the next following layer, even in a regular convolutional network, so we wouldn't call that a shortcut; a shortcut is when you skip at least one layer. The output of one block is a combination of all the feature maps in that block.
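To make the feature-map growth concrete, here is a minimal sketch of the channel arithmetic in a DenseNet-style block, assuming each layer produces `growth` new feature maps and concatenates them onto everything before it (the function name and the example numbers are my own illustration, not from the paper):

```python
def dense_block_channel_counts(c_in, growth, n_layers):
    """Channels seen at the input of each layer in a dense block,
    plus the block output, under concatenation-style shortcuts."""
    # Layer i receives the original input plus the growth-rate-many
    # feature maps produced by each of the i earlier layers.
    inputs = [c_in + i * growth for i in range(n_layers)]
    block_output = c_in + n_layers * growth
    return inputs, block_output

inputs, out = dense_block_channel_counts(c_in=64, growth=32, n_layers=5)
print(inputs, out)  # [64, 96, 128, 160, 192] 224
```

The block output keeps growing linearly with depth, which is exactly why DenseNet needs transition layers to shrink the channel count back down between blocks.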
DenseNet showed an improvement over ResNet, and the authors are saying this shows that shortcut connections are a good idea, but do we know how to get the best shortcut connections? They argue there are many open questions about this. For example, if you compare ResNet and DenseNet: ResNet uses an addition operation for the shortcut, just adding the input to the output, while DenseNet uses a concatenation. Would a ResNet with concatenation actually be better than DenseNet? Do we even know the answer to that? So they're saying there are many open questions like this; it's a hard design problem, and that's their motivation for evolving both the architecture and the shortcut connections.

They call their network DynamicNet; this is the network that is automatically evolved using their algorithm. Now, they're not starting from scratch, not starting from zero: they have to have some kind of framework for the evolution to make sense. They like the idea of DenseNet, where you have the possibility of connections between the first layer and every other layer, and so on. They want to have that possibility, but they want the evolution to decide where the connections should be: whether a layer should be connected to all the later layers, or just a few of them, or even none at all. Because they potentially have the same problem as DenseNet, that you could have an explosion of feature maps, they also adopt the same block structure: within each block you evolve what kind of connections you have, and the blocks are divided by transition layers that reduce the number of feature maps. So they're using the same mechanism as DenseNet with the blocks. They use a fixed filter size of three by three and a fixed stride of one, and the rest is determined automatically by the evolution.
What gets evolved is the number of blocks in the network, the number of convolutional layers within each block, and also the growth rate, how fast the network can grow, as well as the shortcut connections.

A question came up about what the growth rate is for. Yes, it's basically a way to keep the search stable: if you let it evolve without any growth constraint, then you could have a huge variety of layer counts. If in the first generation of the evolution you have, say, two layers, and in the second you have twenty layers, that gap makes it hard to get a stable search. So they have a growth rate, and they're also trying to find what the best growth rate is. Each generation might go from, say, two blocks to five blocks, or increase the number of layers; with a smaller growth rate, the change per generation is smaller. Does it become clearer in the context of a genetic algorithm? Yes, I believe so; we'll go a little deeper into how the mechanics of the algorithm itself work, so maybe it becomes clearer once we get into that.

They call their algorithm HGAPSO, hybrid genetic algorithm particle swarm optimization, which basically just means that they're combining genetic algorithms and particle swarm optimization. Now, these algorithms had their time of greatness before the boom of deep neural networks. If you're in machine learning right now, maybe you're not using them all that much, or maybe you are, what do I know, but they were really popular right before the explosion of these neural networks. So I kind of like that they're bringing this together, trying to combine what we were good at back then with what we're good at now into something that might be awesome.

Let's go into this. What are genetic algorithms? They're inspired by biological evolution: the idea that you can search for an ideal solution in a way inspired by how species have evolved, by mutation, by survival of the fittest, and so on.
The process starts with random initialization of vectors. Usually a genetic algorithm is done on binary vectors, vectors with binary values, so you fill them with random values to begin with. Then you have a fitness evaluation and look at which of these random individuals are the best ones. You do a selection based on that fitness value, but not the fitness value alone, and this is a major point in genetic algorithms: you're not necessarily looking for only the best solution, because if you always go for the best solution, then you might miss something. If you think about that in terms of natural evolution, there are so many weird mutations going on that actually made some species more fit to survive; suddenly a fish grew legs and went up on land. These are things that maybe you wouldn't have thought of to begin with, or maybe you wouldn't have seen as an advantage. So the idea is to allow for these apparently unfit individuals and see whether they actually turn into an advantage in the future. There is some weight on the best individuals, but they also allow for some randomness.

Then, once you have your individuals selected, you take some of them and perform mutation on them: you take one individual and change part of its genome, which in genetic algorithm language means one part of its vector. Basically, you flip a 0 to a 1 or a 1 to a 0 here and there, and you create new individuals that way. The other thing you can do is crossover: you take two individuals and combine parts of them together into another individual.

Let's look closer at those operations. Here you have an example of mutation: you select the second and the fourth dimension of the vector and flip their bits, so 1 0 0 1 turns into 1 1 0 0. In the crossover you take two individuals, they call them parents. From the left parent here we take the two last dimensions, and from the right parent the two first dimensions, and you end up with 1 1 0 1 as the child.
You could also have another child here, made of the bits we didn't select, which is 1 0 0 0.

I wanted to show you an example of genetic algorithms actually working. This is a bit personal to me, because it was my first meeting with machine learning: it's a chicken robot that was developed at the University of Oslo right before I started there. It doesn't have any neural network, it doesn't really have any intelligence; it's only using a genetic algorithm to search for the most optimal walking pattern. It's a bipedal chicken robot, basically. I just want to take a couple of minutes to look at it. You see, in the beginning it's just trying out gaits kind of randomly; it's not doing too well, but it's trying, and pretty soon it's picking up speed. The evolution here is running in real time: it's trying out patterns and it's actually getting better pretty fast. At least now it can move, and it even picks up speed. So that covers genetic algorithms; any questions, by the way, before we move on?

Particle swarm optimization is the other part of this algorithm, and it is also biologically inspired: it's motivated by how fish school together or birds flock. You know when you see birds moving together in the sky: all of them are individuals moving individually, but they're also kind of coordinated. The idea of particle swarm optimization is to cover a search space by moving particles around that space until it has basically found the solution. You have a population of what they call particles, which again are just vectors, and each particle has a velocity and a position. The position is the value that you're searching for, and you keep moving the particle to new positions until it's good enough, until some criterion is met.
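Before we move on, the mutation and crossover operations from the genetic-algorithm example can be sketched like this (the function names and the fixed crossover point are my own choices for illustration; the vectors are the ones from the slides):

```python
def mutate(bits, positions):
    """Flip the bits at the given positions (0-indexed)."""
    out = list(bits)
    for p in positions:
        out[p] = 1 - out[p]
    return out

def crossover(left, right, point):
    """Swap segments around a cut point, producing two children."""
    child1 = right[:point] + left[point:]   # right's head + left's tail
    child2 = left[:point] + right[point:]   # the bits not selected above
    return child1, child2

# The slide's example: flipping the 2nd and 4th bits of 1 0 0 1
print(mutate([1, 0, 0, 1], positions=[1, 3]))           # [1, 1, 0, 0]
# Crossing the two parents at the midpoint gives both children
print(crossover([1, 0, 0, 1], [1, 1, 0, 0], point=2))   # ([1, 1, 0, 1], [1, 0, 0, 0])
```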
If you saw the paper, you have the equations that update velocity and position. Before I go into the actual equations, I just want to give an idea of what this velocity and position mean. Say you have two individuals; you would usually have a lot more, of course, but just for this illustration we have two individuals in a two-dimensional search space, so each particle has only two dimensions. Our individuals here are x0 and x1. The notation here, x0: (2, 4), means that at the current time step t, the first individual is at (2, 4), and you can see that in the grid: if you imagine (0, 0) in the bottom left corner, then at coordinate (2, 4) you have x0 at the current time step. Now, the velocity is basically how that particle moved to get there. The velocity of x0 is (2, 1), which means it moved two in the x direction and one in the y direction, so we can deduce that x0 at the previous time step was at (0, 3).

They also keep track of which positions were the best ones so far. You have two terms here: the local best and the global best. The local best is the best position each individual has visited through its movement history. We see here that the local best for the first individual was at (0, 3), and the local best for the second individual was at (4, 0), which also happens to be the global best. The global best will always be just one position, but you have one local best for each individual.

To update the velocity, you have this equation. It gives v_id, the velocity of the i-th individual in the d-th dimension; if you go back to the grid, the upper left corner of the velocity matrix would be v_00, because it's the velocity of the zeroth individual, where i equals zero, in the zeroth dimension. The t+1 means at the next time step.
Then you have W, an inertia weight, which controls the effect of the previous velocity on the current velocity. It is a number that you choose, not one that you learn, usually based on what has worked in the past; there are some conventions for how to set it. You multiply that with the current velocity. Then you add the c1 term; c1 and c2 control how fast the particle is pulled towards the local or the global best, so a particle can be given a lot of freedom to explore other parts of the space, or very little freedom, being pulled quickly back towards the best solution so far. r1 is a random number between 0 and 1, p_id is the local best of the individual, where d again stands for the dimension, so this is done for each dimension of the vector, x_id is the current position, and p_gd is the global best so far, again per dimension. The full update is v_id(t+1) = W*v_id(t) + c1*r1*(p_id - x_id(t)) + c2*r2*(p_gd - x_id(t)). And once you have the velocity, like we saw in the grid, it's easy to find the next position: you basically just take the current position and add the velocity. r2 is also a random number, so r1 and r2 are two randomly generated numbers between 0 and 1, and as far as I understand they change every update, so you have one pair for each update.

Let's go through an example update to get an idea of how this works. Here we're just setting the values a bit out of the air: W equals 1, c1 and c2 equal 2, and we drew r1 equals 0.5 and r2 equals 0.7. Then we just go through the steps. v_00, the upper left corner of the velocity matrix, changes at the next time step from 2 to 2.8, because we have the inertia weight W of 1 times the current value 2, which we see in the upper left corner of the velocity matrix, plus c1, which is 2, times the random number that we drew, 0.5, times the local best minus the current position
x plus the c2 that we said was 2 times the other random number which was 0.7 times the global best- the current position now you do this for each dimension of the vector and for this example we found that the next step of the velocity will be two point eight minus eight point six and then we can find the position as well so we just fill in the velocity that we had so we had two as a current position and we add two point eight then the next position is going to be four point eight and we had four for the second dimension minus eight point six is going to be minus four point six so we need to add to our grid here we add an eighth website and it's going to move all the way down there and we have our new point here so now you see we found the position for X at time step plus one we do not yet we do not yet know the position for the second individual because we need to do the Co questions all over some really goes when I was first talking to my I was also not that was very new to me like this particle swarm optimization so I guess the way I understood is that first I was like okay there's vectors and they're moving in a more optimal direction so I just thought it was like oh is this just crazy and descent but then after a hearing her yeah so one one analogy maybe it is suitable here so when you see birds flying each bird wants to fly in the direction it wants at the same time they want to stay together so there there is a compromise between the two and I guess that bait equations can express similar compromise that's a very good picture I think this on the one hand every particle is trying to move in there on explore in their own little way but then they're being pulled also towards the local best and the global best so then intention to fly in the same direction as a constraint yeah so what would be a constraint here soft constraint yeah you and that's what those c1 and c2 are controlling right you can control there how much you constraint each particle or how much you 
Or how much you let it explore the space by itself: you could either pull it hard towards the local best or the global best, or just let it explore. The random terms also keep it from going too far while still letting it explore. In a nutshell, if you let the previous velocity affect the current velocity a lot, then the particle keeps moving in roughly the same direction and at the same speed. So, does anyone have any more thoughts around this, or should we move on?

Question: is each update moving towards the right direction? It doesn't have to be; the next update could be worse than the previous one, but that's why you always keep track of the local best and the global best. How is it different from, say, just doing a random search? Well, because you have this constraint that we were talking about: you're always pulling towards the best solution, and you keep track of which one is the best, so it's not totally random. Sorry, how do we know which one is best, how do we evaluate which one is better? That is a good question, and it actually depends on what you're trying to do; you need to know what you're trying to optimize. In the example of the chicken robot, it's trying to move as fast as possible, so you would take the values of the vector, the particle, try them out, see how fast the robot actually moved, and then go back and assign a fitness value to that particle. If it was the best one so far for that individual, you set it as the local best, and if it was the best of all the positions ever tried by all the particles, it becomes the new global best.
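The velocity and position updates we just walked through can be sketched as follows, reusing the example numbers from the slides (the function and variable names are my own; note that with these exact inputs the second velocity component works out to -5.6):

```python
def pso_update(x, v, p_local, p_global, w, c1, c2, r1, r2):
    """One PSO step, applied per dimension:
    v(t+1) = w*v + c1*r1*(p_local - x) + c2*r2*(p_global - x)
    x(t+1) = x + v(t+1)
    """
    v_next = [w * vd + c1 * r1 * (pl - xd) + c2 * r2 * (pg - xd)
              for vd, xd, pl, pg in zip(v, x, p_local, p_global)]
    x_next = [xd + vd for xd, vd in zip(x, v_next)]
    return x_next, v_next

# The worked example: x0 at (2, 4) with velocity (2, 1),
# local best (0, 3), global best (4, 0), w=1, c1=c2=2, r1=0.5, r2=0.7
x_next, v_next = pso_update([2, 4], [2, 1], [0, 3], [4, 0],
                            w=1, c1=2, c2=2, r1=0.5, r2=0.7)
# v_next ≈ [2.8, -5.6], x_next ≈ [4.8, -1.6]
```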
Then is my understanding correct that, in terms of the flock of birds, the optimization is a search for the global and local bests to come close to each other? Ideally, what you want to end up with is the global best: your end goal is to come up with the position, the particle, that is the best solution. But you also want to allow each bird to explore by itself, because maybe it stumbles upon the right solution. Maybe one of the birds has been really far from the global best all along, but then it suddenly actually comes across the right solution, and it becomes the new global best, because it explored a part of the space that the other birds didn't.

So, the HGAPSO algorithm uses two levels of evolution. It uses particle swarm optimization to evolve the architecture of the convolutional neural network, and the reason for that is that particle swarm optimization is really good at continuous values; in the architecture we need things like the growth rate we were talking about, the number of blocks, and the number of layers. The genetic algorithm, on the other hand, is good at binary values, and binary values can be used to encode the shortcut connections. We're going to look at the encoding they use a little later, where it becomes clearer, but for now just know that the PSO is used for the architecture and the genetic algorithm is used for the shortcut connections.

So here's the encoding, and to be fair, especially the second-level encoding here is a bit cryptic, so we have another illustration of it on the next slide. The idea is that for the first level, where you're encoding the architecture, they have a vector that starts with the number of blocks in the whole network, and depending on what that number is, you have a number of entries describing how many layers are in each block and what the growth rate is. In this example you have three blocks, and for each block there's an entry describing it.
Each entry gives the number of convolutional layers in that block, which could be two in one and eight in another, and so on. Then, for the second-level encoding, you're encoding what the shortcut connections are within each block, so there's one second-level encoding inside each of the blocks. I'm not even going to try to explain it from this slide, because it's just confusing; let's go see it here instead. Because layer 1 is directly connected to layer 2, and 2 to 3, and so on, the layers that are next to each other are always connected, so we do not need to encode that; we know it. What we need to encode is: is layer 1 connected to layer 3, yes or no? Here, the first dimension of this yellow block is a 1, which means there's a connection between layer 1 and layer 3. The next thing we want to know is whether there's a connection between layer 1 and layer 4; here it's a 0, so there's no connection between layer 1 and layer 4. Lastly, the last possible connection we can have from layer 1 is to layer 5; here it's a 1, so there is a connection. Now, when we move on to layer 2, there are fewer possible connections, because there's no shortcut connection going back to layer 1; you always look at layers at least two ahead of the layer you're at. Here there's a 1 in the first dimension, which means there is a connection between layer 2 and layer 4, and a 0, so there's no connection between layer 2 and layer 5. And finally, layer 3 only has one possible shortcut connection, to layer 5, and because it says 1, there is a connection there.

If you compare this to DenseNet, the difference is that in DenseNet there would always be a connection from layer 1 to layer 3, layer 4, and layer 5, but here, depending on how your evolution is going, you could have a different configuration of shortcuts, or maybe no shortcuts at all for that matter, if that solves the problem in a more optimal and smaller way.
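A minimal sketch of how such a second-level bit string could be decoded for a five-layer block (the function name and 1-indexed layer numbering are my own; the bit vector is the example from the slide):

```python
def decode_shortcuts(bits, n_layers):
    """Map a flat bit vector to (from_layer, to_layer) shortcut pairs.
    Adjacent layers are always connected, so only skips of >= 2 are encoded."""
    pairs, k = [], 0
    for i in range(1, n_layers - 1):          # source layer
        for j in range(i + 2, n_layers + 1):  # target skips at least one layer
            if bits[k]:
                pairs.append((i, j))
            k += 1
    return pairs

# The slide's example block: 1->3 yes, 1->4 no, 1->5 yes, 2->4 yes, 2->5 no, 3->5 yes
print(decode_shortcuts([1, 0, 1, 1, 0, 1], n_layers=5))
# [(1, 3), (1, 5), (2, 4), (3, 5)]
```

For a block of n layers the vector has (n-1)(n-2)/2 bits, one per possible skip, which is why the per-layer segments get shorter as you move through the block.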
Well, I guess the idea is to search for a solution where you don't need to have shortcuts everywhere. So yes, potentially, but you could potentially also end up with exactly DenseNet, if that happens to be the optimal solution.

So, the overall algorithm: it starts by initializing a population with random architectures. I use the word architecture here instead of particle because it just makes more sense, but this is initializing the PSO algorithm with random architectures, vectors with the encoding we saw at the top. Then you run an update loop until you reach a good enough solution. You update the PSO to get new architectures using the update equations that we went through, but whenever you have a new architecture, you don't actually evaluate its fitness right away, because it doesn't make sense to evaluate the architecture and the shortcut connections separately: you could have an amazing architecture and an amazing set of shortcut connections, but together those do not necessarily make an amazing network. So for each architecture, we find the best shortcut connections for that architecture: you take one new architecture and run a genetic algorithm update loop until it's good enough, and this is where the fitness evaluation comes in. Whenever you try out a new set of shortcuts, you actually train that network for a few epochs and test its accuracy, and from that you find the personal best for that architecture. Basically, you go through these genetic algorithm updates until you're satisfied that this is the personal best, the best shortcut connections for this architecture, and then you compare it to whatever all the other architectures came up with, and you select the best one as the best network.
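A structural sketch of this two-level loop, with the expensive "train for a few epochs" step replaced by a toy fitness function so it runs standalone, and the inner GA reduced to simple mutation hill-climbing (all names, the toy fitness, and these simplifications are my assumptions; they only illustrate the nesting of the two loops):

```python
import random

def toy_fitness(arch, shortcuts):
    """Stand-in for 'train the network for a few epochs, measure accuracy'."""
    return sum(arch) / 100 + sum(shortcuts) / (10 * len(shortcuts))

def best_shortcuts_for(arch, n_bits, generations=10):
    """Inner GA-style loop: evolve shortcut bits for one fixed architecture."""
    best = [random.randint(0, 1) for _ in range(n_bits)]
    best_fit = toy_fitness(arch, best)
    for _ in range(generations):
        child = list(best)
        child[random.randrange(n_bits)] ^= 1   # bit-flip mutation
        fit = toy_fitness(arch, child)
        if fit > best_fit:                     # keep the fitter individual
            best, best_fit = child, fit
    return best, best_fit

def hgapso_sketch(pop_size=5, n_bits=6, generations=5):
    """Outer PSO-style loop: each 'architecture' gets its fitness from the
    best shortcut configuration the inner loop could find for it."""
    population = [[random.randint(1, 8) for _ in range(3)] for _ in range(pop_size)]
    global_best, global_fit = None, float("-inf")
    for _ in range(generations):
        for arch in population:
            shortcuts, fit = best_shortcuts_for(arch, n_bits)
            if fit > global_fit:
                global_best, global_fit = (arch, shortcuts), fit
            # (a full implementation would also apply the PSO velocity
            #  update to move each architecture vector here)
    return global_best, global_fit

random.seed(1)
(best_arch, best_bits), best_fit = hgapso_sketch()
```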
Now let's go a little bit deeper into the PSO part of the update. For each particle, each architecture that is, you first update the velocity and position of the blocks. You take the blocks, and it has to be blocks with matching feature-map sizes, they say, and run the PSO update on those blocks first. Then, if some random number that they draw is less than the growth rate of that block, they also update the velocity and position of the dimension holding the number of blocks. If the number of blocks is then less than before, they have to cut some blocks, and when you cut blocks you have to start at the end. Why? Because all of the blocks here have a certain feature-map size at their output: if you cut the first block, you're going to change the input feature-map size of block number two. That's why you cut at the end, because it doesn't affect anything else. And if more blocks are needed, then you randomly generate a new block, add it to the architecture vector, and go through the update again.

Yes, the number of blocks is evolved as a continuous value; that was something I wondered as well when I was studying the paper, and I guess they must basically be rounding it up or down or something like that, because it doesn't make sense to have five and a half blocks. That's my best guess. I think they use continuous values because they want the growth rate, which is a small value, so they need something continuous; for the integer values alone you could easily have done it with a binary number encoding, but for the growth rate you need it, is my best guess.

So, like we said, the fitness evaluation is only done during the genetic algorithm process, and the fitness of the best individual from the genetic algorithm is also used as the fitness in the PSO update. This means the fitness of the whole network is measured as the fitness of the architecture with its best possible configuration of shortcut connections; the architecture and shortcuts are not evaluated separately, they work together, so we give them one fitness value.
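Going back to the block-resizing step for a moment, here is a small sketch of it, assuming each block is just a record of its layer count (the names, the dict representation, and the rounding of the continuous block count are my assumptions, matching my reading above):

```python
import random

def resize_blocks(blocks, n_blocks_continuous, new_block_fn):
    """Adjust the block list to a (rounded) target count.
    Blocks are cut from the END so earlier blocks keep their
    input feature-map sizes; new blocks are generated randomly."""
    target = max(1, round(n_blocks_continuous))
    if target < len(blocks):
        return blocks[:target]   # truncate from the end only
    return blocks + [new_block_fn() for _ in range(target - len(blocks))]

random.seed(0)
blocks = [{"layers": 6}, {"layers": 4}, {"layers": 8}]
print(resize_blocks(blocks, 1.7, lambda: {"layers": random.randint(4, 8)}))
# [{'layers': 6}, {'layers': 4}]
```

Cutting only from the end is the key point: the surviving blocks never see a change in their input feature-map size, so the rest of the architecture vector stays valid.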
Each network is trained for five epochs using the Adam optimizer with backpropagation. They chose five epochs because their resources are constrained and full training takes a long time; imagine having to train every network that the evolution tries out. They're saying that with five epochs you get an idea of how well a network performs relative to the other individuals. They also do some optimization in terms of their training data: since evolving the architecture is the most expensive part, they only use part of the training data to train those candidate architectures, you could say. So again, they give the candidates less data and only run them for a few epochs, just to get an idea of which one is better, and once they've found the best one, they train that network, with its actual connections and everything, on the whole training data set. The output of the algorithm is the classification accuracy of that best trained network.

I'll get to their setup in a later slide when I talk about the experiments, but they did use a few different data sets; I think we're here, actually. They're using fairly small data sets, again because the resources are constrained; they said they would love to try it on something like ImageNet, but they just don't have the GPUs to do it. What they're using is the MB data set, which is handwritten digits from 0 to 9, twelve thousand training images and five thousand test images; then the MDRBI, which is the same as MB but with some added noise; and then the Convex data set, which is images of convex and non-convex shapes, 8,000 training images and 5,000 test images. So if you compare it to these larger data sets, it's fairly small, but again, it's to be able to run this within a reasonable timeline. Yes, there's a break slide
So, their experiments are limited by the resources available. They run the algorithm 30 times on each of these small datasets. They also wanted to try a bigger dataset, so they ran one trial run on CIFAR-10 — one run of that takes more than a week, which is why they have not done it 30 times yet. They're using that trial run to figure out whether it's worth throwing more GPUs at it in the future, basically.

And that's the last thing we have before the break: there are some parameters they have to decide. Again, you can't just generate unlimited random blocks — you wouldn't get anywhere. So you have to decide not just the basic architecture you're trying to evolve, but also certain limits. For instance, the range of the number of layers in each block is between 4 and 8, and the range of the growth rate is between 8 and 32 — the growth rate just has to be larger than zero and not insanely huge. The population size is set to 20, and they run it for up to ten generations. Then there are the PSO values c1, c2 and w — I think they set c1 and c2 to 1.49618, and so on. And you also see the elitism rate, which might be the most interesting one in terms of the GA. That's what we were talking about: it doesn't always select only the most fit individuals — it puts a 10% weight on fitness and the rest is basically random.

So, what we've been talking about so far: DynamicNet has the same kind of structure as DenseNet. It's separated by transition layers and has blocks, with different types of shortcut connections inside each block. It's automatically evolved using the HGAPSO algorithm, which is basically a particle swarm optimization to evolve the architecture plus a genetic algorithm to evolve the shortcuts within each block of that architecture, and fitness is evaluated for the architecture and the shortcuts together. That's it — time for a break, I guess.

So, let's take a look at the results. What they found on the small datasets we looked at before — CONVEX, MB and MDRBI — is basically that it does a lot better than everything they compared it to. They put a '−' if a method compared worse, '=' if it compared the same, and '+' if it was significantly better — and they're only putting pluses, so it looks like it did pretty well. If you compare HGAPSO to, say, RandNet-2 on the CONVEX dataset, you'll see that even though RandNet-2 and PCANet-2 performed well there, HGAPSO actually performed better. And these are manually designed networks — the automatically evolved network actually outperforms them.

Yeah, I agree with you. Well, there's PCA, there's support vector machines — but yes, it's a good point. Then again, that also has to do with the kind of dataset you're comparing on, because it's a small dataset. "So is this a comparison of an evolutionary algorithm with...?" Exactly — and that's what they're saying: they're showing their results as statistics over 30 runs, compared against the statistical results from the other algorithms. They also look at a peer evolutionary-computation network, and theirs seems to do better in that case too. They mention somewhere that they chose to compare against that architecture and not another one due to limited computational resources — so it's possible that the other one, which they did not compare to, actually performs better. "What is 'classification rate'?" I'm guessing they mean accuracy — or no, it would be error rate, because earlier they were talking about errors.
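The '+ / = / −' notation in those comparison tables can be reproduced with a simple tally over the 30 runs. The error rates below are made up for illustration, and the fixed `margin` is a stand-in — the paper would use a proper statistical significance test, which isn't detailed in the talk.

```python
import statistics

# Hypothetical error rates (%) from 30 runs of two methods on one dataset.
ours  = [4.1, 4.3, 3.9, 4.0, 4.2, 4.1] * 5   # 30 values
other = [5.0, 5.2, 4.9, 5.1, 5.0, 5.1] * 5   # 30 values

def compare(a, b, margin=0.05):
    """Return '+', '=' or '-' for method a vs. b (error rate: lower is better)."""
    diff = statistics.mean(b) - statistics.mean(a)
    if diff > margin:
        return "+"   # a is better by more than the margin
    if diff < -margin:
        return "-"   # a is worse
    return "="       # statistically indistinguishable

result = compare(ours, other)
print(result)        # '+' here, since the made-up 'ours' has lower mean error
```

A real version would replace the mean-difference margin with, say, a paired significance test over the 30 runs.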
So I believe it's error rate — thank you, right. They also showed the trial run on CIFAR-10. Again, it's only one run, not 30 runs — there's no statistical information here — but in that run they achieved 95.63% accuracy, and they say the state-of-the-art range is from 75.86% to 96.53%, so they're ranking second. And theirs is automatically evolved, while the others are manually designed: you don't need any specialized domain knowledge, and there's no manual fine-tuning. That's what they're claiming their huge advantage is — the fact that it's automatically evolved.

My key takeaways here: even though the results are a bit unclear — because of the datasets and who they're comparing against — the paper does show that a network and its shortcut connections can be evolved, with at least comparable performance to manually designed networks, and you can do it without really deep, specialized domain knowledge. I don't agree with them completely that you don't need domain knowledge, because you still need to know what kind of base architecture you're starting from, right? But at least it takes some of the load off, and it removes the manual fine-tuning, which is a great advantage too.

"So they did not compare to the regular DenseNet either?" No, they did not. Actually, I'm looking it up now: DenseNet-190 on CIFAR-10 got 97.3%. Yeah — and that was in 2017, so before this paper. And the current state of the art on Papers With Code is BiT-L, which is some ResNet variant — not the important point, though.

"So what measures are taken so that you don't overfit on validation? The accuracy is unbiased, but your optimization algorithm uses this accuracy..." — the motivation is not bad, OK. Before we go to the discussion points: the summary is that we cannot actually conclude whether their architecture or their methodology is better than ResNet or DenseNet. But the point is not really to beat the state of the art either, as far as I understand — of course they would want to, but they're mainly trying to see whether you can automatically evolve a network that is useful. So it's in that range, yeah.

"Any figure to give the order of magnitude — how many evaluations or epochs in total, over all the candidate models? Some scalability number?" Yes — so your question is whether they give the number of epochs or generations they're running, whether it's something I could train in, I don't know, one minute, ten minutes — how long would I be running this. What I remember is that they say it took more than one week even for the small datasets, off the top of my head, but from what I understood, that was accomplishable on a single GPU with limited resources, basically.

And another thing we didn't cover in the slides is that their algorithm actually came up with pretty creative solutions. It could have a number of blocks where one block has, say, three layers and the next block has eight layers, with a lot of different configurations of shortcut connections — so it does a kind of explorative search for a pretty good solution.

"Could they draw a graph of the performance evolving over time? If we see a good trend, we know the idea is working; if it's fluctuating a lot, you might suspect something — like, when you start with DenseNet as a base, are the initial solutions really bad? That's why I'd want to see a standard deviation, whether they all converge." Yeah, I think the point is they were using DenseNet as the baseline — that's true. And it's a good point too: is their architecture search actually working as optimization, or is it no better than just trying randomly?
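On the scalability question, a back-of-envelope estimate follows from the settings quoted earlier (population 20, up to 10 generations, 5 proxy epochs per evaluation). The inner-GA evaluation count per particle is a pure guess, since the talk doesn't state it — the point is only to show how the budget multiplies out.

```python
population  = 20   # PSO particles (from the quoted settings)
generations = 10   # outer PSO generations (upper bound)
ga_evals    = 20   # ASSUMED inner-GA evaluations per particle (not stated)
epochs_each = 5    # proxy-training epochs per candidate network

total_trainings = population * generations * ga_evals
total_epochs    = total_trainings * epochs_each
print(total_trainings, total_epochs)   # 4000 candidate trainings, 20000 epochs
```

Even with these rough numbers, it's clear why the five-epoch proxy training and the reduced training subset dominate the runtime.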
If you plot the trend, you can see it; if not, you can't. So — we also had some points we were discussing while preparing this talk. One is: what is the stopping criterion? They keep saying you run it until the stopping criterion is met, but they also say somewhere that they run ten generations, I believe — and is that enough? Have you explored everything in that time? I don't know. I was not able to find a place where they say exactly what the stopping criterion is; I guess they run it until they get the best accuracy they can.

Another thing we discussed: given that they're able to get results using small datasets, would that mean this is actually more relevant for, say, smaller startups like us, or other researchers who are not Amazon or Google and don't have huge resources? I don't know — does anyone have thoughts on that? "I mean, it is kind of nice to see that it's doable at not-Amazon scale, because a lot of these presentations are, you know, 'this took 20 trillion GPU hours' — something no reasonable person could do. It seems intuitively like a good way to solve a very specific problem." Right — so the fact that they were able to make some progress on their hardware... I guess it again depends on your problem. If your problem is that you need to do image classification on very high-resolution color images, that would necessarily be computationally intensive even with this algorithm, because you'd have to train every network the evolution tries out, for all of these candidates. I'm not sure it's a benefit in that case. But if you have a problem that is solvable with a fairly light network, then you would be able to search for that network with this algorithm, I guess.

And another thing we discussed: how does this relate to things like AutoML? Are there similarities? Does anyone have thoughts on that? "I've never used it — I've looked into it briefly once out of interest, and I think it searches genetically, like evolutionarily, over multiple things. One is which features to actually input into the model, because it assumes you don't have a data scientist on site to find good features. And on tabular data I think it does do something with architecture — but they're not all neural-network solutions, I think. Or if they are: for images you need convolutional neural networks, for something else you need RNNs — it's not obvious which." Well, I think it does a combination of some of these things, so there are some similarities. But the brief idea I have about it is that, like you say, AutoML tries to automate a number of things, and maybe it's more about finding features and choosing which model to use than about evolving a specific architecture. "Yeah — it's searching for hyperparameters, and they also use terminology like 'mutation' or 'genetic', but then their mutation might be changing a hyperparameter in some way, or adding one. And the ones I had read about were also very computationally expensive — extremely so. The original AutoML research, that whole family of papers — they're all from big companies and all extremely expensive. So after this explanation, my impression is that this approach is not as brute-force as typical AutoML." And that's a good point too: both genetic algorithms and particle swarm optimization come from a time when we didn't have these resources, so they carry search strategies that try to do the opposite of brute force — we don't have the resources for brute force, so can we go about the search a bit smarter?
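The stopping criterion the talk couldn't pin down is, in evolutionary computation, typically one of two things: a fixed generation budget, or an early stop once fitness stagnates. A generic sketch of both combined, with a toy fitness sequence standing in for real generations (the `patience` mechanism is a common convention, not something the paper is stated to use):

```python
def evolve(step, max_generations=10, patience=3):
    """Run step() (returns the generation's best fitness) until either the
    generation budget is spent or fitness has not improved for `patience`
    consecutive generations."""
    best, stale = float("-inf"), 0
    for gen in range(max_generations):
        fit = step()
        if fit > best:
            best, stale = fit, 0      # improvement: reset the stagnation counter
        else:
            stale += 1
        if stale >= patience:
            break                     # stagnated: stop early
    return best, gen + 1

# Toy run: fitness plateaus at 0.7 after generation 3, so patience triggers.
history = iter([0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
best, ran = evolve(lambda: next(history))
print(best, ran)   # 0.7 after 6 generations, not the full 10
```

With only a hard generation cap (as the paper's ten generations suggests), the `patience` branch simply never fires.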
In that sense, I guess it has smarter strategies to it — more search than brute force. "Sorry — I was thinking something interesting might be modifying your fitness function: not just accuracy, but also the number of parameters. Then you get something that's optimized for accuracy, obviously, but finds a network that can do it with the least amount of machinery — maybe tune it to your production constraints. If you're running on a device, you can tune it down to just good enough for that device." Yeah — my intuition says most of the mutations, trained for long enough, would all do pretty well. And I was thinking something similar: if I were going to use this, I would also want to add something to the fitness evaluation in terms of hardware requirements for inference. Because if it doesn't infer fast on our hardware, then it's not a good solution for my use case. So I guess you could do things like that with the fitness function, to tune it toward a solution that works for you. It would be interesting to try.

"It would also be interesting to combine it with compression — you know, network compression, where you take an existing model and ask: great, this works, but how much of it do you really need?" That's an interesting question. I don't know if anyone here has used those — I never have — but if you were to do that, I guess you wouldn't want to do the fitness evaluation after the compression. Or: once in a while, take your best one, compress it, see if it's still good, and start again from there. "I think the question still stands of how much of the evolution is actual optimization. If you change the architecture slightly — I'm not sure the fitness of the model is a continuous function of the architecture, such that small changes in architecture lead to small changes in fitness. Maybe you make a small change and get a totally different performance. I don't know, and the authors might have said." Honestly, I agree that the authors could have shed more light on several things in this paper, but I still find it interesting.

"I think the bottleneck with most of these approaches is that you have to do those five epochs — that's where most of your computation goes. I was thinking — maybe someone has already tried it — is it possible to subsample your data, so your epochs are much cheaper and you can evaluate all the candidates faster?" I believe that's exactly what they're doing, actually, in this part here: they take the training set and subdivide it again. "Or adopt a strategy like active learning, for example: find ways to pick hard examples for the models and use those." That's an excellent point, because that part of the whole process is the one that takes the most time — that's why they can't run it many times, because they have to train all these networks. So if you can optimize that step, that would actually be very helpful, I think. All right — any last thoughts or questions? Great, I think that's it. [Applause] Thank you.
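The two ideas from the discussion — penalising model size (or latency) in the fitness, and subsampling the data for cheap proxy evaluation — could be combined like this. Everything here is illustrative: the `size_weight`, the subset fraction, and the stand-in accuracy/parameter numbers are all made up, not values from the paper.

```python
import random

random.seed(1)

def subsample(data, fraction=0.2):
    """Cheap proxy evaluation: train/score candidates on a random subset only."""
    k = max(1, int(len(data) * fraction))
    return random.sample(data, k)

def fitness(accuracy, n_params, size_weight=1e-7):
    """Reward accuracy but penalise parameter count, so evolution prefers
    the smallest network that is still accurate enough for the target device."""
    return accuracy - size_weight * n_params

data = list(range(1000))
subset = subsample(data)                 # 200 examples instead of 1000

big   = fitness(accuracy=0.95, n_params=10_000_000)   # 0.95 - 1.0  = -0.05
small = fitness(accuracy=0.93, n_params=1_000_000)    # 0.93 - 0.1  =  0.83
print(len(subset), small > big)
```

Swapping `n_params` for a measured inference latency on the target hardware gives the deploy-oriented variant suggested in the discussion; `size_weight` then becomes the knob that trades accuracy against speed.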
Info
Channel: ML Explained - Aggregate Intellect - AI.SCIENCE
Views: 556
Rating: 5 out of 5
Id: zQvuhWjH1GQ
Length: 72min 57sec (4377 seconds)
Published: Mon Jan 20 2020