MixUp augmentation for image classification - Keras Code Examples

Captions
Welcome to the Henry AI Labs walkthrough of Keras code examples. Keras has provided 56 code examples implementing popular ideas in deep learning, ranging from the basics, such as simple MNIST and IMDB text classification, all the way to cutting-edge research ideas such as knowledge distillation, supervised contrastive learning, and Transformers. We'll also explore fun generative examples like variational autoencoders and CycleGAN. My contribution to these code examples is to explain every single line of code in each of them, walking through each of the individual Keras examples. I'm not the author of these code examples, so please consider starring the GitHub repositories to show support to the original authors.

The next Keras code example is MixUp augmentation for image classification, authored by Sayak Paul, one of the leading contributors to open-source deep learning. If you have a Twitter account, please consider taking a second to pause this video and follow Sayak on Twitter so you can see his latest tutorials live from the source. He also publishes on Weights & Biases and other outlets, so I highly recommend following Sayak to stay updated on his latest posts. Thank you, Sayak, for contributing this example to the Keras code examples; MixUp augmentation is a really exciting idea to implement.

Data augmentation is one of the most common strategies to prevent overfitting with deep neural networks. It has mostly been successful in computer vision, where the standard augmentations really only apply to image data: rotations, horizontal flipping, translations, brightness changes, and so on don't generalize outside of images. MixUp, by contrast, is a domain-agnostic data augmentation technique. We're going to demonstrate the idea with images, but you can generalize it beyond images, and that's part of what makes MixUp so exciting.

MixUp augments examples by taking two different instances from the dataset and randomly averaging their pixels together. The random weighting is the lambda parameter, which controls how much each image contributes as you blend them into a new image: you take one dog image and one cat image and average them together to form a new example. A really great paper explaining this idea of feature-space augmentation, and how it generalizes outside of image data, is MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space. It goes even deeper than MixUp, defining other strategies for feature-space augmentation, so it's worth reading if you're interested in domain-agnostic augmentation, and I've made a video on it as well. And then there's the original MixUp paper, which explains the high-level idea and presents the original experiments.

In addition to averaging the original instances, we also average their one-hot label encodings. So we take the average of the two images and the average of the two class labels to form new instances for training, regularizing the network and preventing overfitting. This is one strategy to avoid spurious correlations: as Sayak states, neural networks are prone to memorizing corrupt labels.
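To make the mixing rule concrete, here is a minimal NumPy sketch of the formula described above, using dummy images and one-hot labels; the array shapes and the Beta(0.2, 0.2) parameters are illustrative choices rather than anything prescribed by the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dummy 28x28 grayscale "images" and their one-hot labels (10 classes).
x1, x2 = rng.random((28, 28)), rng.random((28, 28))
y1 = np.eye(10)[3]   # e.g. class 3
y2 = np.eye(10)[7]   # e.g. class 7

# Draw the mixing coefficient lambda from a Beta distribution.
lam = rng.beta(0.2, 0.2)

# MixUp: the same convex combination is applied to inputs and labels.
x_mix = lam * x1 + (1.0 - lam) * x2
y_mix = lam * y1 + (1.0 - lam) * y2   # soft label with mass on classes 3 and 7
```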
They can latch onto spurious correlations, associating high-frequency patterns in the features with the labels, so we apply this regularization to keep them from getting overconfident about the relationship between the features and their labels. This is a really interesting strategy for data augmentation, and, again, it's exciting that it can be generalized outside of image data and applied to whatever dataset you're working with.

The code example begins with our standard imports: NumPy, TensorFlow, Matplotlib, and the Keras layers. We start off by downloading the Fashion-MNIST dataset from keras.datasets — again, it's really easy to load datasets this way, just calling load_data(). We then cast the pixels to float32 and divide them by 255 to keep them in the range of 0 to 1, and reshape them into a compatible input for deep neural networks. Next comes a syntax I hadn't seen before: a really clean way of assigning one-hot labels using tf.one_hot, taking the original y_train labels, which are integers like 3, 6, or 7, and converting each class label into a one-hot encoded vector. We do the same thing with the test set, and then we define some hyperparameters: tf.data.AUTOTUNE, the batch size, and the epoch count.

The tf.data.AUTOTUNE value we pass in has to do with the underlying machinery of tf.data. Sayak is an expert on building these data pipelines — you can see some of his other tutorials on TFRecords and related ideas, and maybe he wants to drop a comment explaining more about what's happening here than what I understand — but basically there are various knobs in how these input pipelines are constructed, and AUTOTUNE tells TensorFlow to handle them itself. I think passing tf.data.AUTOTUNE essentially means "do your thing, tf.data", letting it optimize the batching and prefetching of data into the runtime under the hood.

Coming back to the code after that look at tf.data.AUTOTUNE: we now convert the data we loaded from the Keras Fashion-MNIST dataset into TensorFlow Dataset objects using tf.data. We split off some samples for validation, and then we construct two different training Dataset objects, because we're going to randomly sample from both and mix them together. The new_x_train here just refers to x_train after we've set aside the validation samples. We build two identical tf.data Dataset objects with from_tensor_slices, passing in the NumPy arrays from our preprocessing, then shuffle and batch each one — shuffling randomly reorders the examples, and batching groups them for the data pipeline we talked about with AUTOTUNE. Then we construct the overall tf.data.Dataset object, the one we'll use to build the MixUp augmentations, by zipping together these two identical datasets, train_ds_one and train_ds_two, and we'll sample pairs from this zipped dataset when we call our map function.
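Putting the loading and pipeline steps just described into one place, here is a sketch of what that preprocessing might look like; the validation split size, batch size, and shuffle buffer here are illustrative choices and are not guaranteed to match the original example exactly.

```python
import numpy as np
import tensorflow as tf

AUTO = tf.data.AUTOTUNE
BATCH_SIZE = 64  # illustrative

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale pixels to [0, 1], add a channel dimension, and one-hot encode the labels.
x_train = np.reshape(x_train.astype("float32") / 255.0, (-1, 28, 28, 1))
y_train = tf.one_hot(y_train, 10)
x_test = np.reshape(x_test.astype("float32") / 255.0, (-1, 28, 28, 1))
y_test = tf.one_hot(y_test, 10)

# Hold out some training samples for validation (split size is illustrative).
val_samples = 2000
x_val, y_val = x_train[:val_samples], y_train[:val_samples]
new_x_train, new_y_train = x_train[val_samples:], y_train[val_samples:]

# Two independently shuffled copies of the training set, zipped together so that
# each element yields a pair of (image, label) batches to mix.
train_ds_one = (
    tf.data.Dataset.from_tensor_slices((new_x_train, new_y_train))
    .shuffle(BATCH_SIZE * 100)
    .batch(BATCH_SIZE)
)
train_ds_two = (
    tf.data.Dataset.from_tensor_slices((new_x_train, new_y_train))
    .shuffle(BATCH_SIZE * 100)
    .batch(BATCH_SIZE)
)
train_ds = tf.data.Dataset.zip((train_ds_one, train_ds_two))

val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(BATCH_SIZE)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)
```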
The validation set is easier to handle because we're not applying the augmentation to it, so we just pass the NumPy arrays into from_tensor_slices, and the same goes for the test set. If you're curious about the syntax of tf.data.Dataset.zip, which is used here to pair up the two datasets, the documentation has a small example: you construct a from a range of 1, 2, 3 and b from 4, 5, 6, zip them together, and when you list the result you get the pairs (1, 4), (2, 5), (3, 6). So it's a way of zipping two data loaders together into pairs. In our case we're cloning our dataset so we can sample a pair, say (1, 4), and mix its two elements together to form a new example, but you can imagine this also being useful for, say, contrastive learning or other deep learning training frameworks like that.

Next up we get to the core idea of this Keras code example: defining the MixUp technique function. The first thing to note is how we sample the lambda parameter that weights the averaging of x1 and x2. We could just sample uniformly between zero and one and use one minus that value for the other image, but that might not perform as well as sampling from a more structured distribution. In this case we sample from the beta distribution, which describes a family of probability densities for our lambda value. Compared to, say, a normal distribution, which would concentrate density around 0.5 — meaning an almost exact average of the two images, with slight deviations of a standard deviation or so — the beta distribution gives us more flexible ways of combining the two images. The purple curve with alpha 2, beta 2 looks roughly bell-shaped, not quite a normal distribution but similar, while other parameter settings push the density toward the extremes. It's interesting that instead of learning a probability distribution, we sample from one of these well-known continuous distributions; if you look up common probability distributions you'll also see things like Bernoulli trials and the uniform distribution, where uniform would just be a flat line, sampling every value with equal probability, and so on.

So we use the beta distribution to sample the lambda parameter for weighting the averaging of the two images in the MixUp augmentation. To sample from it, we use tf.random.gamma, passing in the shape — the number of lambda values we want, one per example in the batch — along with the concentration hyperparameters of the distribution. You can look at the tf.random.gamma documentation on TensorFlow and play around with it: store the samples in a NumPy array and call plt.plot on it to see the shape of these distributions if you're curious. The reason there are two separate gamma samples, with the first divided by the sum of the two, is that a Beta(a, b) random variable can be constructed as g1 / (g1 + g2), where g1 and g2 are independent Gamma(a) and Gamma(b) samples. Either way, from this we end up with our lambda parameter for weighting the two images.
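Here's a short sketch of that sampling helper, consistent with the description above; the function name and the default concentrations of 0.2 follow the transcript's description, and the batch size in the usage line is illustrative.

```python
import tensorflow as tf

def sample_beta_distribution(size, concentration_0=0.2, concentration_1=0.2):
    """Draw `size` samples from a Beta distribution.

    A Beta sample can be built from two independent Gamma samples:
    if g1 ~ Gamma(a) and g2 ~ Gamma(b), then g1 / (g1 + g2) ~ Beta(a, b).
    """
    gamma_1_sample = tf.random.gamma(shape=[size], alpha=concentration_1)
    gamma_2_sample = tf.random.gamma(shape=[size], alpha=concentration_0)
    return gamma_1_sample / (gamma_1_sample + gamma_2_sample)

# One lambda per example in a batch of 64 (values are illustrative).
lam = sample_beta_distribution(64, 0.2, 0.2)
```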
Next we define the function that mixes up these two images, taking one (images, labels) batch from ds_one and another from ds_two, since we've zipped the two datasets together. Note that the alpha parameter isn't actually used — I don't see it come up again inside the function, since 0.2 is passed directly, so maybe alpha should be forwarded there instead. First we unpack the two sets: we get images_one and labels_one from the sample from ds_one, and images_two and labels_two from ds_two. The batch size comes from tf.shape of images_one — we could have, say, 32 or 64 of these images per batch — and that overall tensor shape matters when we reshape the lambda parameter so the element-wise multiplications are compatible for both the x averaging and the y averaging.

Then we get our lambda weighting from l = sample_beta_distribution(batch_size, 0.2, 0.2), and we use tf.reshape to reshape the lambda parameter so that it broadcasts correctly against image one and image two for a batched multiplication. We do the same for the y, because we're also averaging the class labels. The averaged label is just a soft one-hot vector: say you're mixing up Fashion-MNIST and you combine a shoe with a t-shirt with a lambda of 0.2 — instead of a one-hot vector like 0 1 0 0 0, you now have density in two positions, roughly 0.2 on one class and 0.8 on the other, as you form these mixed samples of the shirt and the shoe. So we blend the class label the same way we blend the image itself: we reshape lambda into x_l to multiply the first image and average it with the other image, do the same thing with the labels, and return the new image-label pairs. Sayak also notes that while we combine two images to create a single one in this example, we could combine many images — this sweatshirt with the bag with the sneaker, and so on — but that comes with increased computational cost.

Now we visualize the augmented examples. We apply the augmentation by constructing train_ds_mu with train_ds.map, passing in a lambda over (ds_one, ds_two) — which is how the zipped dataset presents its elements, one image-label pair from each copy — and applying the mix_up function with the alpha hyperparameter. We also pass num_parallel_calls=AUTO, which the way I interpret it means: let the tf.data machinery optimize this under the hood. Then we loop through some of the results: we sample the images and labels by iterating through this new dataset object, which just returns the image-label pairs produced by our mix_up function. We define a Matplotlib figure, enumerate through the images and labels, use plt.subplot to create a three-by-three grid, fill each position in the grid using the index from the loop, and call plt.imshow, casting the image to a NumPy array and squeezing out the singleton channel dimension so we're left with a 28-by-28 image — Fashion-MNIST images are grayscale, not RGB.
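Here is a sketch of the mix_up function and the visualization loop described above, building on the sample_beta_distribution helper and the train_ds, AUTO, and BATCH_SIZE names from the earlier sketches; it follows the description in the walkthrough rather than claiming to be the exact original code.

```python
import matplotlib.pyplot as plt
import tensorflow as tf

def mix_up(ds_one, ds_two, alpha=0.2):
    # Each argument is an (images, labels) batch from the zipped dataset.
    images_one, labels_one = ds_one
    images_two, labels_two = ds_two
    batch_size = tf.shape(images_one)[0]

    # One mixing coefficient per example, reshaped so it broadcasts over
    # image tensors (B, H, W, C) and label tensors (B, num_classes).
    # (The walkthrough notes the original hard-codes 0.2, 0.2 here and
    # leaves alpha unused; forwarding alpha is the natural alternative.)
    l = sample_beta_distribution(batch_size, alpha, alpha)
    x_l = tf.reshape(l, (batch_size, 1, 1, 1))
    y_l = tf.reshape(l, (batch_size, 1))

    images = images_one * x_l + images_two * (1 - x_l)
    labels = labels_one * y_l + labels_two * (1 - y_l)
    return (images, labels)

# Apply the augmentation lazily as the zipped dataset is iterated.
train_ds_mu = train_ds.map(
    lambda ds_one, ds_two: mix_up(ds_one, ds_two, alpha=0.2),
    num_parallel_calls=AUTO,
)

# Visualize a few mixed-up samples and print their soft labels.
sample_images, sample_labels = next(iter(train_ds_mu))
plt.figure(figsize=(9, 9))
for i, (image, label) in enumerate(zip(sample_images[:9], sample_labels[:9])):
    plt.subplot(3, 3, i + 1)
    plt.imshow(image.numpy().squeeze(), cmap="gray")
    plt.axis("off")
    print(label.numpy().round(2))
plt.show()
```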
Then we also print the label, and the label here is the interesting part: it shows how much each pair of images has been blended together. In one case, the result mostly looks like a dress — it's about 98 percent dress, and it's hard to even tell what the remaining two percent corresponds to. In another, we can see both the sneaker and the sweatshirt, because the weighting is about 72/28 — 70/30 is probably an easier way to think about it. Other examples show different degrees of blending, including one where all of the probability density sits on a single class. Pausing the video and looking at these images alongside the resulting mixed labels will give you a better sense of how this works.

Now that we've defined the MixUp data augmentation, we define a simple convolutional neural network architecture: it maps the input images into 16 features, then 32 features, with max pooling to reduce the spatial resolution, dropout for regularization, and average pooling to collapse the spatial dimensions, followed by a fully connected layer with 128 ReLU-activated neurons, and finally a probability distribution over the 10 class labels in the Fashion-MNIST dataset. For the sake of reproducibility, and to compare the model trained with MixUp against the one trained without it, the example wraps the architecture in a get_training_model function — so, to be clear, get_training_model constructs and returns the model — and saves the freshly initialized weights to initial_weights.h5, reloading those same weights when we later train without the MixUp augmentation.

First we fit the model with the MixUp augmentation and look at the training curves: we end up with 86.9 percent accuracy, and without the mixed-up dataset we end up with 86.5 percent. As Sayak notes at the end of the example, both models perform well on Fashion-MNIST, so it's probably not the best dataset for seeing how much MixUp can really help; it would be better to test this on more complex datasets. I also suspect this would help more when learning from limited labeled data, since it's fundamentally a data augmentation idea for avoiding overfitting to spurious correlations for the sake of test-set generalization.
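The with/without comparison described above might be sketched like this; the kernel sizes, dropout rate, optimizer, and epoch count are illustrative assumptions, and train_ds_one stands in for the un-mixed training data from the earlier pipeline sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

EPOCHS = 10  # illustrative

def get_training_model():
    # A small CNN along the lines described in the walkthrough.
    return tf.keras.Sequential([
        layers.Conv2D(16, (5, 5), activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, (5, 5), activation="relu"),
        layers.Dropout(0.2),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

# Save one set of initial weights so both runs start from the same point.
initial_model = get_training_model()
initial_model.save_weights("initial_weights.h5")

# Run 1: train with MixUp.
model = get_training_model()
model.load_weights("initial_weights.h5")
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_ds_mu, validation_data=val_ds, epochs=EPOCHS)
_, mixup_acc = model.evaluate(test_ds)

# Run 2: same architecture, same initial weights, no MixUp.
model = get_training_model()
model.load_weights("initial_weights.h5")
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_ds_one, validation_data=val_ds, epochs=EPOCHS)
_, plain_acc = model.evaluate(test_ds)
```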
Sayak ends the Keras code example with some really interesting notes about experimental results with MixUp and what he found in his experiments. With MixUp you can create synthetic examples, which is especially helpful when you lack a large dataset — going back to my expectation that this works best when learning from limited labeled data. It's a form of regularization in the data space, the high-level idea of data augmentation: say you only have a hundred to a thousand labeled examples, you're probably going to overfit to that with deep neural networks, and here's a great technique for preventing that overfitting by creating synthetic examples. Label smoothing is another way to soften the y labels and assign probability density to the other class labels: you spread a small amount of uniform density over the other classes, so for a shoe image you might move the density on the shoe class down to 0.9 and distribute, say, 0.011 or so across each of the other classes. Sayak notes that this kind of label smoothing doesn't work well in combination with MixUp, and also that MixUp doesn't pair well with supervised contrastive learning. Supervised contrastive learning is another Keras code example that we'll get to later — I don't believe we've covered it yet in the Keras code examples playlist on Henry AI Labs — but it's a really interesting technique for using contrastive learning with supervised labels. He also notes that MixUp improves robustness to adversarial examples and stabilizes GAN training, perhaps by feeding mixed samples to the discriminator as another view of the data distribution, and that there are a number of data augmentation techniques that extend MixUp, like CutMix and AugMix, two other papers to look into if you're curious about this kind of augmentation.

To summarize, this Keras code example shows you how to implement one of the cutting-edge data augmentations, MixUp. MixUp is really interesting because it's useful outside of just images: you can average together examples from any data domain, perhaps working in the embedding space — the intermediate features of deep neural networks — when processing text or other discrete data that you might not want to blend directly in the input space. The example shows you how to implement this augmentation, including the interesting trick of sampling the mixing hyperparameter from a beta distribution, another neat part of the code, and along the way it demonstrates loading the Fashion-MNIST dataset and builds up your understanding of computer vision applications and the tools available for data augmentation when you're trying to fit these models without much labeled data. Thank you so much for watching, please check out the rest of the Keras code examples playlist, thanks again Sayak for contributing this code example, and please subscribe to Henry AI Labs for more deep learning and AI videos. [Music]
Info
Channel: Connor Shorten
Views: 4,831
Id: hd3XEjfwuLI
Length: 16min 38sec (998 seconds)
Published: Sat Mar 13 2021