Deep Learning for Tabular Data: A Bag of Tricks | ODSC 2020

Video Statistics and Information

Captions
Hi, I'm Jason McGee, and I've been working on applying neural networks and deep learning specifically to tabular data pretty much non-stop for the last eight months, and I'd like to share what I've found to be important to success. Neural networks and deep learning are pretty popular topics right now, but the research papers, libraries, and applications I see around are generally about pushing the envelope on performance for problems involving images, audio, video, or natural language. But there are so many datasets that are just numbers and categories, and maybe snippets of text. Given the disparity in coverage, it might come as a surprise that neural networks can actually perform well on tabular datasets.

Now, there are probably some of you out there saying, "Yeah, but we have boosted trees for that kind of data." And yes, boosted trees can often perform really well on tabular datasets, but they're not always the best tool for the job. Sometimes neural networks are, and if nothing else, an appropriately designed and trained neural network can really help to diversify an ensemble of models, which often leads to improved performance.

So how does one appropriately design and train neural networks for tabular data? I've put together a list of key steps that I've found to work well, and it all starts with proper data preparation, followed by design, training, and assessment, and only then tuning our network. If we want to build a deep network, we shouldn't just try training some random deep architecture; we need to intentionally choose key aspects of our network and only then, if necessary, build up and out. Let's get started.

We're talking about datasets that may have discrete and continuous variables. If we have missing values in a text or categorical column, we can just increase the cardinality by one, that is, make a new category to represent a missing value. Numerics are a little trickier, but not by much. Our approach, for each column, is to impute missing values with the median and add a new column recording whether we imputed or not. So you might be thinking: hold on, if we had a dataset of numeric columns and each had a single missing value, we'd double the width of the dataset. And that is the approach. In some cases it may well be better to simply drop the rows with missing values, or only impute them without adding the extra columns. But something to note, if performance is the main concern: if you're working with categoricals, especially those with high cardinality, the number of columns will likely dwarf the extra imputed-value marker columns, and fortunately we can represent our data with compressed sparse matrices. If the concern is more along the lines of the curse of dimensionality, neural networks actually seem to be able to perform well on very wide datasets.

Now that we've fixed our missing values problem, we can move on to transforming our inputs to play nicely with neural networks. There are a number of ways to deal with categorical variables, like encoding based on frequency, encoding ordinally, making an embedding, encoding based on the correlation with the target, and one-hot encoding. Personally, I've found one-hot encoding to be a reliable way to get strong, consistent performance. There are also a number of ways to deal with text. Some really strong approaches include encoding with the Universal Sentence Encoder, featurizing with a pretrained model like BERT, performing TF-IDF, and stacking on a specialized text model. Surprisingly, unigram TF-IDF often performs quite well and reliably, in addition to producing a sparse matrix.
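A minimal sketch of this preparation step, assuming a pandas DataFrame with hypothetical column lists `numeric_cols` and `categorical_cols`; the helper name and the `__missing__` placeholder are illustrative, not from the talk.

```python
import pandas as pd

def prepare_tabular(df, numeric_cols, categorical_cols):
    out = df.copy()
    # Numerics: impute missing values with the column median and add a marker
    # column recording which rows were imputed.
    for col in numeric_cols:
        out[col + "_was_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    # Categoricals: treat missing values as one extra category, then one-hot encode.
    for col in categorical_cols:
        out[col] = out[col].fillna("__missing__")
    return pd.get_dummies(out, columns=categorical_cols)
```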
We generally don't need to do any special transformations to numerics, but normalizing all inputs to be evenly distributed within negative one to one, or zero to one, helps keep gradients more uniform and aids convergence and reliability across dataset sizes. Visually, we can see the effect of normalizing our inputs so that they're more evenly distributed between bounds, instead of however the dataset started out.

When discussing how to find success in designing and training a neural network, it's important to understand how the decision is made to choose one hyperparameter over another, or to use batch norm, or dropout, or both. We don't just open a Jupyter notebook, run head() on the data frame, and conclude "oh yep, we're gonna need batch norm and a learning rate of 1e-3." No, we need to train a model. We need to remove a random subset of the data before training and use it to validate our findings, and which rows end up in that subset matters. Picking rows at random is all well and good until we randomly pick a biased validation set and all the hyperparameters we chose are overfit to a biased sample. So instead we use a combination of holdout and cross-validation to ensure we can design and train the best model we can, and, come deployment time, know which model performed best on a subset that none of our models, or we, had seen before.

So we shuffle our data, remove some percentage of it to use as our holdout (I tend to use 20%), and then fold the rest into some K number of parts (I often use 5). Now we can treat one fold as a validation set as we train a model on the other K minus 1 folds and get a loss or metric score. We do this for each of the K possible splits and average our losses or scores to get an unbiased estimate of how our model performs on the dataset. Now that we have confidence in the feedback we receive when training our model, we can move forward.

We're so close to being able to actually build our model, but there's an important step to take before we do: we need to establish a benchmark. We need to understand what kinds of models tend to perform well on our data and what kind of scores they get on our metric. One approach would be to just run a bunch of models, maybe make an ensemble, using the data we prepared for the neural network; performing model-specific preprocessing that tends to help certain models, or doing any planned feature engineering, will define a stronger benchmark to compare our network to, and using grid search or other hyperparameter optimization approaches will strengthen it further. Once we have this, we'll have a much better idea of how well our neural network is performing, a better understanding of fit (more on this in the assessment section), and some candidates for an ensemble.
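A sketch of the shuffle / 20% holdout / 5-fold scheme described above, using scikit-learn; the placeholder data and the Ridge stand-in model (a cheap benchmark-style model you would swap for the neural network) are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

X, y = np.random.rand(1000, 20), np.random.rand(1000)   # placeholder data

# Shuffle and set aside a 20% holdout that stays untouched until deployment time.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=0)

# 5-fold cross-validation on the remaining data: train on K-1 folds, score on
# the held-out fold, and average the scores for an unbiased estimate.
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
    model = Ridge().fit(X_dev[train_idx], y_dev[train_idx])   # stand-in model; swap in the neural network here
    preds = model.predict(X_dev[val_idx])
    fold_scores.append(mean_squared_error(y_dev[val_idx], preds))

cv_score = float(np.mean(fold_scores))   # used to compare designs and hyperparameters
```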
All right, let's get started with network design. A major part of building an appropriate neural network for a specific dataset and problem is tuning various hyperparameters, and that's going to be a lot more difficult if the network is unstable or has trouble converging. A great way to combat this is to start with a low-capacity network. So instead of jumping right into training a deep and wide model, start small; maybe try a single layer with 64 units. Starting with this structure makes it easier to get a network that provides a reasonable baseline we can build from. The key here is stability, and to that end, adding a skip connection from input to output tends to help, in addition to helping the network find linear relationships between features. An important thing to keep in mind is that, unlike residual connections in convolutional networks, our layer shapes will likely not line up. To solve this, we can represent the skip connection with a fully connected layer that has the same size as the layer it feeds into and uses a linear activation function; we don't want to apply a non-linearity or squash the output, as that defeats the purpose. At this point we can take the sum of all inputs to the output layer and pass it through an output activation.

And what should that output activation be? Fortunately, there are some reliable ways to determine the output activation and loss function, so let's get classification out of the way, as it's pretty straightforward. Generally speaking, for binary classification we'll want binary cross-entropy as our loss function and sigmoid as our output activation, as we're solving for the probability that our prediction should be positive. For mutually exclusive multi-class problems, we'll generally want sparse categorical cross-entropy as our loss function and softmax as our output activation, as we're solving for the probability of each class and all probabilities must add up to 1. For independent multi-class, multi-label problems, we'll want binary or categorical cross-entropy as our loss function and sigmoid as our output activation, as we're finding the independent probability that each class prediction should be positive. These are by no means set in stone, but using cross-entropy is useful as it punishes our model not just for being wrong, but according to how far the predicted probability is from the ground truth. There are alternatives, like hinge losses, focal loss, and more, which can definitely be considered; as an extra note, using different loss functions for different members of an ensemble can be useful for improving performance.

So, regression. To choose our output activation and loss function, we need to look at the data, specifically our targets. By sorting and binning all target values, we can choose the output activation and loss function according to the target distribution. RMSE, root mean squared error, is a great fallback, so when in doubt we should use RMSE, but look out for exponential and inflated distributions. If our target is zero-inflated, we'll want to use Tweedie. If our target looks like a Poisson distribution, we should probably use Poisson. If it's an exponential distribution, RMSLE and gamma are good bets. Google actually released a paper, "A General and Adaptive Robust Loss Function," which attempts to determine this automatically and learn the appropriate loss function during training, which is really cool; I love this idea and I think this is the direction we should be moving towards. On to the output activation; it's a bit simpler. If the target follows an exponential distribution, or we have targets spanning different orders of magnitude, we should use an exponential output activation; otherwise, use a linear activation. The main idea here is to encourage convergence by mitigating vanishing or exploding gradients and generally keeping gradients more uniform in magnitude, a similar motivation as to why we normalized all of our inputs.
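One possible Keras sketch of this baseline, assuming a binary-classification target so the sigmoid / binary cross-entropy pairing applies; the input width and the ELU hidden activation are illustrative choices, not prescribed by the talk.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_features = 100  # hypothetical input width

inputs = tf.keras.Input(shape=(n_features,))
hidden = layers.Dense(64, activation="elu")(inputs)       # low-capacity starting point: one layer of 64 units
hidden_out = layers.Dense(1)(hidden)                      # contribution of the hidden path to the output
skip_out = layers.Dense(1, activation="linear")(inputs)   # skip connection: fully connected, linear, sized to the output
summed = layers.Add()([hidden_out, skip_out])             # sum all inputs to the output layer
outputs = layers.Activation("sigmoid")(summed)            # output activation chosen from the target type

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```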
So we have our output activation and loss function; what about our hidden activation? We have a number of choices: sigmoid, softplus, tanh, swish, Mish, there are so many. ReLU is pretty popular right now and is usually what I see in people's projects and on forums. ReLU, that is the rectified linear unit, simply sets the activation to zero if its input is less than zero and otherwise returns the input. Leaky ReLU, which multiplies negative values by some small positive fraction I specify instead of setting them to zero, can also perform well, but requires hand-tuning. Personally, I've found that PReLU and ELU, that is parametric ReLU and the exponential linear unit, tend to work well. PReLU is similar to the leaky ReLU mentioned earlier, but the alpha that multiplies negative values is automatically learned by the network; the main idea here is to let the network learn the appropriate non-linearity. ELU, for its part, squashes negative values with a smooth exponential curve.

So we have our baseline network designed; let's move on to our approach to training. When training our network, the amount of data, or percentage of our data, that we feed into the network each iteration has a major impact on how and what our network learns. If our batch size is too small, it's difficult to converge to minima, as our gradients will change dramatically each iteration and we'll end up jumping all over the place without making meaningful progress; in short, we'll be underfitting, and training will also be very slow. If our batch size is too large, training will tend to converge to sharp minima, which leads to poor generalization, and it also requires more RAM. It's clear that a happy medium is needed, a nice balance between small and large batch sizes. After testing many batch sizes across hundreds of datasets, from hundreds of rows to tens of millions of rows and with a wide range of widths, an optimal batch size appears to be a function of the total size of the dataset. A good rule of thumb seems to be to set the batch size to 1% of the dataset size. With grid search we can often find a better one, but when getting the first model going this is a great place to start. I want to note that this is without weight decay, and introducing or increasing weight decay generally aids convergence for larger batch sizes.

Now that we have a batch size chosen, we can determine our learning rate. We'll be making use of a very successful learning rate policy called one-cycle, invented by the researcher Leslie Smith, but first we need to determine the learning rate we'll use as a basis for it. One approach that provides a good starting point, which once again can often be tuned for small improvements, is the LR range test. The idea is to start from a very small learning rate and exponentially increase it to a number larger than one, tracking the loss or a metric score over time; it's usually done over one epoch or so, and often a bit of smoothing helps to interpret the results. There is usually a consistently decreasing section of the loss in the plot, and a good learning rate will be in this section, often a tenth to a twentieth of the learning rate at the minimum loss. I've seen other approaches to determining a good learning rate that use binary search over candidate learning rates, but the LR range test generally works well.
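A rough sketch of these two heuristics, reusing the compiled `model` from the earlier sketch; the placeholder data, the clamping bounds on the batch size, and the exact way the learning rate is assigned (which varies slightly across Keras versions) are my own assumptions.

```python
import numpy as np

X_train = np.random.rand(5000, 100).astype("float32")          # placeholder data matching the model's input width
y_train = np.random.randint(0, 2, size=5000).astype("float32")

# Batch size: roughly 1% of the dataset size (the clamping bounds are an assumption).
batch_size = int(np.clip(0.01 * len(X_train), 16, 8192))

# LR range test: sweep the learning rate exponentially over roughly one epoch,
# recording the loss at each step. A good base learning rate sits in the
# consistently decreasing region, often 1/10 to 1/20 of the LR at the minimum loss.
steps = len(X_train) // batch_size
lrs = np.geomspace(1e-7, 10.0, steps)
losses = []
for step, lr in enumerate(lrs):
    model.optimizer.learning_rate.assign(lr)                    # `model` is the compiled baseline from the sketch above
    batch = slice(step * batch_size, (step + 1) * batch_size)
    losses.append(model.train_on_batch(X_train[batch], y_train[batch]))
```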
Now that we have our learning rate, we can build our one-cycle learning rate policy. The learning rate we chose will be the maximum learning rate we use while training our network. The main idea is that we'll first warm up our network, starting from some fraction of our maximum learning rate, usually one to twenty percent. I've heard a number of theories as to why warming up is necessary, but the one that makes the most sense to me, at least when using Adam, is that we need to properly initialize both the first and second moment estimates, that is, the decaying averages of past gradients and past squared gradients. If we use too large a learning rate, we'll get wild gradients that aren't representative of the local shape of our cost surface and take poor steps, which isn't helpful for convergence. I've found that spending about 25% of training warming up tends to work well. Once we've finished warming up, we'll decay the learning rate back to the initial learning rate over the next 50% of training; using a high learning rate lets us quickly arrive at a general area of the cost surface from which we can converge to a good minimum, but if we keep using a high learning rate we'll overcompensate based on our gradients and fail to converge. Once the learning rate arrives back at the initial value, we can spend the last 25% of training warming down, that is, reducing the learning rate to one to ten percent of the initial learning rate, which helps fine-tune our weights into good minima. Now, we could vary the learning rate linearly or according to some other function, but we use cosine, as it spends more time near the values we're interpolating between and less time transitioning. That means more time at low learning rates when warming up, more time at high learning rates when we're training at our max, and more time at the lowest learning rates when we're warming down, which seems to improve convergence.

If we're going to build this one-cycle learning rate policy, we need to build it for a specific number of epochs, so we need to choose what that number will be. If we were using a constant learning rate, we'd just pick some arbitrarily large number of epochs and early-stop based on a validation score, but that won't work with the one-cycle policy. When choosing how many epochs to fit our one-cycle policy to, we want enough epochs that we don't underfit, but not so many that we overfit. So start with only a few epochs, assess performance, and, should we deem it necessary, use a strategy such as binary search to find a better number: if we start with four, then try eight and it does better, try sixteen; if that does worse, try twelve, and so on.
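A sketch of this one-cycle-style schedule using cosine interpolation; the 25/50/25 phase split and the warm-up and warm-down fractions follow the talk, while the exact function shape and default fractions are my own reading of it.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, start_frac=0.1, final_frac=0.05):
    """Cosine-interpolated one-cycle schedule: 25% warm-up, 50% decay, 25% warm-down."""
    start_lr = max_lr * start_frac          # warm-up starts at a fraction (1-20%) of the max LR
    final_lr = start_lr * final_frac        # warm-down ends at 1-10% of the initial LR
    t = step / max(1, total_steps - 1)      # progress through training in [0, 1]

    def cosine(a, b, frac):                 # cosine interpolation from a (frac=0) to b (frac=1)
        return b + (a - b) * (1 + math.cos(math.pi * frac)) / 2

    if t <= 0.25:
        return cosine(start_lr, max_lr, t / 0.25)            # warm up to the max learning rate
    if t <= 0.75:
        return cosine(max_lr, start_lr, (t - 0.25) / 0.5)    # decay back to the initial learning rate
    return cosine(start_lr, final_lr, (t - 0.75) / 0.25)     # warm down to fine-tune the weights
```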
So we've mentioned tuning and assessment a few times now; let's talk about how we'll approach assessing our model's performance. Earlier we examined our binned target distribution to determine which loss function and output activation our network would use, assuming we're working on a regression problem. It turns out another great application for this is to periodically generate and bin predictions on our training data and our validation data and track them over the course of training. If we compare these binned predictions to our binned targets, we can get a sense of how well we're capturing the distribution, which is a lot more information than an RMSE score and complements it nicely. Looking at the distribution of our binned predictions over time can provide insights like how quickly aspects of the distribution were captured and whether any were forgotten. But we should also track our metric or loss over time for both the training and validation data; the disparity between the training and validation loss, often referred to as generalization error, helps to tell us how well our model will generalize to out-of-sample data. Not only will we have this information for each cross-validation fold, but we can keep and compare it whenever we're tuning a hyperparameter, changing how we're preprocessing the data, or doing feature engineering or selection. We generate a lot of data during training, so let's leverage it to inform ourselves as deeply as possible and take intuitive steps to improve our network. If we're working with classification data, we can use binned predictions in the form of probabilities, along with confusion matrices, which we can track over time and between our training and cross-validation folds. This has the added benefit of showing which classes tend to be confused with one another, for the training data and each cross-validation fold, while the binned probabilities help to explain the model's confidence in class-specific predictions.

At this point we have scores, and either confusion matrices or target distributions over time, and should have a good sense of fit. But fit is relative, so a really valuable next step is comparing these findings with each model in our established benchmark. Which models are beating the neural network is important to consider before tuning: is it middle of the pack? Worse than linear models? Worse than tree-based models? Worse than XGBoost? Topping the benchmark other than ensembles, or including ensembles? Getting a sense of which models are performing better than our network helps to inform what steps we should take. If linear models are beating our network, something is likely seriously wrong: with a skip connection we should be able to capture linear relationships. If this is the case, we should remove our hidden layer and consider walking through our design steps again, reconsidering our output activation and loss function; it is still worth looking at confusion matrices or comparisons of prediction and target distributions. Middle of the pack is a more reasonable place to be, and we can probably proceed to tuning. Performing worse than trees specifically may indicate a discontinuity, like artificial clipping in the data, or multimodal or overlapping distributions. These are things that are a bit more difficult for neural networks to deal with, but there are definitely things we can do. One option is stacking with a tree model, which means incorporating the predictions of a tree-based model into the input feature set of our network. A second option is manually setting boundaries, perhaps by observing the splits of a tree-based model, and training a separate neural network for each split. A third is training one network, freezing it, running predictions on all the data, removing the samples it performs best on from the training data, and repeating this until we fit the data as well as we can; finally, we train an ensemble model on the original dataset which takes the output of each frozen network, and this might itself be a tree-based model or a neural network. Intentionally overfitting these sub-networks can often improve performance. The main takeaway here is to consider how other types of models perform on your problem and incorporate what they do well; a single model or model type isn't always the best option. Even if our model tops the benchmark, we can likely still improve it with tuning.
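A minimal sketch of the first option above, stacking a tree-based model's predictions into the network's input features. The gradient-boosting model is an illustrative stand-in, and the use of out-of-fold predictions (so no row's stacked feature comes from a model that saw that row's target) is my own addition rather than a detail from the talk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

X, y = np.random.rand(1000, 20), np.random.rand(1000)        # placeholder data

# Out-of-fold predictions from a tree-based model for every training row.
tree_preds = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)

X_stacked = np.column_stack([X, tree_preds])                  # widened feature matrix fed to the neural network
```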
So at this point I've talked about neural network this and neural network that, but not deep learning. Fortunately, some simple first steps in tuning relate to architecture and increasing the capacity of the network. What happens when we widen the network, to something like 1024 units? What happens when we increase the depth, maybe to 2 or 3 layers of 64 units? When we make it simpler, say 8 units? From here we can generally form a strong idea of the direction we should be heading. It's important that we rerun the LR range test whenever we change the architecture, batch size, or other hyperparameters, as higher-capacity networks often have more chaotic loss surfaces, which might explain why smaller learning rates tend to help them. If our capacity is larger, we often need to run more iterations, so either more epochs or a smaller batch size; but a smaller batch size means more regularization, which can limit convergence if used in excess. For classification, and especially for deeper networks and depending on the problem, introducing batch norm can dramatically improve performance. For regression, especially just before the output, batch norm seems to interfere with predictions and can be detrimental, so use caution.

For a more formulaic approach to tuning: if our generalization error, that is the difference between the training and validation loss or metric score, is low, we may be underfitting; conversely, if our generalization error is high, we may be overfitting. If we're overfitting, we likely want to start by regularizing, generally in the form of increasing the learning rate, decreasing the batch size, introducing weight decay, running fewer epochs, reducing network capacity, or introducing dropout. If we're underfitting, we likely want to increase the complexity of our network, reduce regularization, or just run our network for more iterations.
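A sketch of one higher-capacity variant to try during tuning, building on the earlier baseline: a wider first layer, two additional hidden layers, batch norm, and light dropout for a classification problem. The layer sizes, dropout rate, and activation are illustrative assumptions, not recommendations from the talk.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_features = 100  # hypothetical input width

inputs = tf.keras.Input(shape=(n_features,))
x = inputs
for units in (1024, 64, 64):                     # one wide layer plus two smaller ones
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)           # often helps classification; use with caution near a regression output
    x = layers.Activation("elu")(x)
    x = layers.Dropout(0.1)(x)                   # light regularization if the generalization error grows
hidden_out = layers.Dense(1)(x)
skip_out = layers.Dense(1, activation="linear")(inputs)    # keep the linear skip connection from the baseline
outputs = layers.Activation("sigmoid")(layers.Add()([hidden_out, skip_out]))

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```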
Info
Channel: DataRobot
Views: 6,715
Keywords: machine learning, AI, tabular data, deep learning, ODSC 2020
Id: WPQOkoXhdBQ
Length: 21min 45sec (1305 seconds)
Published: Tue Apr 21 2020