Anomaly Detection using Autoencoders

Captions
Morning everybody, and welcome to day two of PyCon. Our first two speakers are Angela and Melody. They are data scientists at a major telco, and they are going to be talking about anomaly detection using autoencoders.

Good morning everyone. Today we'll be sharing a talk about anomaly detection using autoencoders. I'm Melody. Hi, I'm Angela. We are both data scientists, obviously at a telco, and this is also our first time speaking at PyCon.

Just to take you through some of the content we'll be sharing with you today: we'll give you an introduction to autoencoders, how the algorithm actually works, and a brief history of them. Thereafter we will show you some popular autoencoder architectures which may be useful for your own use cases, as well as particular use cases that autoencoders are good at solving. We'll also be sharing some popular Python packages that you could use. Then we'll take you through a Jupyter notebook, where we will introduce the notion of fraud anomalies and how to actually implement this. Right after that we'll have a Qlik Sense visualization to show you how we as data scientists interpret the results, as well as how business stakeholders do. Lastly, we'll be sharing key takeaways from our experience implementing this type of problem.

How many of you are aware of neural networks? I'm sure most of us were at Alex's talk yesterday, so I'm sure you are familiar with convolutional neural networks (he went into quite a lot of detail), as well as feed-forward neural networks and recurrent neural networks. All of these types solve particular problems like computer vision, machine translation and so forth. Autoencoders are part of that family of neural networks.

So, as Melody mentioned, autoencoders are a type of neural network whose goal is to determine an output based on a similar input. The input data is compressed so that it sits in a lower-dimensional space, such that when the decoder comes along it takes that learned representation of the data, the pattern, and is able to replicate the learned image, in this example a mushroom.

Now, to get a bit more in depth in terms of the algorithm: an autoencoder is split into an encoder and a decoder. The encoder is simply a function of your input, and your decoder is a function of your hidden layers. Overall, the algorithm is represented by g(f(x)) = r, and you want r to be as close as possible to your input layer; you want that data to be very close. That's exactly why the objective of an autoencoder is to minimize a loss function: you want to reduce the error between your input and your output. These neural networks are trained through backpropagation, which is a recursive process that minimizes the error between your input and your output. And something you might find interesting: autoencoders have been around for decades now; people such as Yann LeCun and Hinton have used them.
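To make that objective a bit more concrete, here is one common way to write it down. This is standard notation rather than the speakers' own, with f the encoder, g the decoder, and L a reconstruction loss such as squared error:

```latex
% Encoder f maps input x to code h; decoder g reconstructs r from h.
h = f(x), \qquad r = g(h) = g(f(x)), \qquad
\min_{f,\,g} \; L\bigl(x,\, g(f(x))\bigr) \;=\; \min_{f,\,g} \; \lVert x - g(f(x)) \rVert^{2}
```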
Now let's move on to uses of autoencoders. The first is dimensionality reduction: you take your data and condense it into a lower-dimensional space. The reason for doing that is so that your data can be more easily represented visually, and it really helps before you feed the data into a neural network. The next example is denoising of data. You can see that initially these images are very hazy and fuzzy, you can't really see what's going on, but through the power of autoencoders the noise is removed and you get a much crisper image. A third example is anomaly detection. Anomaly detection is basically a technique for identifying patterns within data that do not follow the norm. With autoencoders we have this idea of reconstruction errors: if an observation is passed in and the output doesn't come out very similar to the input, if there's a drastic difference, then it would be considered an outlier and hence anomalous. In the plot, the red dots are the outliers. Lastly, there's feature extraction: autoencoders give you a view of which features in your dataset are useful and which are not.

To take you through some of the different autoencoder architectures that are out there: a very popular one is the restricted Boltzmann machine, and this particular paper was produced by our beloved Hinton. A restricted Boltzmann machine is basically a two-layer autoencoder. It has a visible layer and a hidden layer. The visible layer is where our inputs, our variables, come in; it uses a combination of those to get to the hidden layer, and what it's learning is the difference between the hidden layer and the visible layer. It uses a metric called KL divergence to measure the difference between the two. I would encourage you to read this paper by Hinton if you want to get into autoencoders. It's about how he used restricted Boltzmann machines and autoencoders for dimensionality reduction, and he compares this with PCA. The result he gets is that with autoencoders he's able to reduce the dimensions of non-linear data, so the patterns he uncovered were much better than what he got with PCA.

Within the field of autoencoders there are two popular types of architecture: undercomplete and overcomplete. What Angela has just described to you is an undercomplete architecture. Remember, she said we're trying to find the underlying pattern within our input. To do that, we need to ensure that the number of neurons in our hidden layer is less than the number of neurons in our input layer, so that whatever our reconstructed output is, it's not a direct copy of the input; if it were a direct copy, the network wouldn't have learned the underlying pattern. That is undercomplete, and for most use cases that's how we implement an autoencoder.
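As a rough illustration of that undercomplete setup, here is a minimal Keras sketch. This is not the speakers' code: the 30-feature input, the layer sizes and the placeholder data are assumptions purely for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 30      # assumption: number of input features
bottleneck_dim = 8  # fewer neurons than the input -> undercomplete

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(bottleneck_dim, activation="tanh")(inputs)  # encoder f(x)
decoded = layers.Dense(input_dim, activation="linear")(encoded)    # decoder g(h)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")  # minimize reconstruction error

# X_train: a (n_samples, input_dim) array of "normal" observations
X_train = np.random.normal(size=(1000, input_dim))  # placeholder data
autoencoder.fit(X_train, X_train, epochs=10, batch_size=64, verbose=0)

# Reconstruction error per observation: large values suggest anomalies
reconstructions = autoencoder.predict(X_train, verbose=0)
errors = np.mean((X_train - reconstructions) ** 2, axis=1)
```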
Then we have the overcomplete architecture, and there are three popular types of it: the sparse, the denoising and the contractive autoencoder. How many of you are familiar with regularization in neural networks? Okay, a few of you. Within neural networks, if we find that our network is overfitting, one technique to lessen that is to put in a regularizer, which penalizes the weights. With an autoencoder, the sparse autoencoder also uses a regularizer, but it regularizes the activations going into the hidden layer. That is to say, you could have an architecture of any size, but some of the activation functions are not activated, which means not all of the inputs are necessarily used. So if you build your autoencoder and you're like, oh my gosh, it's still not finding the underlying pattern and there's a lot of noise in my data, this is a good technique to use.

Another problem that occurs when implementing an autoencoder is that you get an exact copy of the input, which is so annoying. What you can do if you have that problem is use the denoising autoencoder. What denoising does is add noise to your input layer, and then you use the same undercomplete architecture. It helps quite a lot if your reconstruction layer is coming out exactly like your input. The contractive autoencoder is similar to denoising. The problem with adding noise to an input is that you don't really know how much noise you should add in. What the contractive autoencoder does is take the derivative of each activation function with respect to the inputs and penalize it, and what that entails is that it's more robust to noise: the more noise you have in your input, because of those derivatives it's easier to learn the inherent pattern.
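To sketch the denoising idea on top of the earlier Keras example (it reuses `autoencoder`, `X_train` and `np` from that block): corrupt the inputs but train against the clean ones. The amount of noise, `noise_factor`, is an assumed tuning value, not something the speakers specify.

```python
# Denoising setup: corrupt the inputs, but train the network to reproduce
# the *clean* inputs, so it cannot simply copy its input through.
noise_factor = 0.1  # assumed value; in practice this needs tuning
X_noisy = X_train + noise_factor * np.random.normal(size=X_train.shape)

autoencoder.fit(X_noisy, X_train, epochs=10, batch_size=64, verbose=0)
```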
We have many Python libraries available to us if you're interested in building your own autoencoders. The first is Keras, which is basically an abstraction layer that sits on top of TensorFlow; then we have PyTorch; then the very well-known scikit-learn, which I'm sure we all know; and then H2O. For today's purposes we'll be showcasing H2O.

All right, so now we've reached the stage of the Jupyter notebook, but before we begin I just want to ask you all a question: who of you has experienced any fraudulent acts in your life? Just raise your hands. Cool, so that seems like quite a few of you. Now imagine, within industry, companies probably experience vast amounts of fraudulent activity on a daily basis. For instance, in the banking sector we're all very familiar with the tap-and-go system. Now imagine a card being tapped 200 times on the same day; isn't that a huge red flag? Someone's clearly taking your money, unless you really like shopping a lot. In telecoms we get fraud cases like SIM-swap fraud or delivery fraud: it's your customer information, but the product isn't delivered to your address; it's sent to an address that's, who knows, 200 kilometers away from where you stay. Once again, another red flag. And in the retail space you get fraudulent acts like stock theft or online purchase fraud.

An example of an actual fraud case that happened is what's called the Japan ATM scam. This affected the Standard Bank that we know, though it happened within Japan. What these fraudsters did (it's like a real-life Ocean's Eleven) was that around a hundred people, according to the article, went to various ATMs within Japan and started taking out cash. One of the banks affected within South Africa was Standard Bank, and Standard Bank lost 295 million rand from this particular activity. They did this in under three hours.

So, on to the solutions. If I were the CEO of Standard Bank, I'd definitely be like, okay, you fooled me once; I definitely wouldn't want criminals stealing from me the exact same way again. When we find such an emergent type of fraud occurring, a business gets scared, so what we do to reduce it is put in either a supervised learning model or rules, so that if in the first month we get a big spike of fraud, in the next month we reduce it. We combat that. But those guys who stole that money, I'm sure they have a new, creative way of stealing from a different type of company, or from Standard Bank again. So what you want to do within an organization is to try to combat that emergent type of fraud: you have the usual fraud cases, but also new types of fraud. And if we have an algorithm that works well, let's say it's 70% accurate, maybe 50% of that money could have been saved.

Cool, so just like Melody was mentioning, you have the whole idea of emergent fraud versus rule-based fraud: if banks really know the kind of fraud that's happening right now, there are rule-based systems that will combat it. Just to explain the concept behind anomalies versus fraud: as you can see in this Venn diagram, something that is anomalous does not necessarily mean it's fraudulent, but something that's fraudulent may well be anomalous. Now I want you to have a look at this table: what stands out to you, what is the anomaly? Yay, you get a chocolate. Fantastic. But now think about this: in a real-life situation we're not only looking at six rows, we're looking at ten million rows, and we want to cater for real-time situations. In real time, are we able to identify the anomalies in the dataset? And we won't just have password-change occurrence as a variable; we'll have millions more. That's where anomaly detection using autoencoders can play a role.

So now we move on to the Kaggle dataset. Apologies for the spelling, we are data scientists, not English teachers. The Kaggle dataset, which I'm sure you're all quite familiar with, is called the credit card dataset, and it's based on customers' transactions. As we begin, you can see that this dataset is highly imbalanced: there are very few fraudulent cases, making up 0.17% of the dataset. For a machine learning algorithm to learn from such a thing is really difficult, but we'll explain how to combat that later on. Then we read in our normal imports; because we're using H2O, we'll be using the H2O deep learning estimator. We then begin by initiating your Spark context and your H2O context. Now this is where the fun begins: we read in our dataset using Spark, and we transform that Spark DataFrame into an H2O frame, because remember we're working with H2O models. You can't pass a Spark DataFrame into an H2O library, hence you need to convert it. Then over here we define our features list. Because this is an online dataset it's anonymized, but in real situations these features could represent things like the number of times you've withdrawn from an ATM, whether your card is linked to the app, how often you're in overdraft, those kinds of features.
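Roughly, the setup they describe might look like the sketch below. This is not the speakers' notebook: the file path is a placeholder, and we use plain `h2o` rather than the Spark-to-H2O conversion they mention (in Sparkling Water that conversion goes through an `H2OContext`, whose exact API varies by version).

```python
import h2o

h2o.init()  # start / connect to a local H2O cluster

# Kaggle credit card fraud dataset (placeholder path); 'Class' is 1 for fraud
df = h2o.import_file("creditcard.csv")

# The anonymized features plus the transaction amount; the label is held out
features = [c for c in df.columns if c not in ("Time", "Class")]

# Split into train and test sets
train, test = df.split_frame(ratios=[0.8], seed=42)

# Keep only what looks "normal" for training, so the model learns normal behaviour
train_normal = train[train["Class"] == 0]
```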
Then you take your dataset and split it into a train and a test set. Remember, before we were showing you how the data itself is highly imbalanced. To combat that, you train the model on what is considered normal. You train it so the model learns normal behaviour, and when it's given unseen data and it picks up patterns that don't follow what it learned, it flags those as anomalies.

So now we begin by defining our H2O deep learning estimator. We pass it a variety of parameters; I'll just go through a few. One is the model ID, which is purely the name of your autoencoder, so that when you save it for reuse later on you can reference it. We use an activation function of tanh and a few hidden layers. Then you train your model over here, and then you save your model. Now that we've saved our model, we want to reload it, and once we've reloaded the model, this is where the fun begins: this is where we actually identify anomalous behaviour. We apply it to the testing set and produce the reconstruction errors. If you remember, the reconstruction errors measure how different the output is from the input. What you see here is the overall reconstruction error, but if you're interested in identifying the reconstruction errors per feature, you can view that over here: this shows you the reconstruction error per feature, to give you a sense of which feature contributed more to a particular observation for a customer. And if you're interested, after this presentation you can go home and build your own autoencoders: you can visit kaggle.com and get this dataset.

Just to recap reconstruction errors in terms of a real-life situation: with this image, the input data would be your pixels, and the output would be the reconstruction without the noise. I hope that clarifies reconstruction errors.
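Continuing the earlier sketch (and again only as an approximation of what they describe, not their actual notebook: the hidden layer sizes, epochs and paths are assumptions), the training and scoring steps could look like this:

```python
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Define the autoencoder: model_id names it for saving and reloading later,
# tanh activation and a small bottleneck of hidden layers (sizes assumed)
ae = H2ODeepLearningEstimator(
    model_id="creditcard_autoencoder",
    autoencoder=True,
    activation="tanh",
    hidden=[16, 8, 16],
    epochs=20,
)
ae.train(x=features, training_frame=train_normal)

# Save the model so it can be reloaded and reused
model_path = h2o.save_model(ae, path="models/", force=True)
ae = h2o.load_model(model_path)

# Overall reconstruction error (MSE) per test observation ...
recon_error = ae.anomaly(test)
# ... and the per-feature reconstruction errors
recon_error_per_feature = ae.anomaly(test, per_feature=True)
```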
All right, so now we're going to show you the Qlik Sense dashboard that we've built, with views that we as data scientists and business stakeholders may be interested in; it might be a bit tricky clicking and holding the mic at the same time. Within this dashboard, what you see down here in red is the normal pattern that the algorithm caught, and what we have above in blue are the anomalies. The way we picked the anomalies is that we chose a threshold of 0.01, and we picked that threshold based on what we saw in this particular diagram. The number of anomalies we caught is 105. Now, say I want to check, as a data scientist, how much of that was actual fraud and how my predictions did. If I look at the predictions, my anomaly detection model picked up 69 fraudulent cases out of 89. If I want to check, as a data scientist, how many fraud cases my predictor missed: out of all the fraudulent cases, the anomaly detection model didn't pick up 20. So it's not a bad model; it's pretty neat for a fraud project. As you can see here, 0.16 percent of cases are fraudulent, and we found more anomalies than that, which ties back to the results we have.

What you see here are the reconstruction errors that Angela described, for each and every variable. From a fraud analyst's point of view, that first one might be the number of password changes, our initial example. The fraud analyst would see that these are the variables that are most impactful for different types of fraud. And if the fraud analyst wants to look at a particular customer, say customer 1705, who experienced fraud and whose fraud we picked up, they would see that these are the variables that actually had an impact for that particular customer. We also added places to the dataset; Angela added those randomly, but usually in a project you want to know which particular area is experiencing fraud, by how much, and which variables are impacting it there, so that you're able to contact the customers and help them accordingly.

Okay, so just to share some key takeaways from building these models at scale. The first is interpretability. We showed you how to interpret this particular model, but if you have quite a lot of features it can be difficult for a fraud analyst, or whoever the business stakeholder is, to interpret why a particular case is fraud; that's a common problem we have. Another problem, and maybe this is a general machine learning problem, is that if there isn't an underlying pattern in your data, the autoencoder can't do anything for you. If that's the case, you could think about building more features that may help you get to a pattern. Then there's maintainability when you build at scale: a big reason why we chose H2O is that we build models at scale with it; if you just want to try it out you can use the Kaggle dataset, but that scalability is also why we chose it. Then there's the difference between k-means and an autoencoder on an anomaly detection problem. We have used k-means before, and the distance between the cluster centroid and an observation does show anomalies, but maintaining that code is harder: when you have to retrain your model with new data, your cluster centers change, your clusters change, and inevitably what you're trying to find as anomalies changes too. With an autoencoder it's much more consistent. Then, with the threshold, this comes down to capacity. You saw we detected 105 anomalies; when you're working at scale with much more data, it might be 10,000 anomalies, and sending 10,000 anomalies to an actual business area to work through might be difficult, so they might not have the capacity to do that. Picking a threshold usually means working with the business area to understand what threshold is best suited for them. And lastly, a feedback loop: over time we'd want to know whether what we picked up as anomalies was actual fraudulent behaviour, and sometimes getting that feedback loop is difficult.
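One hedged way to pick such a capacity-driven cut-off (purely a sketch, not what the speakers did; they chose 0.01 by inspecting the chart) is to take a percentile of the reconstruction-error distribution so that roughly a fixed number of cases get flagged. This reuses `recon_error` from the H2O sketch above; the capacity of 100 cases is an assumed figure.

```python
import numpy as np

# Pull the reconstruction errors from H2O into numpy
# ("Reconstruction.MSE" is the column name H2O's anomaly() output uses)
errors = recon_error.as_data_frame()["Reconstruction.MSE"].values

# Flag at most ~100 cases per scoring run (capacity assumed, not from the talk)
capacity = 100
threshold = np.percentile(errors, 100 * (1 - capacity / len(errors)))
flagged = errors > threshold
print(f"threshold={threshold:.4f}, flagged={flagged.sum()} anomalies")
```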
So, just to sum it up: concerned parent, "If all your friends jumped off a bridge, would you follow them?" Machine learning algorithm: "Yes." Basically, to sum up what we spoke about today: just because a model says something is anomalous, at the end of the day you still need to check whether it makes sense for the use case and makes sense to the stakeholders. Don't just listen to the machine learning algorithm; there's still more to it. You still need to bring in the human side and make sure it makes sense to the business. Thank you so much for listening.

Thanks, ladies, the presentation was great. With regard to the reconstruction error: have you experienced any cases where you've got really high variance in the range of your reconstruction errors, and if so, what approaches have you taken to scale those, or have you worked with them just as they are? So let's say you've got a reconstruction error of 0.05 on one observation, and then on another observation a reconstruction error of, I don't know, a hundred. Have you experienced that, and if you have, have you dealt with any sort of normalization or standardization step?

The reconstruction error that is higher, to us, is what we're looking for; it's the anomaly we're trying to find. So no, we haven't dealt with that, even with the real cases that we've worked on. But I think it goes back to those different architectures: if your generic reconstruction error isn't finding the pattern and there's still quite a lot of noise, then maybe try using the sparse one. Okay, thanks. [Applause]
Info
Channel: PyCon South Africa
Views: 4,420
Keywords: python, PyCon, PyConZA
Id: Alkm-PJu9To
Length: 27min 33sec (1653 seconds)
Published: Thu Feb 13 2020