Cybersecurity and AI with Ashrith Barthur

Captions
Hi everyone, I'm Ashrith, and I'm a security scientist at H2O. I think the first question might be, what's a security scientist doing at H2O? The answer is that we're trying to build things above the basic stack of H2O itself, and that's where the business applications come into the picture. This fits into one of the microwave architectures.

So what are the few problems that we look at when it comes to network security? Malicious threats, from insiders and from external people. We have distributed denial-of-service attacks, we have data loss, and we also have user behavior analytics.

You have to understand that most of the network security field is very paranoid, so they run with rule-based algorithms. They don't want to miss a single incident, which is one of the reasons they refrain from using machine learning for their algorithms, and that is something I'm here to change.

So what we see here is that we use rule-based algorithms, and we have a lot of experts who analyze data, who tell us what is right and what is wrong. They investigate different kinds of situations, and then they come back and say whether something was a positive or not. Now, how can we change that? That's a big question. One of the reasons we need to change this is that it takes time and the process is slow, but it is still justified, because we don't want to miss any of the incidents that might have happened, as I said earlier.

I'll be speaking about two use cases here, and that will give you a perspective on how we are trying to fit machine learning into the process. Again, large amounts of time and manpower are required and the process is very slow, and this is a problem when you have lots of data coming from different
security incidents, because you have to go through all of it in a short period of time and come back and say whether an incident actually happened or was a false positive. And there are a very limited number of people in this field. Frankly, there are a lot of professionals who operate on data, but there are a limited number of people who can understand every incident in this field, and that's why the whole machine learning process comes into the system.

So, a simple difference. Most people assume that identifying behavior in the field of network security is just identifying outliers. It's not exactly the same thing. Most of what we do is identifying anomalous behavior. An anomalous behavior usually does not have a precedent. It could be a behavior that exists within your normal behavior, it could be a behavior that exists with the outliers, it could be anywhere across the chart, but it's not exactly identifiable. An outlier, although it has very low probability, does exist. That's the basic difference between outliers and anomalies when we look at network security incidents.

So identifying anomalous behaviors is quite difficult. You have to model your data in a way that lets you identify these behaviors. How do we go about doing this? You first create something called a context, and the context is the primary idea under which you can identify different kinds of behaviors. A context is a scope or a framework under which you start analyzing your data.

Let's take a simple example of people coming through turnstiles at a gate. If someone who is not expected to come into a building happens to be walking in, how would you identify that that person walked in and was not allowed in the
building? That's why you create a context: what time did he come in, did he come in on a holiday? It looks very much like a rule-based system, but it is very flexible, and I'll tell you why very soon. As I say, it looks a little bit rule-based, but you do use the same system to make this effort.

So let's take a simple example. I know there's a lot of text, but I will take you through this very quickly. Let's assume that the use case you're working on is a Windows login use case, where all you have is the users' successful login times. Could we identify which user is malicious and which user is not? To do that we have to create a context, and the context goes like this. In Windows there are different kinds of users: you have system users, you have administrators, and you have actual system accounts, which perform different kinds of things. That itself gives you an idea of how to separate your data. The moment you start separating your data based on the kind of user, the data you're analyzing for these different groups is going to be different.

The next thing is that Windows logons happen in different kinds: you have physical logins, you have network-based logins, you also have remote logins, and you have terminal-based logins. That too adds to the context that you create. The interactions between users and these different kinds of logins create the context in which you want to identify and recognize which part of the behavior is anomalous, and these interactions are what we study.

So here, this is one of the use cases that we were working on very early on, the Windows use case, trying to identify anomalous behavior. And here,
if you see, you see very simply three clusters. This was very early on in the work that we did. When we break the data down, we see that this is an administrative user, that's a system user, and the final one is, sorry, system accounts. Each of these data blobs, or clusters, gives us the concept of the context that I'm speaking about. You divide the data in a certain way, and then you analyze the data in that family, or in that context. If you were to add the type of login on top of this, you would see the data divide itself further, forming different clusters within these three user groups.

Now, the problem with this is that with all the algorithms that we develop, with everything that we do, we can only predict to a certain extent that a certain login could be malicious. We can't say with 100% accuracy that it is malicious. That is a problem, because what we're telling the business users is that we do not know exactly whether this is bad or not; we can only predict with a certain probability that it is bad. That's not a good thing, specifically when you're dealing with security people, because they want to know whether an incident actually happened or not.

For this, if I go back to the slides a little, sorry, as I said here, we speak about experts and professionals. We use these experts and professionals to help us understand the data. Let me skip forward again. Yeah. So what we do here is we shortlist the data that they need to analyze. We use their help to understand which processes are valid to be identified as malicious and which are not, and with their investigation we get to know whether something that we identified was malicious or was it just a false
positive.

And so, with most of the work that these people do, if I go back to what I said, your contexts are very flexible, in the sense that you can change your context or modify it. This primarily happens when different families that you have identified, in which you're building contexts, turn out to be homogeneous. For example, a user, when he logs on remotely or through a network system, could behave in a similar way, so you could merge those two contexts into one. You can also have different thresholds for different contexts, which tells us that a certain behavior is normal in one context but not normal in another. So this process of supervision from the experts helps us understand how to vary these parameters and how to come to a conclusion.

This is one of the consoles that we are using as an information system for the experts and professionals. We provide them with information about different things that we feel they need to be alerted to. We identify different situations and we say, hey, here's an alert, here's a situation that's happening, can you investigate and tell us more about this? This is a system that Tony's team, I think Tony's right there, yeah, Tony's team is designing for us. So that's something that might be interesting to look at.

So where does this lead us? After we've done the whole supervision process, one of the spots that we end up at is, as I said earlier, lots of data. We have multiple logs, and we have lots of data that we need to analyze and build some kind of correlation across. That correlation is very important for us, primarily because most of this data that exists,
most of the log data that comes out, is usually not strong enough for you to identify behaviors independently. When you collate the logs, usually across time, you can figure out what kind of event is happening, and that helps you identify incidents even better.

I'll give an example. Let's say you had a user login, okay, and I'm going with a login again, but this is a slightly different use case. Let's say you had a user login which started off with multiple failures, and then you had one successful login. Then the machine that this user connects to attempts to connect to a database server. Then you see a request made for the data to be dumped out of the machine, and then the data gets moved back into the machine.

If you look at these events individually, you would see multiple login attempts and one final successful login. We all do that. Everyone forgets the password. Every 180 days we forget a password, we try our old password, it never works, but we get in finally. That's the usual incident. The next thing you see is the connection to a database. We all work with data, so a connection to a database is normal, it's not a big deal. And then the data dump: let's say you're creating a new table and you're just moving the data to a new table, perfectly fine. And you're also drawing the data down from your database to your local machine. I'm sure quite a few of us do that while we are trying to analyze data, provided it's not sensitive data.

If you look at these incidents separately, you would see that there is nothing wrong, everything is fine. But when you put all of them together, that is when it starts to make sense. That is when you realize that, oh, this is an attack, this is probably an attack. And that is what we are trying to get to: we're trying to
create this kind of intelligence, which we are able to capture and say that if you look at all these logs, all this information that comes by, in such a way that you can figure out what is happening on your network, then you can be pretty sure of identifying a malicious incident that might happen on your network. That's exactly what we are going after and trying to design.

So what do we learn here? As I said earlier, anomalous behavior can be very well embedded in your natural behavior. This correlation helps us identify these anomalous behaviors. Using this kind of log event correlation, we can identify these behaviors by observing the combination of events in the larger context, and that's when we figure out that, oh, this is anomalous. That is the primary way an anomalous behavior is identified. It's not just outlier detection, is what I say again.

So, just trying to summarize what I've spoken about: what are you trying to find here? You're trying to identify the right context in which to identify anomalous behavior. One of the reasons anomalous behavior is interesting is that most of the hacks we see these days are not necessarily the ones that people have tried. It's not the kind of people who take a script, run it on a computer, and see if they can connect to your machine and download whatever they want. No, it's organized people who are well funded, who know how to break into your machines, and who do it very quietly and really well, I must say. Identifying how we can correlate logs is another important thing that we've learned. And if you can transform anomalous behavior into some kind of statistical model, you can
identify it, and that's a good thing. You also have the blessings of the experts, so that adds value as well.

So yeah, this is probably quite descriptive of what I wanted to say. I do want to thank everyone who's come by, all the support people, the open source members of H2O, the H2O people themselves, and finally our clients. I really appreciate everyone. Thanks for making this happen.

Before I close, these are three people I work with. This is Mike Chang, he's literally a ninja. This is Ivy, she's the one who designs the interface. And that's Fonda, she's the one who helps us understand our entire requirements and stuff like that. So, thanks to the entire team out here. Any questions? Anything about a big wave? Thanks.
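
Editor's sketch: the log event correlation example from the talk (several failed logins, one successful login, a database connection, then a data dump) can be illustrated with a small per-user state machine. This is only a minimal sketch under stated assumptions, not the method H2O used. The event kinds, the `Event` record, the function name `find_suspicious_chains`, the threshold of three failures, and the one-hour window are all hypothetical and do not come from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical event kinds; real logs would first be parsed into this shape.
FAILED_LOGIN = "failed_login"
LOGIN = "login"
DB_CONNECT = "db_connect"
DATA_DUMP = "data_dump"


@dataclass(frozen=True)
class Event:
    time: datetime
    user: str
    kind: str


def find_suspicious_chains(events, min_failures=3, window=timedelta(hours=1)):
    """Return (user, start, end) for each user whose events contain, in order
    and within `window` of the first failure: at least `min_failures` failed
    logins, a successful login, a database connection, and a data dump."""
    by_user = defaultdict(list)
    for ev in events:
        by_user[ev.user].append(ev)

    chains = []
    for user, evs in by_user.items():
        evs.sort(key=lambda e: e.time)
        failures = []  # times of recent failed logins
        start = None   # first failure time of the chain in progress
        stage = 0      # 0: need login, 1: need db connect, 2: need dump
        for ev in evs:
            # Abandon a chain in progress once it outlives the window.
            if start is not None and ev.time - start > window:
                start, stage = None, 0
            if ev.kind == FAILED_LOGIN:
                failures = [t for t in failures if ev.time - t <= window]
                failures.append(ev.time)
            elif ev.kind == LOGIN:
                recent = [t for t in failures if ev.time - t <= window]
                if len(recent) >= min_failures:
                    start, stage = recent[0], 1
                failures = []
            elif ev.kind == DB_CONNECT and stage == 1:
                stage = 2
            elif ev.kind == DATA_DUMP and stage == 2:
                chains.append((user, start, ev.time))
                start, stage = None, 0
    return chains
```

Each event on its own passes unnoticed, which matches the point made in the talk: only the ordered combination within a time window is reported. In practice the threshold and window would presumably differ per context (user type and logon type), in line with the per-context thresholds described earlier.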
Info
Channel: H2O.ai
Views: 8,001
Rating: 4.4936709 out of 5
Keywords: H2O Open Tour NYC
Id: nUNmcfD4TzQ
Length: 16min 58sec (1018 seconds)
Published: Fri Aug 12 2016