Introduction to Neo4j and Graph Databases

Captions
Okay, well, let's go ahead and get started. I want to welcome everybody to our next edition of Data Club; this is our first one in the new year. My name is Eric Olsen. I'm on the Core OS, or basically the DEP, a developer platform team within COSINE, which is part of Azure.

If you haven't been to Data Club before, just a reminder of what we're here for. Basically these are training and sharing sessions for data. I'll say data science, but it also includes data engineering related work. There are a number of different formats we use, which you've probably seen if you've been here before, but this is our place for collaborative learning: our chance to share with each other and actually learn from what other people are doing and the types of problems they're working on. These are basically training sessions or information sharing sessions to help grow the capability of the organization, or a lot of times just to make people aware of what's out there if you're not aware. It also gives you an opportunity to build connections with others in the data community who might be working on similar problems, or where you might find help; essentially, it's a networking opportunity.

So our first rule of Data Club is to talk about Data Club, and to participate and learn together. That's how we leverage our learning and get better. If you see something you like here, be sure to share it with those that couldn't attend, or remind them there's a recording that they can get afterwards and check out.

And then finally, looking ahead: we have these sessions about every two weeks, so our next session is on February 7th. Tentatively, right now, we're going to be talking about Azure Ignite impact, so some of the data analysis around Ignite. Then on the 21st we'll be revisiting the business intelligence craft [unclear] to get some updates on what's going on there in the commercial space. If you want to keep up with information about these sessions, you can join
the Data Club alias on ITWeb; you'll get all the announcements and all the meeting requests. And if you miss anything I'm talking about here today, don't worry, you can always go to aka.ms/offweb, so you can give us feedback on what you liked, what you didn't like, what you'd like to see, things like that. And with that, I will now turn it over to David to talk about Neo4j.

Hi everybody, thanks for taking the time today. I love talking about graphs, so let's do some of that today. We're going to talk about an introduction to Neo4j and graph databases. Since this is a data science group, I do want, towards the end, to get to talk about graph algorithms and some of the data science applications, but because graphs represent such a different way of thinking about data, we need to go through some basics first.

This is who I am. I'm a partner solution architect at Neo4j. I have a pretty fun job there, because about 50% of it is business and about 50% of it is technical. I do a lot of integrations between Neo4j and some of our strategic partners, Microsoft and Azure being one of those, but I get to play a lot, and I get to play with a lot of new fun stuff from all over the industry, and that makes me happy. You can get me on Twitter or send me an email if you have any questions after this session.

A couple of things that I want to cover in this session. We'll go over an introduction to graphs and Neo4j and what the relationship between them is. We'll talk a little bit about why people are using graphs, and when you would use graphs versus some of the alternatives that you have. We're going to talk about the underlying property graph data model, because you can't really do rigorous analysis or data science unless you understand what's underneath. We'll talk about the Cypher query language and about how we query and manipulate graphs, talk about some data import, and then how we develop applications. We're going to have some demos in here too, so as we go we will have one demo from Ambrose that's about Microsoft
specific stuff, and then I'll show you some other demos to give you kind of a flavor of the toolset around this. We'll also talk about graph algorithms and data science applications.

So if you are one of those technologists who doesn't want to just listen to slides and high-level theoretical stuff, take note of this. If you want to follow along and actually use the software while we're talking, and try out things that I'm talking about on the slides, this is how you do it. We are in the Azure Marketplace, so you can search for Neo4j there and launch a single instance; version 3.5.1 is the latest that's on the Azure catalog. Or, if you know absolutely nothing about Neo4j and this is your very first experience, go to neo4jsandbox.com. It will allow you to launch any of the sandboxes that you see on the right-hand side and get you to a GUI really quickly, so you don't have to think about, you know, what port do I go to, any of that stuff. Sandbox is the fastest way, but there is the Azure way too. On Sandbox you're going to get a free temporary instance that only lasts for about two days, I think, and it starts preloaded with data, so you don't need to know how to load or query or do really anything with an empty database.

So with that out of the way, let's start talking a little bit about graphs and the basics. So why graphs? You know, Kit was saying at the beginning of this session that he got this sticker at our conference, GraphConnect: graphs are everywhere. We truly do see that basically the entire world is a graph and everything is connected. You have people, places, and events that are thoroughly connected with one another; companies and markets; countries, history, politics. There are endless examples of these things. It's very natural and visual for human beings to connect things in their minds as graphs, rather than to use some of the other formalisms that we've become accustomed to. As computer scientists and as technologists, we learn how to think about things in terms of
relations and sets and JSON documents and things like that, but that's an abstraction that we've mapped a more reasonable understanding onto, not how our brains work. Our brains tend to think of these things as connected nodes and edges, whether it is flights going all over the globe or any other use case.

So before we get into the bits about Neo4j, let's talk a little bit about the business side of what people are using Neo4j and graphs for. Our business is primarily focused on the Global 1000, where we have a heavy presence in retail, finance, and a number of other sectors. To give you three concrete examples of how we're transforming large enterprises: we do real-time promotion recommendations for a lot of these big retailers. So, you know, whenever you see that somebody has had these record Cyber Monday sales, part of that is consumers buying more online, and part of it is that the retailers are consistently getting smarter about how to do product recommendations. A lot of that product recommendation stuff is driven by Neo4j behind the scenes, in that we can model the products and who is buying what as a graph, and then we can create social recommendations for users (you know, you might like what your friends have purchased) and any number of other recommendation approaches using Neo4j.

Marriott uses us for real-time pricing, with 300 million pricing operations per day. One of the other things that Marriott has found, and we can talk about this a little bit more in the architecture parts, is that we tend to require a whole lot less hardware. When we get to index-free adjacency and talk about how the database works a little bit, you'll understand why that is. Folks are frequently finding that they can replace large fleets of relational database clusters with fewer instances of a Neo4j graph database when they're really focused on real-time relationships.

And we also work with large postal services for handling package routing in real time. And so if you think
of the Traveling Salesman problem, or a network of roads, as being a graph of sorts, package routing is full of shortest-path type queries: I need to get a package from point A to point B going through the fewest number of logistics hops in the middle. So that's what you would think of as a fundamentally graphy problem.

Just to kind of go through a number of other use cases, we cover them in two categories. In internal applications, we have a lot of folks using us for master data management. They would take the metadata from all of their systems, put it into Neo4j, and draw a lot of correspondences back and forth to say this field in this system is equal to this field in this other system; a metadata catalog, if you will. We get used for network and IT operations, where you have to understand the topology of a network and how all of your infrastructure relates. This is used for things like critical path analysis: which router, if I knocked it down, would knock the whole data center offline? That's an example of another kind of graphy use case. And fraud detection: a lot of our financial clients will put transaction data into the graph and then ask, you know, maybe I sent five thousand here, five thousand there, five thousand to a third location, but if all three of those recipients of the funds are controlled by the same party, then in aggregate I have transferred fifteen thousand to somebody. So it gets used to sniff out financial fraud. In customer-facing applications: real-time recommendations, particularly products for retailers, graph-based search, and identity and access management.

So that's what graphs get used for. What are they? In much the same way as a relational database has a set of tables, rows, columns, and schema, let's talk a little bit about how graphs are structured. Fundamentally it boils down to nodes, relationships, properties, and labels, which we'll go through. A node is simply an object in the graph, and it can be labeled. A label is like a semantic
category, much as you might have an entity in an entity-relationship diagram. So in this particular graph we've got persons and we've got a car. Relationships relate nodes by type and direction. Relationships in Neo4j are always directed; you may traverse them undirected if you want, but they always fundamentally have a direction. A relationship also always has a type, so if you are traversing a relationship you can segment that out and say that you only want to traverse certain kinds of relationships, not just any way that these two nodes are connected. So in this particular case, we can tell from this really simple graph that these two people love each other, they live with one another, one owns the car, but the other drives it regularly.

Properties are basically key-value maps that get associated with nodes and relationships; they can go on either. By adding properties on top of these nodes, we realize that it's Dan and Ann who love each other and live with one another. Dan has a Twitter handle. And when we look at this relationship, DRIVES (Dan drives the car), we can put metadata on the relationship and assert that that's only been since 2011. So basically you can think of a node and a relationship as a property container, where properties are simple key-value maps.

Yes, sir? [Audience: So a relationship is the same as an edge?] Yes. In the graph world, sometimes people refer to vertexes and edges. We tend to talk about nodes and relationships because we find that language is more accessible, but in the math literature you'll see vertexes and edges, and we're talking about the same thing.

Yes? [Audience, partly inaudible: The relationships, so there are two persons, right? Do you have some sort of, like, ontology on top, like between person as entity and...?] Okay, so there are a couple of different ways of going with that question. There is not an... The question, for those who didn't hear it, is: is there some sort of an ontology on the top that specifies what kind of relationships you can have? We're going to
talk about constraints. You can assert constraints that certain kinds of nodes must have certain kinds of relationships; however, those constraints are optional. Neo4j does not have an ontology layered on top of it, and those sorts of schema constraints are optional. Furthermore, the constraint may assert that you have to have a relationship, but you cannot, for example, assert that there could never be a DRIVES relationship between two persons. That would be a different sort of constraint. Does that answer your question for now? We're going to be talking more about constraints. Any other questions before we go on?

Okay, so just a quick summary here. Nodes are entities with complex value types. Relationships connect them and structure the domain. Properties are basically these key-value pairs; they tend to express metadata about your nodes and your logical entities in your domain. And labels group nodes by role. So usually you think of labels as the entities in an ERD, nodes you would think of as the instances, the rows in a table, and relationships can be thought of as joins, which we'll go into in much more depth.

Yes? That is true. So the question, for those who can't hear it online, is: how do labels differ from properties? Labels are an optimized, indexed way of scanning to a particular subset. You actually could do it either way: you could have nodes with no label at all, and then you could have a property, let's call it type, and then say a node with type equals person. When we get to how the Cypher query language works, labels are a lot more intuitive to use in terms of structuring your domain, and they're also more performant underneath in terms of how the database is implemented. But you could do it either way. If you came onto our community forums and asked that question, we would probably say you can do it both ways, but please use labels, for lots of reasons.

Okay. One of the really cool things about this graph model
that we're talking about is a property that we would call whiteboard friendliness. When our field engineers go out and work with customers, frequently the customers have not been exposed to graphs before, and they don't really know how to approach working with graphs and modeling their data. So, just in a very human, open way, we get a whiteboard out, we get some markers, and we say, "So tell me a little bit about your domain." You've got customers, so you draw a little something on the board. And you say, "Oh, they buy products," so let's draw another little circle called product and then create a link between them. So you have this elicitation session, if you will, where you're trying to get them to talk about their domain, [unclear] to them, what the data means, and you draw that out.

To give you a simple example, if we were talking about movies and actors, you might end up with a whiteboarding session like this. So you've got Tom Hanks, who acted in Cloud Atlas. Hugo Weaving was also in Cloud Atlas. But, you know, I don't know if you guys like The Matrix; at Neo4j we love The Matrix. Okay, it's in the name, people. All right, we love The Matrix. Anyways, so Hugo Weaving was Agent Smith, he was in The Matrix, and Lana Wachowski happened to direct both of those movies. So you can elicit this information from users and get this really rough whiteboard sketch going, and then it's like, wow, look at that, that's a graph. We've got a node called Tom Hanks, who acted in Cloud Atlas, and so on and so forth. That's how simple the translation was; we literally just applied this on top of it.

Then what we're going to do is slap on some labels, and we're going to propertify, if you will. What do we care about in people? What do we care about in these movies? So a person who is an actor has a name and maybe a birthdate. When we say that they acted in the movie, it's probably important to know what role they played, so we'll give that a role property. Cloud Atlas was definitely a movie, and we're gonna want to
track what year it was released in, and so on and so forth. So when we say whiteboard friendliness, this is what we're talking about: going from "I understand my domain inside of my own head," through an elicitation session, to a rough model that we can query really quickly. Now, I don't know about you, but I worked with relational databases for years and years and years, and it's very easy to get bogged down in these conversations about whether we should be in third normal form or not, and what the data is and how we think of it often get very radically separated from one another.

Okay, so this is a screenshot of what the result of this is going to look like as concrete data inside of Neo4j. When we get to the demo, you'll get to see these springy, cool graphs moving around.

So we've talked a lot about graphs and the property graph data model, and that brings us to Neo4j itself. Hopefully at this point in the presentation it's not gonna come as a big shock or surprise, but Neo4j is a graph database. A couple of properties of Neo4j: we support strong ACID transactions, so we are not an eventually consistent database; you get strong ACID guarantees. It's very, very fast, and I'm not going to ask you to accept that as a marketing claim. We're going to talk about index-free adjacency, and I'll be able to tell you in terms of data structures why that is. We can get two to four million operations per second per core. It comes with both binary and HTTP protocols that have drivers for a lot of different languages; we'll cover that. We have a clustering approach that provides for high availability, so you can have multi-node clusters, and you can survive the failure of multiple nodes in your cluster and still retain those strong ACID guarantees and stay in operation. And no size limit.

Yes? [Audience, partly inaudible: So it's a VM that you take and put into your... how do you actually... is it a service?] Yeah, essentially you need to install it and take care
of it yourself. It is not available as a managed service at this time; that's something that we're actively working on. So it's provided as a VM-based deploy, and yes, you take care of the VMs once it is launched. You can of course create your own VM and then install it, much as you would any other software package, but I wouldn't recommend doing that. We provide it all in the Azure Marketplace, and it's way faster to just launch the version that we offer, which is already configured nicely and so on and so forth.

Okay, so it's a native graph database. It's schema-free. Schema-free is a little bit misleading; it's kind of schema-optional. We have many schema constructs, but it's not required that you use them. Let's see: it gives you a really nice developer workbench that you'll see, and one of our superpowers relative to other graph databases is the Cypher query language. We're going to talk a lot about that and why that is so important. Let's talk about it right now.

Yes, sure? [Audience, partly inaudible: How does Neo4j compare with the graph capability that was recently introduced in SQL Server, and with Cosmos DB, which also supports graph using a different query language?] You know, I have a very specific answer for that, but it's coming a little bit later. Can I park that question and return to it? Okay. Because that kind of gets into... I mean, the really short answer is that if you have a graph abstraction, you can kind of sort of do graphs either way, but the underlying implementation matters a lot in terms of your performance and scalability expectations. I hope to talk a little bit about how that's implemented under the covers, and then when we talk about how graphs work on top of SQL Server, you'll see some clear differences. I'm sorry, I don't mean to dodge the question; it's just that if we haven't yet talked about Cypher, I'm going to give too much information too soon.

All right, so Cypher is a pattern-matching query language made for graphs.
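As a rough illustration of what "pattern matching over a property graph" means, here is a minimal sketch in plain Python of the Dan, Ann, and car example from earlier in the talk. This is not Neo4j's API or how Neo4j stores data: the names `Node`, `Relationship`, `node_matches`, and `match_pattern` are hypothetical, the relationship type names are assumed spellings, and a real database would use indexes and a query planner rather than a linear scan over a list.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                  # e.g. {"Person"}
    props: dict = field(default_factory=dict)    # key-value property map


@dataclass
class Relationship:
    type: str                                    # every relationship has exactly one type
    start: Node                                  # relationships are always directed
    end: Node
    props: dict = field(default_factory=dict)    # relationships carry properties too


def node_matches(node, label, props):
    """True if the node has the label and every requested property value."""
    return label in node.labels and all(node.props.get(k) == v for k, v in props.items())


def match_pattern(rels, start_label, start_props, rel_type, end_label):
    """Find end nodes for the pattern (start)-[:rel_type]->(end) by scanning all relationships."""
    return [
        r.end
        for r in rels
        if r.type == rel_type
        and node_matches(r.start, start_label, start_props)
        and node_matches(r.end, end_label, {})
    ]


# The example graph from the talk: two persons and a car.
dan = Node({"Person"}, {"name": "Dan"})
ann = Node({"Person"}, {"name": "Ann"})
car = Node({"Car"})

rels = [
    Relationship("LOVES", dan, ann),
    Relationship("LOVES", ann, dan),
    Relationship("LIVES_WITH", dan, ann),
    Relationship("OWNS", ann, car),
    Relationship("DRIVES", dan, car, {"since": 2011}),
]

# "A person named Dan loves whom?"
whom = match_pattern(rels, "Person", {"name": "Dan"}, "LOVES", "Person")
print([n.props["name"] for n in whom])
```

The point of the sketch is only that a query is a small shape (labelled node, typed and directed relationship, labelled node) and the answer is whatever in the graph fits that shape.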
Now, I'm a big database geek. Neo4j is not the first database that I've ever worked with; I kind of love them all for different reasons. And one of the things that I'm completely unreasonable about at this point in my career is that I have to have a declarative query language. I do not want to write code that goes and tells the database how to fetch data. I want a declarative language where I express what I want, and then it's the database's job to go figure out what's the best query execution plan to do that. Now, if you guys have been using SQL forever, and most people have, you're used to this: you just express what you want, and you don't think about which index gets used first or anything like that. This is an extremely powerful thing, and yet some of the newer NoSQL databases have trained us to make do with less than that. So this is the point where I'm going to be unreasonable: you need a declarative query language if you're going to work on a serious database, and Cypher is that for graphs.

Did you have a question online? [Moderator: Just to let you know, there's a little delay, so this may be a little bit delayed. Someone (name unclear) is asking: does Neo4j support graph streams?] I would ask the questioner to clarify that. So, yes, in the sense that you can ask a query and the result can be a stream of things that you process as it comes back, but I'm not sure if I'm getting at what the questioner is asking.

So Cypher is a pattern-matching query language made for graphs. It's declarative; hopefully I've already convinced you that that's a really good thing. It's expressive, and it's focused on pattern matching. Now, if you remember the whiteboard friendliness point, you can probably follow why pattern matching is important: we want to be able to write a query fluently, as we think about how the data is structured. So here's a pattern in our graph model. We've got Dan loves Ann: two nodes and a relationship. What does that look like in Cypher? It looks like this: a person named
Dan loves a person named Ann. I mean, you can read it from left to right, and it almost looks like the actual pattern in the graph. So this colon-Person is how we tell Cypher we're talking about a label, and the brackets are how we talk about that property map that we wanted. So: name equals Dan, loves, person Ann. Labels and properties. You'll notice that what's in the round brackets (parentheses, if you're American) are nodes.

So when we ask to create a pattern, we can do the same thing with labels, properties, and relationships. We can create an entire pattern in the database just by visually describing it and saying, "Hey, go make me one of those." We can also match, and we can create these variables. So: a person named Dan loves whom? Return whom. You guys probably don't have a whole lot of Cypher experience, but everybody ought to be able to tell me what that query does. All right, great. So we've got two nodes, and in the second case we're matching a variable to the second node, and then we're just returning whatever that variable is bound to as a result of what's in the database.

So let's look at a social recommendation query example. This is our VP of product, his name is Philip, and here's one of our product managers, named Andreas, and this guy, he's amazing, I hope you run into him someday, his name is Michael Hunger. So these guys are friends, and they like certain sushi restaurants. iSushi serves sushi, Zushi Zam serves sushi, and they're both located in New York. Philip Rathle here finds himself searching for "sushi restaurants in New York that my friends like." Frequently, a lot of these social recommendation type questions can be phrased as a graph path, and how you would answer that in Cypher would look like this. I'm looking for a person who is a friend of somebody else; that friend likes a restaurant; that restaurant is located in a certain location, and it serves a certain type of cuisine. And the variable bindings that
our user has given us are that the person's name is Philip, we're talking about New York, and he's interested in sushi. So these graph patterns, with a couple of variables thrown in, get used to drive the social recommendation.

Okay, so before the colon is the name of a variable that's being bound; after the colon is the label. Yes? I mean, that's the label on the node, right. So we talked a little bit earlier about how nodes can have a label, and that's sort of like the semantic category of information it is. So this node represents a person, and whoever wrote the query chose the variable name "person," so it looks a little bit misleading. But the lowercase "person" is the name of a variable, and then the one with the uppercase P is the label. Basically what we're saying is that whatever gets matched to this variable must be labeled Person. Does that make sense?

[Audience: And if the node has multiple labels?] Then it will match. If it has two, it will match too; if it's labeled both Person and Enemy, it's still a person. [Audience: And those are properties?] What was that? Yes, they're properties. So when the data was created, this node got created with that property, and basically this is placing a constraint that the only persons who can match are those having a name property whose value is Philip.

[Moderator: A clarification, if you want to jump back, going back to the question of whether it supports streams.] Uh-huh. [Moderator: They're asking basically about streams of small graphs, perhaps like the syntactic structure of sentences, or the graph that comes from Twitter, for instance, and a use case would be extracting significant patterns in real time.] Ooh, okay. Yeah, there are so many ways I could go with that question. Google "Neo4j and NLP." I wrote this long Medium article about natural language processing with Neo4j, and that one link, which I can't go into right now for time reasons, has a lot of information on this topic. On the streaming thing, the person can also google "Neo4j streams," and there's a
Kafka integration that talks about producing transactions as a stream, or consuming streams from Kafka and putting them into a graph. So hopefully that's going to help without going too deep.

Okay, so in our earlier example we had a really tiny graph. Now imagine this happens in a super massive graph, and you have hundreds and thousands of friends. Basically what these queries are doing is finding the best starting points and then traversing through the graph from those starting points. Declarative query for graphs.

Sometimes our developer relations people find these things that people said on Twitter that are particularly emblematic, like we couldn't have said it better if we had our own marketing people do it, so they capture these things and keep them. "What I learned in Neo4j training today is that you draw ASCII art to code." So how true is that? Nodes are drawn with parentheses. Relationships are drawn with arrows, with additional details in brackets. For patterns, you connect nodes and relationships with hyphens, optionally specifying direction. Now, you'll notice this is a relationship going one way, this is the same going the opposite way, and you can traverse it undirected: this is saying, you know, either way; it'll match either way.

The components of a Cypher query basically look a lot like SQL, just with adaptations. So MATCH and RETURN are keywords, m is a variable, and Movie is a node label. We actually covered that just a moment ago, about how to tell what's the variable and what is the type of information you're trying to match. In this particular query, p, r, and m are variables. Notice that we can bind relationships too, and we can return them as first-class types, and we can specify that a relationship we want to traverse must be the ACTED_IN relationship. So yeah, this is pretty straightforward. The only addition here is that sometimes what we want to match is not the node and not the relationship but the path itself. So in this case
what we're doing is drawing a pattern, assigning that to a path, and then returning the path. We have a host of built-in functions that allow you to manipulate paths, so for example you can ask how long it is, you can ask which node is in the third position, and so on and so forth.

So, graph versus tabular results. If you do MATCH (m:Movie) RETURN m, basically what you're going to get back is a node. If you do RETURN m.title, m.released, you're going to get a data square, and it's going to be a table just like any other. Properties get accessed by saying variable dot property name. So in this way you can return graph components (paths, relationships, nodes), or you can just return tables of information, much as you would with SQL. Not terribly interesting; I'm moving quickly through this.

Cypher keywords are always case insensitive, and node labels, relationship types, and property keys are always case sensitive. So, you know, the MATCH on the right with funky capitalization is fine, and ACTED_IN is always strictly all uppercase with an underscore in between, no exceptions.

So, aggregates in Cypher. They're a little bit different: we never need to specify a grouping key. In SQL you have this GROUP BY concept; that does not exist in Cypher. We always group by any non-aggregate keys in the RETURN statement. So if, for example, you did this, "give me all the movies that this person acted in," you're going to aggregate that by the individual actor's name. Notice there's no GROUP BY statement here. That's a thing we want to pull out: this is something that often trips up people coming from SQL and going to Cypher. They're like, "How do I do GROUP BY?" and the answer is, you don't. There's a bunch of different aggregate functions. At the very back of the presentation, at the end, I'm going to give a lot of different links and resources. There's a thing out there called the Cypher Refcard; if you google "Cypher Refcard," it is kind of like the one-page cheat sheet of everything you could
possibly want to know about Cypher. It's the 90% solution to most of my problems when I'm working with Cypher.

Let's talk a little bit about constraints and indexes now. Neo4j doesn't have formal schemas as such, but we do support a lot of different kinds of constraints and indexes. We can create unique constraints; basically these allow really fast lookup of nodes that match by properties, and this is how you would do that, in fairly straightforward English: create constraint on a label, assert that a certain property is unique. So in much the same way as you create a primary key, this is how you would do roughly the equivalent in Cypher. Oh, by the way, that's unique with respect to this label; it's not globally unique in the database. Constraints are always bound to a certain label.

So there are three kinds of constraints. You have the unique node property constraint. You have the node property existence constraint: for example, if we want to create a person, we want to always ensure, for data quality, that they have a name; you can't ever have a person without a name. And we can create relationship property existence constraints, saying, for example, don't create a company record in our database unless you know who the CEO is. So company is controlled by CEO; a company can't exist without a CEO.

In general, indexes allow fast lookup of nodes, just as they do in other databases. You can create an index like this. This places no particular constraint on the values, but it drastically increases the selectivity of queries when they execute, and this is how, in declarative languages, you hint to the database how it's going to build a plan and how it's going to execute a query efficiently. These predicates all use indexes. When you create those inside of Neo4j, too, we have a way of backing indexes differently, so you can create indexes backed by Lucene or backed by our native implementation, and there are some other options as well. So you have some
flexibility with your data types: if you know more about your data type, you can choose a non-standard index type and improve performance.

Indexes are only used for finding the starting points for queries. You'll find this is really a pattern with graph queries overall: fundamentally, we're not scanning through millions or billions of records and trying to filter them; rather, we're trying to identify starting points and then traverse out from them. We talked about index-free adjacency: the operation of traversing a relationship is fundamentally very cheap and very fast in Neo4j, and that's why you do it this way. In relational, you use index scans to look up the rows and join; in graph, you use them just to find the starting points, and then you traverse.

Okay, one last tricky thing about Cypher I want to talk about before we move on is MERGE. How many folks are familiar with the upsert operation? MERGE is kind of like the equivalent of upsert: when you merge, it is "create if it does not exist". So say we MERGE (p:Person {name: 'Tom Hanks', oscar: true}). If there is not a person node with name Tom Hanks and oscar true in the graph, but there is somebody whose name is Tom Hanks, what do you think is going to happen here? It's going to create the node. If you took off the oscar: true, it would match the existing one, because MERGE matches entirely on what is in the MERGE statement: if that exists, it does nothing; if it does not exist in its entirety, it gets created as such. So one of the biggest stumbling points with Cypher is that somebody runs a query like this when they already had a Tom Hanks, and now they have two Tom Hankses.

All right, quickly, some write queries. CREATE is pretty much as straightforward as it gets: we're going to create Mystic River, 2003. What we can also do, if we wanted to modify that but we didn't actually want
to create the record, is MATCH it and then SET its tagline to be this famous quote from the movie. I think I've got some co-workers who are real nuts for Mystic River; I was lobbying for a Matrix example, but they went with Mystic River.

So what if we wanted to create a relationship between two existing nodes? Well, we would MATCH them both (we've got Kevin Bacon, we've got Mystic River) and then we would simply CREATE a relationship between them. You'll notice that on either end of that relationship we're using a variable which is already bound to something, so we're not saying "create a node". That results in the creation of only the relationship, and what we return from that is p, r, m: all three components.

So we've got the MERGE person Tom Hanks example, with just the name versus with oscar equals true. Suppose you wanted to make sure Tom Hanks got an Oscar, but you didn't know whether he already existed or not. Then you would MERGE just Tom Hanks, guaranteeing that it would not create one if he already exists, and then you would SET p.oscar = true. This is the way you can get only the one Tom Hanks and also modify him at the same time.

Yes? In that scenario, if a Tom Hanks is already in the database, will this create a second Tom Hanks node?

No, it would not. In the MERGE it would not create it, because there's one already existing. That existing one would be bound to the variable p, and then in the next clause p's oscar property would be set to the value true. This is shown specifically to illustrate the difference in semantics between MERGE and CREATE. The bottom line is that MERGE checks everything that you give it, so you want to MERGE only on your key values, if that makes sense, and then SET anything else. In this way you can do what you want to do. Okay, that's kind of what I'm
trying to illustrate. So this is the first MERGE and this is the second MERGE. In this MERGE, what we're telling Cypher is: go find me a person who has name Tom Hanks and who has oscar true; if such a thing does not exist, create it. That'll work, that'll always work. But if there is just a Tom Hanks who does not have an oscar property, then you'll end up with two Tom Hankses in your database, even with MERGE, because you specified that you were only looking for a Tom Hanks where oscar is true. Make sense? If you do it this way, it will look to see if there's any Tom Hanks, irrespective of whether or not he has an oscar property. It'll find that Tom Hanks, and then it only ensures that, whoever that guy is, he's got an Oscar. In this scenario you will end up with one Tom Hanks. I point this out because it's a common stumbling block about MERGE semantics that is usually pretty easy to explain, but folks need to know how it works.

MERGE also has two other options: ON CREATE and ON MATCH. For example, if it mattered to you whether he was new or not, you could say ON CREATE, give him a timestamp of when he was created, and specify that, as of his time of creation, he's never been updated. But if you actually matched him, then you don't want to update the created timestamp, because you didn't create him just now; instead you want to increment his updated counter. In this way you can tell whether the MERGE gave you something that already existed or whether it created something new.

Okay, before we go on to data import, any other broad questions about Cypher? Yes? Can you create relationships and nodes at the same time? Absolutely. If you say CREATE and then you give it a pattern that has any number of nodes and relationships, it's going to create all of them at the same time, in the same transaction. Yes, sir? So a mini example
of saying person loves movie, and return. Suppose the label for a person matches multiple nodes, and maybe the same movie name occurs several times. In this case, what will it return? Suppose there were, say, three Tom Hankses or three Mystic Rivers, all three with the same name.

Yeah, in this case your variable would be bound to three different instances, and then if, for example, you set a property on this variable, you'd be setting it across all three.

And when the query just returns it, is it the same thing?

If you just did MATCH a person named Tom Hanks and there were actually three, then that variable would be bound to three different nodes. So if, in the next clause, you then set the property oscar equals true, it's going to get set on all three.

Okay, I see. So for example, with person loves movie, if the movie name shows up seven times and the person name matches, say, five nodes, then there will be five times seven results returned, because each pair of them will form this love relationship?

Yes, that's possible, depending on what the data in your graph is, but we'd need to get to a more particular example there. It sounds like what you're asking about is a Cartesian product.

I mean, because you're talking about rows, there are many ways to organize the results. So I'm asking, which way do you choose to organize the results?

I'm not trying to dodge the question, but it just depends on what you want out of the database.

So the problem is, does it need to infer what I want? I'm going to say person loves something...

Oh, there's no inference happening. The database is going to give you exactly what you asked for, but you need some practice with Cypher to specify precisely what you're looking for. So usually that's where you kind of go
back to what I said earlier, which is: you identify the starting points in your graph, like the person named Tom Hanks, and then you traverse out from that. It's possible to create Cartesian products using Cypher, but you don't usually want to do that.

But you can't really avoid ambiguity in the data. For example, the same name will show up a lot of the time, whether it's people or movies. So in this case, as a precise language, how do you deal with this situation?

What I can say is I've worked with Cypher for a couple of years and I have not run into a situation where I could not avoid ambiguity. I would welcome a concrete example of that, and then maybe we could work through how to reformulate the query. Now, it is possible to express ambiguous queries, but I believe that's true of most query languages, and I think we would really focus on tuning the data model and tuning the query to get to specifics. I'm not saying it's not possible; I'm just saying, if that's difficult, don't do it that way.

Yes? Going back to your previous Tom Hanks example: what happens if you get more than one match, not that it didn't exist, but that you got more than one? Is that with MATCH or with MERGE?

Okay, so if you use the MATCH keyword, then the variable gets bound to as many of them as there are. If you use the MERGE keyword, you're basically saying "get or create", and so if you say "get or create", it's going to end up being one.

So if it actually found multiple Tom Hankses... right, okay, that's it.

Yes, sir? Here the names are being used as the key. Is there a way to enforce uniqueness of identity?

Yes, absolutely, we covered that a bit earlier. If I can back up to uniqueness property
constraints, there you would say... where is it... CREATE CONSTRAINT ON person, ASSERT label.name IS UNIQUE, and in this way, if you attempted to insert a second Tom Hanks, it would fail. It's really the same as you'd do it in many other databases.

All right, moving forward: data import. There are so many different options for loading data into Neo4j that I can't really cover them all, so we chose to focus on one of the most common and simplest, the one people use most frequently, called LOAD CSV. It lets you take a CSV file from an HTTP or file URL, gives you a stream of records, and pipes that stream of records into a subsequent Cypher query, which you can use to create and update graph structures. It gives you transactional operations, so whenever you do LOAD CSV it's happening in the context of a big transaction. You can transform and convert things as you go, and it's the primary way people insert data into Neo4j when they're getting started. It works up to, say, ten million or so nodes and relationships. Because it's transactionally bound, if you wanted to put, let's say, 20 gigabytes of data into a graph, you wouldn't want the overhead of transactions, and you would use a separate tool called neo4j-import. But for simplicity, we want to talk about LOAD CSV as a simple way to get started.

Let's take an example, sticking with the movies theme. We have a simple movies CSV file; hopefully this is pretty self-explanatory: title, release, and tagline. We have a people CSV file that just gives us a name and when they were born. We have some actors; you'll notice the columns are movie, roles, and person, so this is implicitly telling us about an edge in our graph, a relationship: this person was in this movie. And then you have some directors. So, recalling our data model, a person acts in a movie; a person can also direct a movie, because remember they can be an actor or
director. So for every record in the file, we want to either create a node for a person or a movie, or find a start and end node and then create a relationship between them. We covered how that works in the basic sections earlier, and that's the creation of the relationship at the bottom.

So how do we actually read the CSV with Cypher? USING PERIODIC COMMIT basically says: batch this up into smaller transactions, so we don't pull in a hundred megabytes and then do it all at once. LOAD CSV is pretty straightforward. WITH HEADERS tells us that the first line of the CSV gives us metadata. FROM a URL is pretty straightforward. Whatever record comes back from that stream, we're going to call it row, and we're going to specify that the data is semicolon delimited, not comma delimited. The big Cypher statement that I showed you on the previous slide, where we're doing all that MATCH and CREATE, comes later. So now this variable row is bound, and we can create graph patterns with the content of row. If it's the movies CSV file, row will have a title attribute that we can use.

Yes? Somebody's asking about foreign keys. No, there is not, and that's a fantastic question. A foreign key in a relational database is a key to somebody else's table, and we put that there so that we can do a join. There are no joins in Neo4j; there's only relationship traversal, and so foreign keys have no purpose. If I saw a person writing a Cypher query where they were putting a foreign key on a node, and then trying to match a bunch of other nodes where the ID was equal to this foreign key that they had on the other node, oh man, that would be a person who's really trying to defeat the model so that they can work too hard. But the short answer is no, there's really no
foreign key in there.

Related to this LOAD CSV material: in our browser we have this cool function where you can type in :play and then give it a URL, and it'll step you through a nice guide on how to use some of these features. This is how to go through a mini class on importing data in Cypher; it steps you through the same example that I showed, but makes it interactive and executable for you in the browser.

Developing apps: we have APIs for most of the popular languages. When you talk to the database over the wire, you're using a binary protocol we call Bolt. We support Go, Java, Python, JavaScript, and a number of others, and we also have community support for a bunch of other languages: R, I mean, all kinds of different things. There is a Cypher transactional HTTP endpoint, so you can talk to Neo4j over HTTP, but we generally recommend that people use Bolt, much as you would use JDBC for a relational database. There's a link to language guides, where you can get really simple "just bootstrap me and get me started quickly with Python"-type code examples. We have a native Java API as well, where you can actually launch Neo4j in memory, and you can use the Java API to define user-defined procedures and access the core API. So the way that you extend Neo4j itself is typically in Java: Neo4j is written in Java, and you can extend the Cypher language by writing your own functions and procedures, much as you can in other databases.

Bolt is high-performance, it's versioned, it's based on PackStream, it supports TLS, and it also does a lot of connection-pooling-type stuff. If you're talking to a Neo4j cluster, you would like to talk to just the cluster and not worry about whether you're talking to node A or node B. That is called bolt+routing: there's a way that you can set your client app up to use a bolt+routing driver to a cluster, and it'll worry about routing the query to the right server automatically for you.
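To make the routing idea concrete, here is a toy Python model of what a routing driver does on the application's behalf. This is only a sketch under stated assumptions: the class names (ToyCluster, ToyRoutingDriver), the server addresses, and the round-robin policy are invented for illustration and are not the official driver's API or implementation. It assumes the leader/follower cluster picture described later in this talk: writes go to the single leader, while reads can be spread across followers.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class ToyCluster:
    """Hypothetical cluster description: one leader, any number of followers."""
    leader: str
    followers: list = field(default_factory=list)


class ToyRoutingDriver:
    """Illustrative stand-in for a routing driver (not the real Neo4j driver)."""

    def __init__(self, cluster: ToyCluster):
        self._cluster = cluster
        # Reads rotate over the followers; fall back to the leader if there are none.
        self._readers = cycle(cluster.followers or [cluster.leader])

    def route(self, access_mode: str) -> str:
        """Return the address of the server that should receive the query."""
        if access_mode == "WRITE":
            return self._cluster.leader   # every write goes through the single leader
        if access_mode == "READ":
            return next(self._readers)    # reads are spread across the followers
        raise ValueError(f"unknown access mode: {access_mode!r}")


# Made-up addresses, purely for the demonstration.
cluster = ToyCluster(leader="core-1:7687", followers=["core-2:7687", "core-3:7687"])
driver = ToyRoutingDriver(cluster)

write_target = driver.route("WRITE")
read_targets = [driver.route("READ") for _ in range(3)]
print(write_target)
print(read_targets)
```

The point of the sketch is that the application only says whether a piece of work is a read or a write; which machine ends up serving it is the driver's concern, not the caller's.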
You just ask a question, get an answer, and forget about the cluster topology.

You can extend Cypher with user-defined procedures. We mostly don't recommend people do this until they have exhausted the other options with Cypher; you can get really far by reformulating your Cypher query or by putting the right indexes in place. But sometimes you need to access some third-party API, or you need to do something extremely performance sensitive, and you can write your own function or procedure to do that and then call it from Cypher. That gets done with Java annotations: they're just Java classes that are annotated a certain way. You compile this, you get a JAR file, you drop it into the plugins directory, and you're pretty much done: your server has a new extension.

Let's talk a little bit about the migration from relational to graph, because this is a really big topic for us, since most folks are used to thinking in terms of relations. Relational tends to be simple until it gets complicated. You end up with all of these different joins. How many folks here have written one of those gigantic, monster, ugly SQL queries where you're joining eight, nine, ten tables? Yes, yeah. So you know that relational does not really consider the relationships themselves.

The way we think about the mapping between these two ideas is, okay, the naive approach, before you optimize anything: think about all your tables, take each table name and turn that into a label, and make the table's rows a set of nodes. Foreign keys become relationships; that's actually apropos of the question earlier. Whenever you see a primary key/foreign key linkage, you need to be thinking, "in my model, there's a relationship there". Link tables, which are used in relational to resolve many-to-many mappings, are basically just relationships, typically with extra properties. And then you throw out all the primary and foreign keys that you do not need. Sometimes you need a primary key to look up an item by its
identity, but you never need foreign keys, and so you're storing a lot less data to begin with.

Yes? How do you know whether a problem is more suitable for a relational database, for a graph database, or maybe for another NoSQL database?

Yes, we're getting to your section. I can't defer that much longer; we're almost there.

So we know how to query a relational database: we just use SQL, and we do these joins; most people are familiar with joins. How do we query a graph? We traverse; we've kind of covered this. This is starting to get into when you use one versus the other. It turns out that RDBMSs actually don't handle relationships that well, because they can't model or store them without complexity. What I mean by that is they introduce artificial extra features like foreign keys, where you are basically propping up the formalism in order to support this thing that you need to do. The performance degrades with the number and level of relationships and with database size, and we can reason about this straightforwardly, in that joins are computed at query run time. It's not that this data is pre-joined; I have to scan table A, scan table B, and match things up in memory. Now granted, 30 years of database research has gone into making that super fast, but you still have to compute it every single time. Query complexity grows with the need for more joins and as you add new kinds of data and relationships. This is where you get into the NoSQL thing, where SQL being fundamentally schema bound is good in the data integrity sense and bad in the evolvability and agility sense, because it tends to be inflexible: it's not terribly easy to define a new relationship on the fly and then rewire your graph. So one of the things we would say is: when data relationships are valuable in real time, traditional databases aren't the best choice, and the reason for that is
that you're going to be recomputing these joins every single time, and you're typically going to be doing a lot more scans of a lot more data than is strictly necessary.

This gets to when you would use a graph and not relational. Suppose your question was: find all reports and how many people they manage, up to three levels down. If I could give you one slide that got me into graph databases, this would probably be it. I had this data lineage problem, where I had a directed acyclic graph of data and the things used to derive it, and the question I wanted to ask was: give me all reports that were derived from information sourced by the Air Force; I don't care if it's two hops back, or five hops back, or fifteen. And oh man, I spent a week and a half becoming a SQL padawan, learning how to write recursive SQL using stored procedures, optimizing the hell out of it, and not getting very far with it. Basically I went up the mountain and consulted with the SQL gods, and I got the absolute best advice, and in the end it was still terrible.

One of our field engineers, I think, took this; it's an example from a previous customer who ended up coming to the dark side of graphs. Somebody actually wrote this query. I wish I could zoom in; I don't want to waste your time by reading this thing to you, but it's a real query. There it is in SQL, and there it is in Cypher. The magic here in Cypher is that we can specify a variable number of relationship hops that we want to traverse. Earlier on I said that traversing a relationship is fundamentally cheap and very performant in Neo4j because of the way it's set up, and we're about to get to how that works. But here, one line of Cypher with some simple constraints and a RETURN clause gets rid of this much SQL. Why? Because the question and the data are fundamentally graphy. So when you take a fundamentally graphy problem
domain and you put it into tables, typically pain results. Now, if you had a fundamentally table-based problem, where you said, "I just have a customer list of 300 million customers, and all I ever want to do is pull out which ones are in this particular zip code", relational has been optimized for 30 years to do that well, and I'm not going to try to convince you that that's fundamentally graphy. But if your problem is fundamentally graphy, I think we can get there.

So why graph? This is basically about modeling your data naturally: deriving the graph model from the domain and from the use cases, rather than from your college textbooks that tell you it needs to be in third normal form in order to reduce redundancy. Whenever you need to use relationship information in real time, and whenever you need the flexibility to add relationships on the fly, you're probably in the graph sweet spot. Relationships are a first-class citizen, and what we mean by that is the entire database is focused around relationships and traversing them. It's not something that you tack on with scans and foreign keys and primary keys; it's just baked in. An interesting way that I've heard Neo4j described is: imagine if your database had all of its joins pre-materialized, like they had been precomputed once. Then you're most of the way to a graph database.

We have query and data locality. Part of the way that we can be faster is that we identify the starting points and move out from them rather than doing these massive scans: only load what's needed, aggregate and project as you go, and optimize the disk and memory model for graphs. I'm going to get to index-free adjacency in a second.

If you have a social graph with a thousand people and you average 50 friends per person, you end up with this densely connected graph. And if you ask whether there is a path from someone to myself in this social graph, never going deeper than four
hops, okay, first of all, before we talk about performance, can we agree that that's a really ugly SQL query that's very difficult to write? Let's say we warm up the cache and eliminate disk I/O for both databases: these are the observed values. It makes sense if you think about how these databases are built and that it is a fundamentally graphy problem, so this shouldn't be surprising or a hollow marketing claim.

So this is the secret sauce. This is how it works, and this is why I'm asking you to believe that this is very much a more performant database and it's not just marketing. Inside the database, we use pointers instead of lookups, so when you have a relationship, it has a pointer on either end to the place in memory you need to go to find that node. When you traverse a relationship, it does not scan all the nodes to figure out what's connected. Second, we have fixed-size records, and if you know much about disk I/O, that allows us to rip through a whole lot of data very quickly and to do offset jumps and index seeks extremely efficiently. Then, joins on creation: this is what I meant by pre-materializing your join. A relational database is computing it every single time on the fly; we computed it once, when you wrote the data, and never again. So there's fundamentally a computational cost that we don't have to pay every time you traverse the edge. Essentially the secret sauce is that you just spin, spin, spin through that data structure over and over again, and that is why traversing relationships is cheap: it's pointer dereferencing in the end.

Online you'll hear us talk about this whole concept, the secret sauce so to speak, as index-free adjacency. There are articles you can look up on index-free adjacency, but this is really what's meant. When we say that the secret sauce is index-free adjacency, this is why we are claiming that relational can't respond sub-second for
n-way joins, and why we claim that relational is not agile: it requires changes to queries for these new data feeds.

Now, back to your question. You were asking about SQL Server, and hopefully I've given kind of an overview about that. Cosmos is a completely different story because its underlying architecture is different, but you can kind of see how it structures its data differently as well.

That one was not relational, as far as I understand. I'm not sure which service it was part of, but it's not relational; it's supposed to be pure graph. Are you familiar with that?

I'm unfortunately not. I'd love to take a look at it, though.

Okay. It was added recently; at least, I noticed it a few months ago, I'm not sure.

So when it comes to Cosmos DB: Cosmos DB has a way to look at the data as a graph, and I believe it supports querying with Gremlin.

Yes, that's right. But I'd like to see how Neo4j compares to Cosmos, especially when it comes to performance in the case where nodes have a huge number of relationships, millions of relationships. And in the past, when I was trying to use Neo4j, what I learned is that Neo4j does not scale out: you can have a cluster, which is two or three or several beefy machines with lots of memory, and that's all you can have, and if you want to add more data, essentially you will hit the wall. Has anything changed since the time I looked at Neo4j in 2014?

So you had three or four different things there that I need to unpack. The first is the supernode pattern, which is the idea of a node having hundreds of thousands of connections. It is true that supernodes are in general considered an anti-pattern within all graph databases. Now, most of the time when we run into
customers who have this problem, the problem is somewhere in the modeling layer, and so you may wish to make a compromise such that you do not end up with nodes of this crazy degree. Sometimes the problem can be reduced in another way: it matters a lot, for example, whether the node has 200,000 outgoing edges of the same type or of different types, because that of course affects selectivity in the database. So in general I wouldn't recommend a modeling approach that leaves you with hundreds of thousands of links per node, and I think you'll find that's common to a lot of graph databases.

Now, on the Gremlin point, comparing Cypher and Gremlin: I've used Cypher and I've also used Gremlin. I was beating on this point about declarative graph query languages earlier in part because of Gremlin. Gremlin has a very imperative feel to it. It has some declarative features, but in general you're telling it how to traverse things. We find this pretty brittle. One of the things you're going to find with Gremlin after you use it for an extended period of time is that, as your data structure changes, you have to go back and rewrite your queries, because you told it which way to traverse and then that structure changed. You're also going to find yourself optimizing the queries for the database. So you can do graph queries with Gremlin; I just find it to be a lot more difficult, sometimes less performant, and less maintainable as well.

And there was a third point I think I'm forgetting... scaling out, right. The scalability picture is still similar to what you remember. The way I would describe scalability in Neo4j is that our cluster architecture has a leader and followers: you get vertical scalability for writes and horizontal scalability for reads. What that means is that, to guarantee all the ACID properties, you have one leader, and whenever you do a write it has to be
processed through that leader. So you probably can't process more writes than one leader can handle, and so you can scale that leader up. If you want to scale the graph out, you have the option to add read replicas and additional followers to your cluster, so there's functionally unlimited scalability for read queries.

Okay, we covered this. So, Ambrose: we're going to get into learning resources, but we want to start some demos and begin with Ambrose, and then I'll show some extra stuff. Come on down.

All right, my name is Ambrose, I'm from the Services pentest team of CDG, and today I'm going to show you a ten-minute demo of how to import data from CSV into a Neo4j graph without using Cypher, so it's a little bit more intuitive. Let me quickly show you what problem we deal with. During the reconnaissance phase of our pen tests, we have lots of data from different tools. Here on the left we have direct reports mapped to their manager, and on the right we have aliases mapped to the files they recently opened; that's from the Delve tool output. What we want to do is pretty much just combine them and see what it looks like in a graph.

So I created a little helper library and console application that makes it a little more intuitive to import data. Let's take a look at importing the management data here. The CSV looks something like this, and in the little language that I made, you define some metadata about the CSV file: what nodes do you want to create? Here I'm saying I want to create a manager node from the first column, and then a report node, the direct report. At the bottom you specify the relationship: the first node, which is this manager; the relationship name, which is "manager of"; and then the second node, which is report. And here I have the properties: the ID would be manager, the name is manager, the label,
which is the type, is MS alias. I'm just going to run this real quick and show you what that looks like. Wait, hold on, let me make sure that the graph is clear. This is what it looks like at the end; I'm going to delete it. Here I have Neo4j running locally on my laptop. I'm going to run the manager import on this, and you'll see that it takes that CSV file and loads it up over here. So here I have the labels and names applied to these nodes.

Later on we might get some more data about these people. So I'm going to import the second set of data, the Delve data about which files they recently opened, and we'll see what that looks like. For the Delve data down here, I made my helper tool merge on the node IDs, so if it sees a user with the same username, it uses that node instead of creating a separate one. In this case I'm going to ID the files by URL, the users are ID'd by username like they were before, and the relationship is user, then "worked on", then file. So I'm going to run this import... okay, refresh it, now expand it here. Now you can see which are the common files that two users have opened, and this is much easier to see than looking at CSV files. And then you can create whatever Cypher queries you want to traverse this graph, but this solves the initial hurdle of getting CSV data into a graph format.

One last thing: I wrote the tool, and you can find it at aka.ms/csvtograph and download the console application version of this, which is pretty much the same. You just run it like this: you pass it a JSON config file, and that JSON config file looks something like this. You pretty much say which folder you want to use, and then, you
know CSV files with name direct reports will have this metadata which is similar to what I showed you before and then CSV files with you know Delve would have this metadata and this tells how to you know create the neo4j graph so underneath the hood I do call Cypher queries but like if you're just learning this for the first time it may be easier to just you know just define the properties like so here yeah so essentially the library I wrote you know it doesn't necessarily have to be a CSV it just has to be an IEnumerable of things right it doesn't have to be CSV so maybe you can connect to a SQL database or something and then get your objects in the list and then you know import it that way so yeah both the library and the console app are located yeah so as probably the least technical person in this room that was super impressive and that's super intimidating so David why would I choose to do this method for CSV ingest into neo4j as opposed to the standard built-in LOAD CSV um let me ask a question before I before I answer that so did you read the read the files line by line and then use like CREATE and MERGE or did you take the data from the user and then run LOAD CSV in your code no I didn't run LOAD CSV okay custom yeah well so a really good reason why you would want to not use LOAD CSV would be for example if you needed to do something in your programming language that was well supported in your programming language but it wasn't in Cypher so I've seen users do stuff for example maybe you get some addresses and you want to run them through Google's geocoding API and what you want to put into your graph is latitude and longitude okay you can't call the geocoding API from Cypher but you can do that in a C# program that would be a good reason right another is that you may know more about the form of the data and you need a really high-performance insert so you 
might choose for example like LOAD CSV will say batch 500 records at once but if you're doing a zillion of these you might want to batch them in a particular way for performance reasons you know but I think the biggest reason would probably be programming language specific features where you're not really just loading data but you're transforming or enriching or cleaning or doing something else with it yeah so like LOAD CSV is a pure Cypher solution so you can do anything with that that you can do in Cypher but Cypher is not a general-purpose programming language it's a graph query language so sometimes that makes more sense so I mean that really gets back to the core question of what problem are you trying to answer yeah well you know with technology there's always 15 different ways to do this I think and they're not necessarily wrong yeah um do you think if I'd asked you like in your learning curve when you first got started with neo4j can you recall anything about it just for people who are new that that you particularly liked or thought was good or sticking points where you particularly had problems or didn't understand concept I found that querying by relationships really helpful the fact that you can just like you know I just like the you know the querying like you had different levels you can query down three levels that was good what I didn't like too much was around like creating nodes because it's kind of awkward you don't you don't really create nodes much when you're using the neo4j graph right so like every time I need to create something I kind of like had to reference back so that's kind of why I created this helper tool so I don't have to remember what the syntax was for creating stuff so yeah that's awesome half of what engineers do for a living is automate away the the pain and the problems that they have so um I mean that that's really cool I the question is so you have this 
graph right and you can put all the CSV in it have you guys looked at any kind of analyses that you might do on top of that so it sounds like your use case is fundamentally like analysis if we know all these people in this department are looking at these documents did you ever have you looked into like what are the top three most influential things that that are the most widely read or you could say like if you knew who wrote these documents you could say internally who are the top five most read authors at Microsoft yeah those are good questions we haven't really figured out what the exact query is for that but David Baldacci or Stephen King with Microsoft probably lovers so yeah that's pretty much my demo you can get it at that URL and just feel free to reach out to me if you have any issues thank you alright so I'm gonna take you through a couple of different demo related things we're gonna talk a little bit about tooling and sort of show you how the software actually works just before we begin as I said this is on the Azure Marketplace and so when I showed up today I deployed a three node cluster here it is I've got my coordinates set got all my Azure resources I can show you logging into this later and show you how the cluster topology works but basically anybody can do this you go to the marketplace right here and I type in neo4j and we get lots of different options 3.5.1 is the latest so neo4j Enterprise is a single node setup where you're going to get one machine and causal cluster is a multi node cluster setup and so if you're just looking to play you do not need three VMs to do that or nine or 15 it depends on how you want to scale but neo4j Enterprise will do if you want the high availability guarantees then you would go for causal cluster 3.5.1 right there and launch that so quickly I want to show you this tool this is called neo4j desktop and so inside of neo4j desktop you can create lots of local graphs and have multiple instances 
of the database running just on your machine if you don't want the cloud setup so here I've got a Microsoft demo graph which I'm going to start right now and inside of neo4j desktop you get these things called graph apps these little tiles up here at the top are applications that you can run against a neo4j graph and so what Ambrose was showing earlier is this application here called neo4j browser and this is kind of like a lot of the times when users first come to neo4j this is their first point of entry really it's just a Cypher command execution shell with some graphic stuff built on top of it so I saw him execute this query we're gonna do the same thing on on mine you're gonna be shocked to learn that I have movie data in my demo set for today and so I've got this big graph right this tool will allow you to for example select nodes and I can for example change the colors of all of them and I can change how they're captioned and you know make some nodes bigger than others and so on and so forth it's a force-directed layout but basically it is a command shell you can run queries and then capture CSV directly as a result of this one thing if you're playing with this I would recommend that you check out the play command colon play it will run in browser guides where you can do learning and examples and tutorials so for example a lot of the stuff that I've shown you today if you do play movies and then hit enter it will step you through a little guide we're not going to do this right now but this is what the guide looks like it'll tell you how to create this graph and you'll see these little play buttons I can click on the play button and it'll automatically insert that so as they step through the tutorial I can execute stuff and start to play really quickly alright so that is kind of neo4j browser I'm gonna keep a picture of my graph up because we might need to compare it when we look at some of this other stuff we also have this tool called bloom bloom is for 
exploratory visualization so whereas browser is kind of for command execution run Cypher get a result see the trouble with that is that you have to know Cypher and you have to be kind of a data engineering sort bloom is a more natural language focused tool with better visualization where you're going to start in a particular point not execute an analytic query you're looking for patterns you're looking to identify bigger issues so I can say for example person named Tom Hanks and you'll see that it interprets that as a graph pattern it thinks what I mean is match a person with the name property Tom Hanks that is what I mean here's Tom let's command e to expand him and we blow out just his immediate hops and then we might say ooh You've Got Mail was not a great movie let's skip that one oh let's see what was a great movie oh Joe Versus the Volcano everybody anybody seen that that's some knowing laughs back there all right so Joe Versus the Volcano it's about a man who gets convinced that he has this fake brain condition and he's going to jump into a volcano anyway I can't I can't go into that but let's expand Joe Versus the Volcano because it's a funny movie we've got Nathan Lane John Patrick Shanley and Meg Ryan who were in that movie they're all connected to Joe Versus the Volcano and now let's say we want to just we can then explore further and say I happen to know that Meg Ryan was also in You've Got Mail so if I expand Meg Ryan look at that Sleepless in Seattle pops up as a movie they were both in and You've Got Mail so we can sort of see the commonalities let's pick Tom Meg and You've Got Mail and dismiss everybody else and then refocus so this kind of GUI is basically meant to do these kinds of exploratory visualizations a lot of our financial clients for example they might have an idea of how fraud is happening within a financial graph but not really be able to quite nail it and so you can use this for hypothesis exploration where you say I think this is the pattern 
that's happening now let me go look for evidence of that and then rinse lather repeat to build an overall picture of how the network works you know as Ambrose said earlier it's just so much easier to actually see this data as a graph sometimes that it spurs a lot of discovery in that way so that's that's kind of a quick overview of Bloom we also have a graph app here called Halin which is for monitoring let's actually let's close this one and do instead our cloud instance see if we can get that working here inside of neo4j desktop I've got the second tile Azure deploy that is this deploy that I set up at the very beginning of the meeting and I'm going to activate this as a remote database not running on my machine and I'm gonna open my little Halin graph app and what I'm gonna see here is a lot of operational metrics about a multi node cluster so along the top you'll see these three IP addresses are the the machines that are participating in my cluster the one with the little star is the leader of the cluster that takes all the writes and the other two are followers and basically this is this is a rather in-depth program we're not going to go all the way through it but you can hover here and you can sort of see it's got this green status everything's looking good that's because it's basically not doing anything right now the cluster is idle and then we can look at individual machines and for example take a look at what the the memory or the disk on this machine is doing what plugins do we have installed and how is it configured so for example I can type in here whoops I can see what addresses it is advertising to the broader network one last thing about Halin is in the Diagnostics tab you can run Diagnostics and it will gather a lot of information about your cluster and give you recommendations on what looks good what might be misconfigured what you might want to look into and this sort of helps users diagnose some of the most common problems associated with 
configuring clusters and so on okay so any questions about kind of like the graph app concept we have a an API online so that if you wanted to write simple applications sitting on top of neo4j this is a pretty good way to do it in the end they're just JavaScript programs typically with a certain JavaScript stub API injected that allows you to connect to and talk to the the graph with a bolt driver instance okay yes yes it's the same okay the only difference here is that it's wrapped inside of a graph app but it's the same yes all right so that's kind of the nickel tour of neo4j desktop let's talk a little bit about the analytics and data science parts I want to show you a simple or maybe not so simple query this is it right here let me get to the right browser tab okay so we bootstrapped a whole lot of knowledge into you today about what graphs are what Cypher is how all this stuff works we have this library that comes packaged as a plug-in called neo4j graph algorithms or just call it graph algos for data science groups data engineering groups really really recommend you take a look here because this is all of the graphy data science stuff here that you're not going to find in other libraries and that neo4j makes really easy that these sorts of things are going to be quite difficult in other systems so the graph algorithms plug-in is basically a compilation of a lot of different algorithms across a lot of different use cases this documentation I've always thought it's pretty good because not only does it describe like the syntax of like how you're going to call this or that algorithm but it also goes into okay what is Louvain about and when would you use this and and you if your problem looks like this then you probably want PageRank instead of something else um some of the most fun things that I have done working for neo4j have been interacting with this so quick story you know neo4j has this data journalism outreach program where we use graph technology 
work with journalists who are doing investigations help them get answers and then they write up the story and publish it and when I was brand new to neo4j I got to work with NBC News they were doing coverage of Russian Twitter trolls manipulating elections in the United States and using the graph algorithms package we loaded a whole lot of Twitter data in that was given to us by the sources that the journalist had cultivated and we were looking for community issues that is how what topics were the trolls most frequently talking about how did they break down along certain demographic lines and all of that social network data is fundamentally graphy and lent itself well to the algorithms that are in this package getting to play with that and doing those kinds of data analysis is really fun primarily I was using at the time the community detection algorithms we were looking at all the trolls and saying if you got rid of all the civilians and you looked at just the people who were involved in this instigating behaviors who was talking to who who was retweeting who and it turned out that they sort of broke into multiple communities prior to the analysis and the publishing NBC did the thought had been there's this big group of trolls and they want a particular candidate to win and that's really the end of the story and it just wasn't at least that's not what was in the data we found that there were a group of trolls who were aligned with right causes a group of trolls aligned with left causes and a third group and that the overall modus operandi if you will like what they were trying to do was more about social division and less about advocating for a particular person in that election and so that that stuff got written about in NBC News and we don't have to go too deep into it but for for data scientists people I just want you to know that like what went into that reporting was neo4j and graph algorithms and this is where you should start if you want to do data 
analysis with neo4j so as a simple simple example I did a query for harmonic closeness on my movie graph this is just computing whoops oh I didn't start my database that's that's going to help databases are much more responsive when they're running so let's take a look at this query um basically what we're getting is this metric coming out of the harmonic closeness algorithm the metric is centrality I believe with this particular algorithm the metric in and of itself is not meaningful so it's not like two point two eight six means something but magnitude and directionality is meaningful and so this is basically about when you look at the graph how central is a node or how influential is it in the overall flow of information throughout the graph and so the way that this query works is that CALL algo.closeness.harmonic.stream and so fundamentally when we use the CALL keyword we're calling a stored procedure we're giving it some parameters where we're looking for nodes and ACTED_IN relationships and then basically we yield some data and then basically match nodes in the graph that were yielded out of this harmonic closeness and then return whether it's a movie or an actor we order by centrality in descending order to find the most central nodes first and limit to the top 20 and our buddy Tom Hanks comes out on top and the reason for that is that when we created this sample data set it was all about it was centered on Tom Hanks and we tended to add data around him right so it's not really surprising in this particular data set that he comes out as very central also not surprising is a lot of stuff that he's in shows up being central because as we expand it out and added to this this data set we tended to tack on you know co-stars and directors of the movies that he had been in right so if for example you did this on the entirety of Twitter you would probably find at the center PewDiePie Lady Gaga and all of those right because that's who people are 
retweeting that's who's sending out the most content and that's who's reference the most often right so um one thing I can't really do justice in less than a full-day class is the depth of how much is in this algorithms stuff so basically we we kind of group these two families of algos if you will so sometimes what you care about is centrality and that's what I just showed you sometimes what you cared about is community detection so I wish I had a I kind of do have a whiteboard maybe the cameras will follow me so sometimes if you have a graph like this let's say you have a graph like that let's say your graph looks like that totally made up but you can sort of see that while there are a bunch of nodes and edges there are three distinct communities in there and so that the family of community detection algorithms are really trying to what they're trying to do is give you this one two three a lot of these algorithms as with many machine learning algorithms and other things that you've used they come with a lot of tweakable and tunable parameters right so if you fuzz it enough hey it's all one community and if you if you make the community stark enough then you have as many communities as you have nodes here right and so as with many other algorithms that you're going to work with there's some some tweaking and tuning according to the domain that you're dealing with and how specific that you want to get but that's kind of community detection now if you look at this this particular graph if we ask about the centrality of this graph you don't really get much meaningful out of that because there isn't really any node with the possible exception of this guy that's really central in this graph right on the other hand if the pattern that you're looking at is is sort of like spoke and hub like this and you run centrality a centrality type approach on that here we can clearly see even visually even without a fancy algorithm what is the most central node in that graph right 
and so this is where and I know you guys know this from data analysis it's just that the technical features offer us all these cool algorithms that we can run but then we have to have a whole lot of domain understanding of what our data is so that we can fit this abstract math thing that works that does something really cool in software and then fit that to our domain so that we know which algorithm to use and when that is the real art that's why we get paid the big bucks right yes just a use case that we're looking at we won't actually colors of nodes based on some here you huh let's be a catalyst for the graph the basic needs to be APS ah both depending so there are path finding algorithms that's a whole class where that might be what you want to do it depends on how sophisticated your paths are sometimes what you want to do is sort of like an iterative algorithm where you find a path and then you set property on that path and then you just sort of iterate on that and then you expand it out time and time and time again expanding out which nodes get this magic property each time you know what I'm saying yeah so it kind of depends on what you want to do but the path finding algorithms are a help for that but in many cases because Cypher's kind of good at that out of the box sometimes the path finding algorithms are overkill for that and you just need some simple Cypher can you say a little bit more about what kind of paths do you want to find and what meaning do they have for you yes actually I look on the build up your building our clusters for Azure oh okay cool those build outs have a certain series of jobs that happen aha and these jobs are the dependencies that would need to go through as we traverse through that and so sometimes a certain jobs get locked or delayed then we want to we figure the oh man so that's a dependent yeah that's a dependency graph so you're talking about something like a like an ARM template is that the kind of job I mean it's Thomas 
essentially you have it bust alive and you're already doing it this is still an ad should have been open behavior so you got some job one that triggers job two you know triggers job three and then you get to this good place right yeah but job one also triggers J four which triggers J 5 which is needed for J 3 and J 5 is blocked right Thank You chief yeah yeah so um let's we'll just put a green circle around and say job five is blocked right so this is very similar to like the network and IT operations use cases where sometimes what people are doing with graphs here is these are routers and machines and what they want to ask is what in this chain if it fails will mess up my ability to deliver this okay and so basically what you might want to do is ask a query like start finish alright okay and then so you want to enumerate all paths all distinct paths from start to finish and then you want to go through those paths and ask if any component on that path is blocked right that's fairly straightforward to do in Cypher right because we know how to match a path we know how you can use the distinct keyword to say I want distinct paths and then you can use functions in Cypher to pull out the nodes or you could say match a uses I'm just making this up as we go star b where b dot blocked true and then this will get you p equals return p I'm just going really fast hopefully this is legible you can say match all paths where I'm going from some starting point to some node that node is blocked give me that path and then you might further further this and say b no I'm getting really small and then say finish right and then that right there would identify the blocker right well yeah that's right that's why I was thinking that pathfinding algorithms might be overkill because you can do this sort of thing right and then you can sort of you could then for example color that node by setting the new property on it yeah and then if you did that say every 30 seconds then separately scan 
what are all of my blocked guys and and then have some resolution approach for that does that answer your question yes similarity algorithms and pre-processing functions and procedures yeah we're coming up to the end of our time here I want to make sure that we are respectful of everybody's time and leave some time for Q&A I've gone through a lot of algorithms and a lot of stuff like I could talk for hours about just this piece but it gets really deep really fast and as an intro topic I just want to leave you with this as a point of exploration oh finally when you get this is only gonna take a minute or two when you get lost where do you go how do you get help this is don't worry about writing these links down or anything like this because we're gonna distribute a copy of the slides so all this is going to be clickable for you everything that I've told you today you're going to be able to find in written documented form out here on our developer pages we have a GraphAcademy where you can get self-paced online training we have a neo4j certified program where you can go through an exam process to get neo4j certified become a professional and get that sweet sweet neo4j swag right I mean this being the techie industry I know you guys don't have any t-shirts at all and you're desperately in need of some and I want you to know that we're here to help okay so the one hour certification exam covers a lot of the topics that we've covered today and goes into a little bit more depth we told you earlier about the Cypher ref card like if you only had one link how do I do something in neo4j a single link don't even want whole developer manual this would be it this is the cheat sheet of everything that you need obviously we have developer documentation operations manuals for how to run clusters we have a knowledge base as well so paying customers get access to a lot of extra articles we have a very large number of public articles as well as well as frequently 
asked questions they're version specific so you can find out about how this worked in this older version whatever you need lots of sample applications that you can git clone compile and go we are very passionate about our user community and we have regular meetups all over the world we have a community site community.neo4j.com where you can get connected with some of these and you can meet some of my favorite fellow graph nerds that's also a great point of entry to just ask a question we have a fantastic developer relations team that tends to jump on helping people who are out there and wanting to get them started and have a good early experience with graphs you know neo4j grew up in the open-source world so if you want to know I don't know maybe you wake up late on a Friday night you know when you should be asleep and you say how did they implement index-free adjacency I must go to github and know you can do that we have Stack Overflow you know one of the early ways I got involved in the community is I answered a zillion questions out there on Stack Overflow and so a lot of different things that you might want to do have already been covered books man it's endless okay and some some colleagues who who I have tremendous respect for we have two people Mark Needham and Amy Hodler are in the process of publishing a book on graph algorithms and so it's going to be a very deep dive on all the stuff that I can't cover today for for time reasons so when that O'Reilly book comes out I really highly recommend you pick up a copy because bigger brains than me I'll tell you that anyways that brings us to Q&A I am I I want to be here to help you have the smoothest early process and get any of your questions answered and please don't spare me the hard ones in the community has anyone attempted algorithms which are like useful probabilistic graphical models on neo4j like the enforcement learning of mandates those kind of yes I am myself not super deep on that I'm 
not sure I can answer that really thoroughly but but yes there's a guy named Andrew Jefferson who has written a couple of posts on this topic and Octavian.ai I think is the group that he's working with and so that's a fairly deep rabbit hole like how does neo4j connect to deep learning and and a lot of those related topics that's an area that we're increasingly getting into there's some publishing out there right now and there's more to come is there in the plans for neo4j anything with Power BI um so I think in in community open-source there's a number of ways that you can do that right now there is a JDBC driver where you can write Cypher and expose a table okay there's not a Power BI specific integration that I'm aware of but we are always you know I work in partnerships and so on and we're always looking to hear about what integrations are most valuable and why we'd love to talk to you more about that offline so you guys are troopers it's like five o'clock and you've been listening to graph stuff for two hours yes you mentioned the problem of super nodes I was wondering what other problems people are typically running into and what are the early symptoms of oh that's a good one and that's not an easy one either all right okay you took me seriously when I said ask the hard ones okay all right um the super nodes are definitely an anti-pattern um neo4j does not index relationship and or sorry relationship properties and so an anti-pattern that I see is over reliance on really dense property metadata on the relationships okay so for example if the way that you design your model is with very thin nodes and very fat relationships with 20 or 30 properties and then you want to write a lot of queries that are very selective on relationship properties this is generally not a good idea okay that's that's one anti-pattern let's see super fat nodes where you don't split things out is another anti-pattern so recall that I've been saying all along 
relationships are very cheap to traverse right and so let me take a super simple example let's say that I have a customer in a relational database we're going to give them a name and a state and a city let's just keep it that simple okay in relational databases we tend to think about tables and so what we do is we pile lots and lots and lots and lots of properties on on and this breeds queries where you're going to be searching for all the customers who are in the state of Virginia I'm from Virginia so I use that as an example a better way to do this is to take your low dimensional categorical variables and to break them out into nodes so for example I would not do it this way I would do it like this customer name has state has city okay so an anti-pattern is having lots and lots of really fat nodes and not that many relationships in this way you are defeating what the database does well okay and you're falling back on your old relational scan and filter skills does that make sense um was there a second part to your question that I missed or did I answer it early symptoms early symptoms um bad query performance and also the other early symptom would be your graph model looks exactly the same as your relational model that would be a giveaway so if you have these different affordances and different flexibilities it ought to look at least a little different all right it was a question the men say yeah probably so actually yeah um this would help distinguish between Richmond Virginia and Richmond California which I can tell you are very different places yes it's kind of awkward to like model the times here say that in some way this is perhaps like what's your recommendation generally after you were like adopt different data put based models that great kind of thing right yeah oh man you really tee up the softballs for me okay so in earlier versions of neo4j it used to be that we recommended this time tree approach and that's probably what you're talking 
about so in the time tree approach you would do just for folks who have not seen the time tree approach you would have a 2019 node linked to a January node linked to a what is today anyway the 24th day and then you would link that to a customer call I don't know let's say a customer called and this would be a time tree and so if you wanted to find all the calls that happened in January you would match to the January 2019 and then get all the nodes from there neo4j 3.4 introduced a temporal data type so you can have times and date times as a native data type on a property and we have optimized indexes for that and so when the temporal types and temporal indexing came out we don't do time trees anymore you just put a date time you index it and that's that for the same protein so just stack that on the same for a few as P times M so if you have a history of values there's a different pattern for that so we basically if part of the problem of your domain is that you have the customer today but you want to know every state the customer has ever had in the past then we're going to basically model the customer as a linked list and the head of the list is always so taking our example right here and sketching it out so you've got this customer call on 1/24/19 and then we'll have a next link and then we'll have the same one on 1/23/19 and so on and so forth so when when people need to do revisions in graphs what they do is they basically never modify the node they create a new one with duplicate data but whatever they need to change and the old one gets pushed back on the linked list does that make sense and in this way you can traverse the chain and it's like a time machine right you can navigate to that node through the query and then you can go back however many iterations you need until you hit some time point first a few additional was basically just to a extra attributes that are appropriate yes sir it's just another property it's just another property that has 
the date time datatype okay all right yes um you know I'm let me get to your question in just a second gonna just pull that up because why not to do so yeah so that's what I'm talking about it's just a property data type yes mr. Larry so in that example is sort of what the no theater would be accustomed another customer quality oh yeah what the what the different property of a different gate that's exactly right yeah it would be a different property of a different date but otherwise the same data and in this way we would basically be keeping every revision of the node that we had enabling the time-machine aspect um um call that git for graphs you know cuz you can even branch with it you really can I've seen people do that um any other questions any online all right thanks a lot guys we hit our time well we're over by two minutes [Applause]
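The linked-list revision pattern from the last answer (never modify a node, create a new one with the changed data, push the old one back on the list, then walk the chain like a time machine) can be sketched in a few lines. This is a minimal in-memory illustration in Python, not neo4j code and not anything shown in the talk; the names `Revision`, `revise`, `as_of`, `valid_from` and `previous` are hypothetical and stand in for nodes, a date-time property, and the link between revisions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Revision:
    """One version of a node. `previous` stands in for the link to the older revision."""
    properties: dict
    valid_from: date                       # stands in for a date-time property on the node
    previous: Optional["Revision"] = None  # hypothetical name; the talk only says "a next link"


def revise(head: Revision, changes: dict, when: date) -> Revision:
    """Return a new head: a copy of the old data plus the changes. The old node is not modified."""
    return Revision({**head.properties, **changes}, when, previous=head)


def as_of(head: Optional[Revision], when: date) -> Optional[Revision]:
    """Walk back from the head until reaching the revision that was current on `when`."""
    node = head
    while node is not None and node.valid_from > when:
        node = node.previous
    return node


# Assumed sample data: a customer whose state changes between two dates.
v1 = Revision({"name": "Ann", "state": "Virginia"}, date(2019, 1, 23))
v2 = revise(v1, {"state": "California"}, date(2019, 1, 24))

print(as_of(v2, date(2019, 1, 24)).properties["state"])
print(as_of(v2, date(2019, 1, 23)).properties["state"])
```

In a graph database the same idea would presumably be expressed as nodes joined by a relationship, with the head revision reachable from a stable entry point; the sketch only shows the copy-on-change and walk-back logic.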
Info
Channel: Microsoft Research
Views: 45,980
Rating: 4.9058242 out of 5
Keywords: microsoft research
Id: oRtVdXvtD3o
Length: 116min 54sec (7014 seconds)
Published: Mon Mar 25 2019