Building a Real-time Recommendation Engine With Neo4j - Part 1/4 - William Lyon - OSCON 2017

Captions

So this is the Building a Real-time Recommendation Engine With Neo4j workshop. If you're in the wrong room, you still have a few minutes to go somewhere else. These slides are online. Down at the bottom, if you can't read it, that's bit.ly/neo4josconslides, all one word, and that also has all of the other links we'll be using today. If you didn't get a chance to install Neo4j, that's fine. We're going to use the Neo4j Sandbox today, so you won't need to worry about installing Neo4j. We are going to do a section in Python, so if you have Jupyter installed, in the slides here there's a link to a Jupyter notebook and we'll use that, but that won't be until, well, after the break. So for now let's go ahead and get started.

My name is Will. I work on the developer relations team at Neo4j. I work on things like building integrations with Neo4j, and also making sure that our users are successful with Neo4j. Right now I'm working on a GraphQL integration for Neo4j. Any GraphQL users in the room? OK. My contact details are here. Feel free to reach out after the workshop or during OSCON; happy to chat with anyone. I'll be here all week.

So what are we going to talk about today? The rough schedule is: for the next 30 minutes I'm going to give a brief overview of Neo4j and recommender systems, and then we're going to dive right in to actually using Neo4j. We're going to learn about graph data modeling and data import, and we're going to learn Cypher, which is the query language for Neo4j. The goal is for most of the session to be hands-on, with you working with Neo4j and writing Cypher queries. We have a data set that we're going to work through to see how we can import it, model it, and query it. Just curious, how many people have used Neo4j before? OK, maybe a third or so. And how many people have used a graph database other than Neo4j? OK, cool. So this is the rough outline. The break is at
3:00 to 3:30, I think. If that's wrong, someone let me know, please. Great, thank you.

OK, so for those of you not familiar with Neo4j: Neo4j is a graph database. That means that we model, store, and query our data as a graph. Specifically, the data model that we use is the property graph data model. We use the Cypher query language, which is part of the openCypher project, so it's an open query language; there are other database projects that implement openCypher as well. And then there's this concept of native graph processing that's really important for graph databases. Traversing a graph is the primary way that we interact with it: when we write a query, we go from one node to another that it's connected to by following a relationship. This is traversing. The concept of native graph processing means that we don't have to look up an index when we do this traversal, so essentially when we do a traversal we're just chasing pointers. That means that the performance characteristic of a local graph traversal is the same for a graph with a thousand nodes versus one with a million nodes. This is a really important concept when it comes to performance for graph databases.

Neo4j also has lots of clients, or language drivers, in lots of different languages. Later today we're going to be using the Python driver for Neo4j. And finally, I just want to point out that Neo4j is open source. All the code is on GitHub; we can build it from source, or we can download it and get started.

Cool, so that's a brief overview of Neo4j and graph databases in general. Let's dig in to the graph data model, and specifically the labeled property graph model. These are the basic components of the property graph model. Nodes: these are the entities or the objects. We can store arbitrary key-value pair properties on nodes, and nodes have one or more labels. In
this example the label is Node, which is not very helpful, but you can think of the label as the type or the class of the entity. Then we treat relationships as first-class citizens in the property graph model, so relationships are modeled explicitly, connecting two nodes. Relationships have a single type, a direction, and again arbitrary key-value pair properties. If we were to map this to language semantics, we would say that nodes are nouns, and key-value pair properties are the adjectives that describe the nouns. We can think of relationships as verbs that connect nouns, and adverbs are the properties on relationships that describe the verb. So that's just another way to think of our data model.

So what kind of data can we model as a graph? Here are some examples. We could model information about businesses and users that have reviewed the business. Businesses are in a certain category, and we also have a social component here, so users are friends with other users. We could also model information about credit cards and credit card transactions that occurred at some merchant that's in a zip code, where we have some risk score associated with the card. This type of data model might be useful if we're interested in detecting fraud. We could also model information about companies and VC funding: their business models, and what investors participated in various funding rounds. This might be useful if we're trying to, for example, see whether in a given city there are clusters of companies in a certain industry and the VCs that are funding them. Maybe we want to know which are the right VCs to pitch, given the startup that we're thinking about launching.

This is actually the data set that we're going to use today. How many people are members of meetup.com? Quite a few, cool. If you're not familiar, we're going to talk a lot more about Meetup once we start working with data, but the basic idea is you're a member of one or more groups,
groups host events, and members attend events, where you meet each other and learn about cool topics. These events occur at some venue. What we're going to focus on today is how we can, first of all, import data using this data model, and then how we can query this data set to maybe recommend groups that you might be interested in joining, or events that you might be interested in attending, based on topics that we've inferred you're interested in, and maybe some social graph that we've also extracted. All that information can play a role in the type of recommendations that we want to work with.

Another really common use case that we see people using graph databases for is in the general category of IT and network operations. You can think of all of the components in a data center, from racks to switches to routers to the actual hardware and the virtual machines. This is all a big dependency graph, and modeling that data as a graph and storing it in a graph database is very useful for doing things like root cause analysis and dependency analysis: if this service goes down, is there a single point of failure? What applications will be impacted if one of the nodes in my database cluster goes down? These kinds of things. You can see that there is a link for all of these data models; these come from real examples that you can play around with. So if you get the slides, feel free to dig into a use case there that you might find interesting.

Really, we think that graphs are everywhere. Once you start thinking in terms of how you can model data as a graph, you start to see graph problems in different places that you look, and this is something that hopefully we can get across today.

Cool, so that's the basic idea of the labeled property graph model. Now we need to talk about how we query a graph model. With Neo4j we use the query language called Cypher, which, as I said, is part of the
openCypher project. Cypher is a query language for graphs; you can think of it as SQL for graphs, but designed for graphs. So what does it look like? We're just going to look at one example, because we're going to be spending a lot of time later on learning Cypher and actually writing some Cypher queries. On the first line here we're saying MATCH, so this is kind of like a SELECT, and then we give the MATCH statement some graph pattern. These patterns are defined in this kind of ASCII-art diagram: nodes are defined within parentheses, and labels and relationship types follow a colon. In this case we're saying: find nodes with the label Movie and bind that to the alias, or variable, m, so we can refer to it later. Now follow incoming RATED relationships to find the users that reviewed these movies. Then I have a WHERE clause, where I want to filter to movies whose title property contains Matrix. Then I want to group by the title, do an aggregation for the number of reviews, and return the movie and the number of reviews, ordered by the number of reviews. So essentially this query says: find me all of the Matrix movies, find all the users who rated those Matrix movies, and tell me which Matrix movies have the most reviews. So this is just what we talked through, breaking down the different operations in the query.

So that was just an example. There's lots more content for digging into Cypher on the web. neo4j.com/developer is a good resource; it has lots of code samples and different use case examples using Cypher, so that's a good place to start.

So let's talk about recommendations, since that's the focus for the use case that we want to dig into. By the way, does anyone have any questions? Feel free as we go throughout today to raise your hand or shout out if you have questions. I want to make sure it's as
interactive as we can make it with a big group here. Any questions on things we've covered at this point? Yeah? [Audience question, partly inaudible: industry in the city.] Yeah, that's a good question. So the question is related to the direction of relationships: is there a standard for which way the direction goes? Here we have company has office in city. We could also do something like city contains company. I think generally the guideline is that if you read it as a sentence, it should make sense. So this company has an office in the city, yeah, OK, that makes sense. And the other guideline is to be relatively consistent. As long as you follow those two guidelines, I think you'll be OK. Good question. Anyone else?

Yes. So when we write the query (let me go back to our Cypher example) we can write the query to traverse either way, and actually we don't even have to specify a direction at query time. Sorry, so the question was: if we choose to model a relationship direction one way, are we then limited in how we can query it? Every relationship has a direction, and we have to store a direction, but we don't necessarily have to specify a direction when we query it. A good example of where we maybe don't care about direction is to compare, say, Facebook friends versus Twitter followers. You can follow someone on Twitter, and maybe or maybe not they follow you back. In that case you need to model that relationship as directed, and if they follow you back you can model a second relationship coming the other way. But on Facebook, when you become friends, you both have to accept that, so we don't really care about direction with that type of relationship. We might not specify it at query time; we just want to know all the people that are friends, and we don't really care who friended whom. Cool. Yeah? So the
question is, can we have more than one relationship connecting those? Yep, absolutely. We can have more than one relationship, even of the same type, between two nodes. Maybe a user views a web page, and that has a timestamp on it, and then I do it again, and because I'm storing that timestamp, maybe I want to create another relationship. That's perfectly valid.

OK, so let's move on to digging in to personalized recommendations. This is something that I think we've all been exposed to. Netflix and Amazon are probably the most talked about when you're looking for examples of recommendations: these are movies and TV shows you may be interested in based on things that you've watched previously; here are books or products that you might want to purchase based on your viewing or purchase history, or based on what other people who are viewing similar things to you are looking at and purchasing. It's very obvious that recommendations drive user engagement with your application, with your service. I just grabbed this quote out of a McKinsey report that said that 35% of what consumers buy on Amazon and 75% of what people watch on Netflix come from recommendations. So it's certainly something that increases engagement and, in some cases, revenue.

Let's look at a hypothetical example of the type of data that we're working with, and maybe some challenges when we're building out a recommendation system for the enterprise. Here's a website. We're shown some personalized promotions, so the Dreamhouse series is 15% off, and then we're presented with specific product recommendations: people who bought this thing that I'm looking at also bought this other thing, and then similar products in an office series that I've looked at. This is dynamic content that's generated based on my engagement with this website. And if you look at the type of algorithms that are
used in recommendations, there are basically two extremes. First, collaborative filtering. With collaborative filtering we're looking at data about users and items that they've interacted with, and essentially it boils down to finding similar users in the network, making the assumption that similar users are interested in similar things, and saying: these things that similar users are interested in, you're probably interested in them too, so here's the recommendation. That's collaborative filtering. Content-based recommendations are essentially looking at metadata, or information maybe about a product catalog or some sort of concept hierarchy, which we're going to talk about in a minute, but basically extracting information about the content and using that to generate the recommendation. If we think of the data model that we would use for these, with collaborative filtering we're talking about user-product interactions, and with content-based we're looking at extracting information, categorizing products, that type of thing. I say these are the two extremes because in reality most recommendation systems that are actually implemented out there in the wild are some hybrid between the two, and we'll see how we can combine both approaches going forward.

So think about the typical enterprise that maybe does brick-and-mortar sales and e-commerce as well. Think about the different functions and where the data lives that we need to access to generate these types of recommendations. We need information about purchases; that might be stored in a relational database like MySQL. We need information about the product catalog; we might be storing that in a document store. We need information about what the user has in their shopping cart; that might come from Redis or some cache. So really we have this type of information spread across lots of different database systems in our enterprise. So in order
to be able to query across these and piece them together, we need some way to bring this information together. And you might say, oh yeah, I've seen that before, that's the data lake, right? We put all of our information into this data lake. It's great for running MapReduce jobs on Hadoop or something, for analytics and BI. And it's true, sure, data lakes are good for that. But the problem is that these are typically things on Hadoop where I can do MapReduce and that's about it, which means that the overhead for generating a single recommendation for a single user is beyond the speed at which I can have that on the hot path for recommendations on store.com, on my website.

If we look at a graph-based approach to this, what we often see is that we can bring in information from our relational database that has purchase information, and from our document store that has product catalog information. We bring those into a graph database like Neo4j, and then, because of this idea of native graph processing (we talked about this idea of index-free adjacency, being able to query with sort of constant-time performance characteristics across very large data sets), the query for generating our recommendation can be on the hot path for showing personalized recommendations to our user on the website.

Why is this important? This means that we can take into account all the information we have about a user as of the point in time when we're generating the recommendation. If we go back to the data lake or Hadoop approach, sure, we can generate recommendations across all of the data in the enterprise that we have about a user, but we have to do that nightly or hourly; that's done in batch. If I've just bought something, or if I've just put something in my shopping cart, I need to take that information
into account at the time that I'm generating the recommendation. That, along with the flexibility of the graph model for integrating all these data sets, I think is a real advantage, especially when we're talking about building recommender systems. So when we're talking about real time, this is what we mean: being able to have these types of queries on the hot path for personalized recommendations.

Cool. So let's look at some simple examples, collaborative filtering first and then content-based recommendations, and what simple versions of these look like in a graph. There's a Will node. Will purchased the book called Data Structures. Emilia also purchased this book, Data Structures, and Emilia purchased this other book, Advanced NoSQL. If we're going to make a very simple recommendation given only this information, we could say: Will and Emilia bought the same book, so they have some interest in common. What has Emilia purchased that Will hasn't? Oh, this Advanced NoSQL book. This might be a good recommendation for Will. We can write that in Cypher, and it looks like this: find the Will node, traverse out along this PURCHASED relationship to find books that Will has purchased. Who else has bought the same books as Will? What other books are those people buying that Will hasn't bought? Recommend those to Will. That is the simplest form of collaborative filtering. There are a lot of things that we're missing here: there's no scoring or normalizing, and we don't really have a measure of how similar these other users are to my preferences. So there's a lot we can do to improve here, but this is a good basic start.

Now compare this with a content-based approach. Now Will bought a book, Data Structures, and it has a tag, or it's in this category, of Big Data. What are other books in this category? Oh, here's this Advanced NoSQL book. That is a book that's similar in content
to a book that Will has purchased. Let's recommend that to them. We can write that in Cypher. This is almost identical to the previous query, except we've just changed some of the labels and relationship types, but it's the same basic concept.

Now with content we can introduce this idea of a hierarchy. We know that Big Data is maybe a subtag or a subcategory of Data, and another subcategory of Data is Databases. So we can infer that there's some similarity between the category Big Data and the category Databases. In my product catalog I can go up one level, so I don't just have to recommend Big Data books to Will, but I can also recommend Databases books, because I know that that category is similar to Big Data, because I have some hierarchy that I know about my products. Again, we can write this in Cypher. There's just another traversal here, but again it's very similar to what we saw previously. It's important to point out here that if we don't have this information, if we don't know what the hierarchy of our book categories is, that's OK, because we can use things like natural language processing and graph clustering, which we're going to do in a little bit here, to try to infer hierarchies like this based on just text that we have that describes products. Again, this is just a very basic example to get started, and we're going to improve on this when we actually start working with real data.

Cool. I just want to point out a couple of resources. One is Neo4j Sandbox. Neo4j Sandbox is really cool: it allows you to spin up a Neo4j instance that's personal to you. You can choose from lots of different data sets, and that data is already loaded, and it has queries to guide you through. This is what we're going to use today. So even if you have Neo4j installed locally, we're going to use a custom Neo4j Sandbox instance that has the data set that we're going to use
and is laid out in a structured format for going through the exercises. Hopefully everyone's able to use that today. I also want to point out there's an O'Reilly Graph Databases book that talks a lot about the data modeling things that we talked about today and how to build a graph database application. So after this workshop, if you want to read more about it, that's a great book to get started with.

OK, cool. That's enough of me talking, so let's actually do some hands-on stuff with Neo4j. A little bit of logistics. We said that the slides are online; the ones I'm going through now you can get at bit.ly/neo4josconslides, all one word. But for right now the most important link on here is this first one, which is bit.ly/neo4joscon. If you go to this link, it will open up a hidden Neo4j Sandbox use case that has all the material that we're going to use for the rest of this workshop. So everyone, please, right now go to bit.ly/neo4joscon. When you do that, you'll see the sign-in process for the Neo4j Sandbox. You can sign in with Twitter, LinkedIn, Microsoft, GitHub, whatever else is on there, or you can just use email to create an account. If you have any issues signing in to that, I have all the materials on USB keys as well, but it'd be a lot faster and easier if you sign in. If you have any issues getting into that, flag me down in a minute and we'll get that set up for you.

Then the other link on here: we said that we're going to use the Python driver for Neo4j and some of the data science tools to do things like extract keywords and do some graph clustering, some community detection algorithms. So there is a Jupyter notebook that we're going to use, and that's at bit.ly/neo4jnotebook, but we'll get to that in, I don't know, maybe about an hour, after the break.

Cool. So for now I'm just going to walk through the
process for getting into the sandbox, to make sure that we're all where we need to be before we move on. When you go to the bit.ly link, it will redirect to this page that says Get started with graphs, Start now, Login. You can log in with Google, LinkedIn, any of these, or just create an account with email. Once you do that, you'll see a page that has lots of different sandbox instances, and we're going to use the OSCON 2017 Neo4j workshop use case. So click on Launch Sandbox, and then you'll see some graph trivia while that sandbox is spinning up. What that's doing is provisioning a Neo4j instance in a Docker container on some AWS EC2 instance and making that private to you. Once it spins up, you'll see this page that says Get started, visit the Neo4j Browser, with a link. You click on that, and that will open a new tab with Neo4j Browser and a guide for working with Meetup data.

Neo4j Browser is a query workbench for Neo4j: developer tooling that we can use to write queries and visualize the results. It's also very useful for embedding interactive guide content, which is what we're going to use today. On this image there are nine sections. Each one of these we're going to work through today. We probably won't have time to go through all of them; hopefully we can get through at least the first four in the next hour or so, and then you'll have these when you go home, so feel free to work through them. The goal for right now is that everyone should be on a screen that looks like this, with that Meetup logo and this thing that says Our schedule for the day. So if you want to follow along, please go ahead and do that right now. That's where we need to be in order to move on.

So let me just go ahead and do that. I'm going to go to bit.ly/neo4joscon. I already had one spun up here, so let me log out and sign in again. This is OK, you can sign in
with Twitter. Great. Now I'm presented with a bunch of different use cases, and I'm going to choose the OSCON Neo4j workshop use case. I click on this and it's spinning up a new Neo4j instance for me on AWS. Once it's ready, it's going to give me my credentials for accessing Neo4j Browser, which is right here. So I'm going to click on Visit Neo4j Browser, and I get this screen in Neo4j Browser.

Just to give a quick overview of Neo4j Browser: Neo4j Browser is a web application. In this case it's connected to my Neo4j instance running on AWS, but I could also have it running on localhost. Basically, I write Cypher queries up here. So match on all nodes and then return the count: how many nodes do I have in Neo4j? I have zero. If I had anything, I would have some sort of graph visualization. I can also click on this database drawer to inspect the node labels and relationship types that I have, but I don't have any data right now.

Another interesting thing I can do is save queries. Let's say I was so impressed by this query, match on everything and tell me how many nodes I have, that I wanted to save it into my favorites. I could click on this star, the favorite icon, and note here that I put a comment as the first line. If I hit the star, now in my favorites drawer, which is this one right here, even after I clear out the query, I have this super awesome query that I can refer to later. That's useful if you're working through a more complex query and you want to save it and come back to it later.

OK, and then what we're going to do is work through each one of these browser guides. You can see this guide loads automatically. It's kind of like a carousel: there are multiple panes to it, so I can click the left and the right arrows, and each one of these is the index for another chapter. So we're going to start at
the first one, which is Recommend Groups by Topic, where we're going to talk about how we would model and import this Meetup data. But before we go on, I want to make sure that everyone is at this place right now, looking at this Meetup screen. So we'll take just a couple of minutes to make sure everyone's there. If not, raise your hand and we'll try to get that set up for you.
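
The collaborative filtering example in the talk (Will and Emilia both bought Data Structures, so Emilia's other book is recommended to Will) can be sketched roughly as follows. This is a minimal illustration, not the query shown on the slides, which the captions do not preserve. The label and relationship names in the Cypher string (Person, Book, PURCHASED), the $name parameter, and the scoring by number of shared purchases are all assumptions; the Python function mirrors the same traversal over an in-memory dictionary so it can be run without a database.

```python
from collections import Counter

# Hypothetical toy data mirroring the Will / Emilia example from the talk.
PURCHASED = {
    "Will": {"Data Structures"},
    "Emilia": {"Data Structures", "Advanced NoSQL"},
}

# Assumed shape of the Cypher query described in the talk. Person, Book,
# PURCHASED and $name are illustrative names, not taken from the slides.
CYPHER_SKETCH = """
MATCH (me:Person {name: $name})-[:PURCHASED]->(b:Book)
      <-[:PURCHASED]-(other:Person)-[:PURCHASED]->(rec:Book)
WHERE NOT (me)-[:PURCHASED]->(rec)
RETURN rec.title AS title, count(*) AS score
ORDER BY score DESC
"""


def recommend(purchases, name):
    """Return (book, score) pairs for books bought by people who share
    at least one purchase with `name`, excluding books `name` already has."""
    mine = purchases.get(name, set())
    scores = Counter()
    for other, books in purchases.items():
        if other == name:
            continue
        shared = mine & books
        if not shared:
            continue  # no purchase in common, so no path through this person
        for book in books - mine:
            # One path per shared book, matching count(*) in the sketch above.
            scores[book] += len(shared)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(recommend(PURCHASED, "Will"))  # [('Advanced NoSQL', 1)]
```

As the talk notes, this simplest form has no normalization and no real measure of how similar the other users are; the raw path count here is only a stand-in for a score.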

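The content-based example and the category hierarchy described in the talk can be sketched in the same way. This is an illustration under stated assumptions: the dictionaries below stand in for graph relationships, the book titled Graph Databases and its placement in the Databases category are hypothetical additions for the demo, and the function names are invented here. With the hierarchy switched on, the lookup goes up one level from Big Data to Data and back down to sibling categories such as Databases.

```python
# Hypothetical catalog data. Only Data Structures, Advanced NoSQL and the
# Big Data / Databases / Data categories come from the talk.
IN_CATEGORY = {
    "Data Structures": "Big Data",
    "Advanced NoSQL": "Big Data",
    "Graph Databases": "Databases",  # assumed placement, for illustration
}
PARENT = {"Big Data": "Data", "Databases": "Data"}
PURCHASED = {"Will": {"Data Structures"}}


def recommend_by_content(name, include_siblings=False):
    """Recommend books in the categories of books `name` has bought.

    With include_siblings=True, also include categories that share a
    parent with those categories (one step up the hierarchy and back down).
    """
    mine = PURCHASED.get(name, set())
    categories = {IN_CATEGORY[b] for b in mine if b in IN_CATEGORY}
    if include_siblings:
        parents = {PARENT[c] for c in categories if c in PARENT}
        categories |= {c for c, p in PARENT.items() if p in parents}
    return sorted(
        book
        for book, category in IN_CATEGORY.items()
        if category in categories and book not in mine
    )


print(recommend_by_content("Will"))
print(recommend_by_content("Will", include_siblings=True))
```

The first call stays within Big Data; the second also reaches Databases through the shared parent, which is the extra traversal the talk mentions.
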
Info

Channel: William Lyon

Views: 11,151

Rating: 4.9172416 out of 5

Keywords: neo4j, graph database, recommender system, python, jupyter notebook, jupyter, data science

Id: wbI5JwIFYEM

Length: 39min 20sec (2360 seconds)

Published: Tue Jun 13 2017