Building a Knowledge Graph Using Messy Real Estate Data | Cherre

Video Statistics and Information

Captions
Hi everybody, I'm John Maiden from Cherre. I apologize in advance for my voice; I got a little too passionate talking about my company yesterday at our booth, so you're going to have to listen to me go in and out, but I've got some tea, so we're good. I'm a senior data scientist with Cherre. We're a real estate data technology company, so I'd like to quickly talk about what we do before we get into the business of knowledge graphs.

Cherre is a commercial real estate data company. We take data from multiple sources. Where does it come from? Public data sources: there are lots of great public data sources when it comes to commercial real estate, and if you live in New York City you should be proud of your city, there's lots of great data there; there's also some good national data available across the country. We take in third-party paid data sources, and we combine all of that with our customers' internal data. We put it through a data engineering pipeline, transform it, and then deliver it through a series of APIs. We have a couple of our great data engineers in the audience, so if you have questions later I'm sure they'll happily answer them. The idea is that fundamentally we're a data company specializing in commercial real estate data: we take that data, transform it, and deliver it to customers.

I should make a quick distinction in case you don't know the difference between commercial and residential. Residential, you're buying a house; commercial, you're buying a building: multifamily, retail, or office space. In terms of data, information, and business plans, residential is more like B2C and commercial is B2B: larger ticket items, and the data is more contained, not as readily available compared to somewhere like Zillow. So we take this data and we decide we're going to enhance it. We're going to put it into a knowledge graph: take all the different data sources we need across multiple data sets, combine them together, and extract information and insights from the data we can provide.

Before we get into the details, I want to take a step back and talk about knowledge graphs. I know this is becoming a more common topic in the industry, so first I want to applaud your courage in coming to another talk about knowledge graphs. What is a knowledge graph? I'll go through a very simple definition. Being the smart person that I am, I googled it first. The first answer is Google's Knowledge Graph. If you're familiar with it, it's all the data Google has collected that helps drive a lot of their products, and it's defined as a knowledge base with a graph structure. Okay, there's some information here, I can work with that. Maybe I want to go a little deeper into the definition, so I go to Wikipedia, and Wikipedia starts giving you a much, much more technical definition. Once I saw "ontology" I actually ran the other way; it was more than I wanted to handle. Luckily I found a great article called "WTF is a knowledge graph?", which was just at my level of information.

So why do you care about a knowledge graph? Why not just keep all your data in databases, which is traditionally how data is stored? The power of a knowledge graph is that it's very easy to visualize: data is connected together via relationships.
If you think about how graphs work, you've got a node, you've got another node, and an edge connects them. When you visualize this in terms of data it's, for example: John Maiden is a speaker; John Maiden is connected to Data Council; he's a speaker at New York City 2019 on the future of data science track. Now extrapolate that and ask: for all the people who spoke at Data Council today, what other conference talks have they given in the past year? With the traditional database approach, I'd have to get all their names, take probably another table, maybe two or three tables, join them together, and do a couple of aggregations. The traditional database structure does not work well when we're trying to look at how data is connected: it's going to be spread across multiple tables, people in one table, companies in another. That's all perfectly well organized the way a traditional database works, but if you actually want insights from it, it's just not as easily developed. We like the knowledge graph because it's natural, it's easy to visualize how data is connected, and it's also easy to add more data: I don't have to worry about adding something to the people table or the company table or the conference table, I can just add more rows and say this is a person connected to a conference. And its strength is that it's traversable: I can easily go from one connection to another connection to another connection.

The whole point of a knowledge graph is that we want to extract information from it, so if I want to develop a knowledge graph, I first have to think about what questions I'm trying to answer. The questions I want to answer are related to commercial real estate, and the first one we're answering right now, a very important one for our customers, is: who is the property's true owner? You would think you could just look this up: with a phone book I can say this is the person who lives at this address, so they're most likely the owner if they're not renting. When it comes to commercial real estate it gets trickier. It almost became a little easier recently, but then it became tricky again. For most commercial properties, each building is owned, for various reasons, under a separate LLC. I happen to work at 989 6th Avenue; theoretically the owner of 989 6th Avenue is 989 6th Avenue LLC, and if the same owner has the building next door, that one is held under its own LLC named after its address. Very, very creative naming. So the question to ask is not who owns the building; what you really care about is who owns 989 6th Avenue LLC.

An interesting side note: if you've been following New York City or New York State news recently, the New York State tax department issued a rule that all buildings had to list the real name of the owner, so every property had to be connected to some person who could be used to identify ownership. They backtracked really quickly; the commercial real estate industry pushed back very heavily, and now, I believe, it only applies to condos. In New York State you have to list who the true owner of a condo is, but otherwise every separate building can still be owned by its own LLC.
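To make the ownership question concrete, here's a minimal sketch of the traversal in Python with networkx. The node names, the two-hop assumption, and the `true_owner` helper are all invented for illustration; the real graph would be built from deeds, tax records, and corporate filings.

```python
# A property is connected to the LLC on its deed, and the LLC is
# connected to the people behind it. "Who is the true owner?" becomes
# a short walk from the property node to the nearest person node.
import networkx as nx

G = nx.Graph()
G.add_edge("989 6th Ave", "989 6th Ave LLC", rel="owned_by")
G.add_edge("991 6th Ave", "991 6th Ave LLC", rel="owned_by")
G.add_edge("989 6th Ave LLC", "Jane Roe", rel="managed_by")
G.add_edge("991 6th Ave LLC", "Jane Roe", rel="managed_by")

def true_owner(graph, prop, people, max_hops=2):
    # Breadth-first out to max_hops; return the first person we reach.
    dists = nx.single_source_shortest_path_length(graph, prop, cutoff=max_hops)
    for node, dist in sorted(dists.items(), key=lambda kv: kv[1]):
        if node in people:
            return node
    return None

print(true_owner(G, "989 6th Ave", people={"Jane Roe"}))  # -> Jane Roe
```

The same structure answers the follow-up questions: once you're standing on the person node, walking back down through every LLC they touch gives you their whole portfolio.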
Whether or not that rule survives in New York State, nationally you're still going to have the same problem, and a lot of places aren't going to have the same depth of data that New York State, or New York City specifically, offers. So if you want to take this beyond New York and expand it nationally, you have to come up with some way to answer this question across all the different data sets you want to join together. Now, once you've extrapolated, so you've gone from property up to listed owner up to true owner, you can go back down and ask: now that I know the true owner of this building, what other buildings have they bought and sold in the past five years? If you then also look at the data, a lot of properties, when they're transacted, will list who the mortgage lender is. I can then go tangentially and look at service providers: which lender has financed a whole bunch of properties in New York City over the past year, and which of those properties has seen a huge number of defaults? I can add insights that go beyond the properties themselves; anyone connected to them is, theoretically, information I can extract and use in this graph.

Beyond getting basic information out of the data, the next thing we care about is strategy. Now that I know what a party has bought and sold over the past five years, I can ask: do they have a strategy? Do they only look at small multifamily homes? Do they only invest in Queens? Do they only care about Tulsa, Oklahoma? Do they only ever work with this one lender, or only ever partner with this other company? There's so much information in the connections and the history. This is all time-dependent: in knowledge graphs, a lot of information changes over time, and you want to capture that as the data expands and you gather more information.

And once you've got a graph structure, you get excited, and maybe you start playing around with things like graph2vec, taking this and starting to build models. If you're familiar with residential real estate, if any of you own homes, the typical approach to how much a house is worth is: look at your neighbors. If any of your neighbors have sold recently, take those numbers and average them; similar houses, similar number of bathrooms, similar number of bedrooms, average it, and that's the value of your house. Buildings are a little harder, because buildings aren't exactly standardized; there's no Levittown of building development, so finding exact matches is much harder, and it's still a very manual, appraisal-based process. But if you have all of this data collected, all the components of a building, how they're all connected, and the strategies of the companies doing the buying and selling, you can come up with more appropriate valuations and good property comp models. There are a lot of interesting valuation metrics that can be built off the data once you've collected it into a giant knowledge graph.
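As a sketch of the buy-and-sell history question, here's what the time-dependent part might look like: transaction edges carrying dates, filtered to a window. The node names, attribute names, and helper function are invented.

```python
# Transactions as dated edges in a multigraph (an owner can transact
# on the same property more than once), then a time-window filter.
from datetime import date
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("Acme Holdings", "123 Main St", kind="bought", on=date(2017, 3, 1))
G.add_edge("Acme Holdings", "45 Queens Blvd", kind="bought", on=date(2012, 6, 15))
G.add_edge("Acme Holdings", "123 Main St", kind="sold", on=date(2019, 9, 30))

def transactions_since(graph, owner, since):
    """All (property, kind, date) transactions by `owner` on or after `since`."""
    return [
        (prop, data["kind"], data["on"])
        for _, prop, data in graph.out_edges(owner, data=True)
        if data["on"] >= since
    ]

print(transactions_since(G, "Acme Holdings", date(2014, 11, 20)))
# -> only the 2017 purchase and the 2019 sale; the 2012 deal drops out
```

Aggregating the surviving edges by borough, property type, or lender is then straightforward, which is where the "do they only invest in Queens?" strategy questions come from.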
So let's look at the actual data. Just collecting all the data we have for New York City turns out to be tens of millions of nodes and tens of millions of edges. It is a very, very large graph. It's not a very connected graph; you'll see some clusters, but it's also very distributed. What I showed here was me choosing one specific property and then looking at all the first and secondary connections off that one node. If you imagine how big this gets as it grows larger and larger, it's not something where you can visually identify what you want. What you want to be able to do is put it into a nice graph and then specifically query for the information you need from it.

So how do we categorize our graph? The objects we put into the graph are things like properties: definitely all the properties in New York City, and every property gets its own type of node. Anyone who is a person connected to those properties also gets their own type of node. Corporations get included. Contact information is also its own node type: phone numbers, emails. If you think about how this could be structured, a person doesn't have a phone number as an attribute; a person is connected to a phone number. That means that as you build out your graph, if other people have that same phone number, or that same email address, you know how they're connected (there's a small sketch of this below). All of this information has equal footing in the knowledge graph, so the attributes on each node are pretty minimal, because most information connected to a node is a node in itself.

Dialing this down, taking it back to a single node: this is what we'd love a knowledge graph to look like. I should be able to say: this is my property, and here are the nodes connected to it. Specifically, this is a restaurant in the West Village. Smartly, the family that owns the restaurant also happens to own the building, which is a good suggestion if you're in the restaurant business, and they happen to own a couple of other restaurants too. The red node is the property, the blue nodes are some of the corporations they've created to own this property and their other restaurant properties, and the orange nodes are family members who have been associated with these properties over the past thirty or forty years. I'm not naming the restaurant, protecting the innocent; probably a good place, haven't been there.

So if you want to build a good knowledge graph: we started with the what and the why; the next part is, how do I build it? I have to think about all the different types of data sources that I can, hopefully cleanly, pull in (no one heard that) to build a really powerful knowledge graph. The traditional commercial real estate picture is two people staring out into the distance: everything looks clean, nice, new, all perfect. So how do we translate this into data? This is all potential data we can obtain, specifically for New York City, but it can be extrapolated nationally. All property transactions are recorded and stored in a database: property A went from this party to that party, you have the date, the amount it sold for, and the mortgage lenders associated with it if there are any involved; all of this is captured by the city. Everybody pays taxes, and if you pay taxes, your property information and your tax information are publicly reported. New York City gathers that, so you can find out how much New York City assessed your taxes at, what they think your market value is going to be, how much you've paid, how much you haven't paid, and if you've got any exciting abatements, those all get pulled out too.
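Here's that contact-information sketch: phone numbers and emails modeled as first-class nodes rather than attributes, so two people who share a number are automatically linked through it. All names and numbers are invented.

```python
# People are linked to a phone-number node instead of carrying a
# "phone" attribute; shared contact info then becomes a shared node.
import networkx as nx

G = nx.Graph()
G.add_node("John Doe", kind="person")
G.add_node("Jane Doe", kind="person")
G.add_node("+1-212-555-0100", kind="phone")
G.add_edge("John Doe", "+1-212-555-0100")
G.add_edge("Jane Doe", "+1-212-555-0100")

# Who shares contact info with John Doe? Walk person -> phone -> person.
for contact in G.neighbors("John Doe"):
    if G.nodes[contact]["kind"] == "phone":
        others = [p for p in G.neighbors(contact) if p != "John Doe"]
        print(contact, "->", others)  # +1-212-555-0100 -> ['Jane Doe']
```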
If you're filing building permits, or any permit you're requesting from New York City, you're going to have a contact person, a contact address, maybe a contact phone number. And the city obviously lists all the properties it currently owns. So if you want to find out who the largest property owners are: Columbia is actually one of the bigger ones, NYU is one of the bigger ones. This is stuff you can find out quickly if you've got it all collected; you can find out that the Trustees of Columbia University own all this property. I think the New York Times did a series on this a couple of years ago; I can't remember which of the two owns more buildings, but one of them owns more and the other has the more valuable property. I think that's NYU, because it's obviously downtown.

So for New York City, who are the major data providers? Think back to ancient Mesopotamia: how did writing develop? Writing developed for tax purposes. The government had to know how much you had so it could determine how much you had to give them. Who is the biggest data provider in New York City, thanks to the open data initiative? The Department of Finance. They provide all the property transaction data, they provide all of the tax assessment data, and there are tons of interesting data sets they provide because they're the Department of Finance. They're also really good to reach: if you email them, they're very good about responding quickly. The Department of Buildings is also very important and contributes a good portion of the data, but the Department of Finance is the main driver. That also means, from a knowledge-graph-building perspective, that the main unit of property assessment is going to be the tax lot. Not the unit, not the building. The Department of Buildings works with buildings, but the Department of Finance thinks in tax lots, because at the end of the day that's what you buy and sell: you're not actually buying or selling a unit, you're buying and selling a tax lot. And that's how many governments nationally think about properties: they're not really properties, they're tax lots. A tax lot could be the land, it could be a unit, it could be a building, and all of these things can overlap. Air rights can be a tax lot; the garage space can be a separate tax lot. Anything that can be cut into pieces and sold or rented off to someone is a tax lot, possibly part of a larger tax lot.

So how do we translate this data into a graph? Take a row from PAD, the Property Address Directory for New York City. I have a BBL, the borough-block-lot identifier, which is the tax lot identifier, and PAD connects it to a specific address. That's one connection, one pair, in my knowledge graph. Then I can look at ACRIS, the Automated City Register Information System (I've had to look that up five times to remember what ACRIS stands for): ACRIS is all of the property transactions, who sold and bought what in New York City over many, many years. You can connect these on the BBL, and you now have two nodes that are connected via a third node.
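As a sketch, the PAD-to-ACRIS linkage is just a key join on BBL before it becomes graph edges. The column names and the BBL value below are simplified stand-ins for the real datasets.

```python
# Join PAD (BBL <-> address) to ACRIS (transactions keyed by BBL);
# each matched row becomes an address-(BBL)-transaction path in the graph.
import pandas as pd

pad = pd.DataFrame({
    "bbl": ["1008310001"],
    "address": ["989 6TH AVENUE"],
})
acris = pd.DataFrame({
    "bbl": ["1008310001", "1008310001"],
    "doc_type": ["DEED", "MORTGAGE"],
    "doc_date": ["2017-03-01", "2017-03-01"],
})

edges = pad.merge(acris, on="bbl")  # the shared BBL is the linking node
print(edges)
```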
Now if that were the end of the conversation, you'd say: oh, I just take the data, I've got primary keys, I join them all together, and I've got myself a knowledge graph. That would be so easy. I wish it were. It is not that easy, and the reason is that the data is not standardized. What does data standardization mean? It means data is messy. First of all, it's been collected over many, many years; I've seen data going back to the 1940s, so some of this has been transcribed into systems. Even today not everything is electronically typed in; someone writes it down and someone else has to put it into a database. Think about how many different ways you can write a person's name, how many ways you can write an address. When I was growing up I lived on a street called Berkeley. According to the town I grew up in, it's spelled B-E-R-K-L-E-Y. My parents demanded that it be spelled B-E-R-K-E-L-E-Y, and every time they addressed something it always had that extra E, because that's the way they did it. The postman still figured it out; he knew how to get to our house, he knew exactly where it was, even if the spelling was slightly off. And that's what you have to do: with all this data, you have to worry about all the different ways people write personal names, company names, and addresses, because even a little noise, a few typos, or an alternate spelling all have to resolve to the same nodes; otherwise you're going to have a really, really noisy graph, and you won't get any knowledge or insights from it.

So, people and corporation standardization. These are somewhat similar but also distinct. You have to be able to recognize that different data sources will write names differently: "John W. Maiden" also has to be mapped to "Maiden, John W." Categorization is important too: a lot of data sets will try to categorize whether they think something is a company or a person. They're not always accurate; they do well in certain circumstances, but you also have to add your own spin. As you develop a knowledge graph you have to decide when you consider something a person versus a corporation, because it's not always that distinct. A good example would be "the revocable trust of John Maiden": in this case I'm going to say it's a person, because it's connected to John Maiden the person. Now, "John Maiden LLC" would distinctly be a corporation. Maybe I come up with a pattern and say: any time I see the word "King", as in "John King", that's a person; that works until I hit "Burger King", and then I'm in trouble. So you have to really think about your categorization, think about how to distinguish all of these, and handle the fact that you'll have lots of variants. Here's a tricky one: "Grant Herman" is actually part of a legal firm that represents a lot of real estate companies around the city. How do I make sure all the different permutations of writing that name get listed as a corporation? And if I want to be more precise, I can say it's specifically a lawyer, a service provider, which distinguishes it from, say, a REIT that's going to buy and sell property. And finally, common names. John Smith is not the most common name in New York City (if you're curious, you can ask me later at the Q&A), but every city will have its own John Smith, and you have to make sure your John Smith doesn't appear to own 30% of New York City's property. You have to distinguish this John Smith from that John Smith.
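Here's a toy version of that categorization step: regex patterns for corporate suffixes, with person-linked trusts carved out first. The patterns are illustrative only, not Cherre's actual rules, and the last example deliberately shows the Burger King failure mode.

```python
# Classify an owner name as person vs. corporation with simple patterns.
import re

CORP_PAT = re.compile(r"\b(LLC|L\.L\.C\.|INC|CORP|LP|TRUST|ASSOCIATES)\b", re.I)
PERSON_TRUST_PAT = re.compile(r"\bREVOCABLE TRUST OF (?P<person>.+)", re.I)

def categorize(name: str) -> str:
    m = PERSON_TRUST_PAT.search(name)
    if m:                       # "revocable trust of John Maiden" -> the person
        return "person:" + m.group("person").title()
    if CORP_PAT.search(name):   # corporate suffix -> corporation
        return "corporation"
    return "person"             # default: assume a personal name

for n in ["REVOCABLE TRUST OF JOHN MAIDEN", "JOHN MAIDEN LLC",
          "JOHN KING", "BURGER KING"]:
    print(n, "->", categorize(n))
# BURGER KING comes out as "person": the exact trap from the talk, and
# the reason reference data (a list of known corporations) matters.
```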
When it comes to name and corporation standardization, a lot of this is not that uncommon. A set of good regex patterns will clean up eighty or ninety percent of what you're looking for. So pull in all your data, look at the common patterns, and you can come up with lots of regex patterns to extract with. Once you've come up with enough patterns that you think are good, you can use a trick like character n-grams with XGBoost: you train on the patterns you already know you want to extract, apply n-grams to all the remaining strings, and then use a model like XGBoost to predict other patterns that are similar.

When it comes to name disambiguation, one of the tricks I happen to like is dumping everything into a graph. Choose your favorite fuzzy string algorithm, take anything that matches above a certain threshold, put those matches into a graph, and then you'll see all the different permutations (there's a sketch of this below). You have to think about it like this: "John W. Maiden" might be similar to "J.W. Maiden", which might be similar to "Maiden, John W." or "Maiden, J.", and if you try to compare everybody against everybody, you might be in trouble. If you can at least dump it into a graph, you can find the connected components and from there maybe do a little manual editing, depending on how big the graph is. But at the end of the day, the most critical thing is good reference data. One of the themes that should come through here is that knowing your data very well, understanding, in this case, commercial real estate data and all its peculiarities, is critical to being able to distinguish what makes sense for this domain. Having good reference data, like a listing of all corporations in New York City, is very important, because it lets you quickly filter out names that look like they might be a corporation but obviously aren't, and vice versa, from everything else you've got.
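Here's that disambiguation sketch, using difflib from the standard library as the fuzzy matcher and networkx for the components. The threshold, the token-sorting trick, and the names are all illustrative.

```python
# Fuzzy-match every pair of names, keep matches above a threshold as
# edges, and read off connected components as candidate entities.
from difflib import SequenceMatcher
from itertools import combinations
import networkx as nx

names = ["JOHN W MAIDEN", "J W MAIDEN", "MAIDEN, JOHN W", "JANE SMITH"]

def similarity(a: str, b: str) -> float:
    # Sort tokens so "MAIDEN, JOHN W" and "JOHN W MAIDEN" line up.
    key = lambda s: " ".join(sorted(s.replace(",", "").split()))
    return SequenceMatcher(None, key(a), key(b)).ratio()

G = nx.Graph()
G.add_nodes_from(names)
for a, b in combinations(names, 2):
    if similarity(a, b) > 0.8:
        G.add_edge(a, b)

for entity in nx.connected_components(G):
    print(entity)
# -> {'JOHN W MAIDEN', 'J W MAIDEN', 'MAIDEN, JOHN W'} and {'JANE SMITH'}
```

Pairwise comparison is quadratic, which is fine for a demo; at millions of names you'd block on something cheap first (first letter, associated zip code, etc.) so the fuzzy comparisons only happen within blocks.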
This is the part where I'm going to jump into a little more detail, on something I've spent a lot of time on over the past few months: address standardization. Addresses are actually a little trickier, because there's a lot more going on. There are a lot of words in addresses, and addresses have clearly identified components (so do people and corporations, but not as many as addresses). You can have lots of abbreviations and alternative names: again, 989 6th Avenue could also be 989 Avenue of the Americas, and we'd still get our mail. People can misspell things or spell them alternatively, and USPS still knows how to get to you. Sometimes there are just obvious typos: if I wrote "989 6th Avenue, New York City, New Jersey", I obviously meant New York, and that should be quickly correctable if there happens to be no matching address in New Jersey. And one of the other very common issues is that we're ingesting data from multiple systems, and sometimes they don't delimit the components very well. The most common is street units, "St Unit" or "Ave Unit", where the post type gets fused with the apartment prefix, so knowing how to accommodate common truncations is also important.

Addresses are also going to come in from different sources. If it's coming from New York City, there may be a very clear "this is the address" field, and you just pull in that field. If we're working with customer data, they're going to give us a collection of property leads: some spreadsheet that's been sitting around for ten years with a hundred thousand addresses in it, and it'll have everything they put in there, personal names, company names, building names. You have to identify and strip those out as well to get to the actual address.

The general approach we take is that all addresses go through three parts: parse, standardize, and match. Parsing is one of the harder components; it's actually an NLP problem. You want to take all the different components of an address, identify them, and tag them with the appropriate identifiers. Taking, yet again, our address, 989 6th Avenue, floor 17 (please come and visit, we're always happy to see people): if I pass it into word tokenization in NLTK, NLTK tells me I have numbers and nouns. Not particularly useful. What I really care about is address numbers, street names, directions, city, state, zip: things I can actually match on and work with going forward. So I want some type of algorithm or model that can say: 989, because it's at the beginning of the string, has to be an address number; after that, "6th" has to be a street name, even though it has a number in it; and if I get all the way to the end and find a number there, that's going to be a zip code, which is different from an address number or an occupancy identifier. We could be on floor 17, but we could also be on floor 17A, so it has to be something dynamic that doesn't just say "a number is a unit number"; a "number" could also have letters in it.

The standardize component is: once I've taken the string, broken it up, and identified my components, I put it into some type of business schema. This is actually kind of flexible; you as a company can determine the best way you want to identify addresses. There are a couple of really good standards out there, but there's no one universal standard for how addresses should be represented. If you'd like to be able to mail something, hopefully you'll match the USPS standard. Generally it's taking your components and saying: any time I see "FL" identified as a unit type, that becomes "floor"; "NYC", identified as a city, becomes "New York". I've already categorized everything, so once it's categorized I can quickly make my substitutions. Parse was hard; standardize is easy.

Match is also very hard, and it can be hard because, yet again, you've got lots of typos. Once I've gotten through parse and standardize, all I know is that this thing looks like an address. I don't know yet whether it's a real address; I just know it's something shaped like an address. So how do you get this done? Like I said, I've broken it into components, so I could say: 123 is the street number, Main Street is the street name, New York is the city, New York is the state, 10001 is the zip code, and then do a SQL join: find me all reference addresses where every component matches. You join it together, you've got your answer, nice and straightforward.
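As a sketch of parse-then-standardize, here's the open-source usaddress library, a CRF-based US address parser, followed by a tiny lookup-table pass. The schema mapping below is invented, standing in for a company's own business schema, and the exact labels usaddress returns may differ slightly.

```python
# Parse: tag each token with an address component label, then
# standardize: substitute canonical values per component.
import usaddress

STANDARDIZE = {"FL": "FLOOR", "AVE": "AVENUE", "NYC": "NEW YORK"}

raw = "989 6th Ave Fl 17, NYC, NY 10018"

tagged, _ = usaddress.tag(raw)      # OrderedDict of label -> value
print(dict(tagged))
# e.g. {'AddressNumber': '989', 'StreetName': '6th',
#       'StreetNamePostType': 'Ave', 'OccupancyType': 'Fl',
#       'OccupancyIdentifier': '17', 'PlaceName': 'NYC', ...}

standardized = {
    label: STANDARDIZE.get(value.upper().rstrip(".,"), value.upper())
    for label, value in tagged.items()
}
print(standardized)                 # easy once components are labeled
```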
You don't always have that, right? Nothing is standardized exactly the same; sometimes you have to allow flexibility. So then you start saying: okay, maybe I can do something with a SQL join, but I add in some business logic. I don't get an exact match, I get a whole bunch of close-enough matches: I join on the components I know are going to be clean enough, do some aggregations, and then choose the best answer and return that. Still good, not too much work. If you want to go crazy, and I can tell you I spent a couple of weeks going crazy, you can also try fuzzy joins: taking your string and matching it against all possible combinations, assuming people can make lots of different small mistakes, getting the street number wrong, getting the direction wrong. These are all things you can encode and then try to match against your database of addresses.

So, getting to the technical aspects of how we do address standardization. Parsing: if you start doing some Google searches, which is how I started (always important to start data science with Google; not an endorsement, they're a good company though), a lot of people say "I've got these amazing regex scripts, I can identify any address you want using regex." If you're using regex to identify addresses, you're probably working in a very limited domain, or more power to you, I hope you have a good solution; we work with slightly messier data, so regex doesn't help. You can talk about hidden Markov models; they're a big thing people have used in the past. Our preference is conditional random fields, because they're easier to train, it's easier to add features, and they've been working very well (the usaddress library sketched above is one CRF-based parser). I know someone has also open-sourced a neural network approach for standardizing Australian and US addresses, so there are a couple of different options out there you can work with, train, develop, and hopefully find something good.

Standardize, like I said, is usually the easiest part, because you've already identified your components. This is standard regex and lookup tables: "Ave" goes to "Avenue", nice and easy. Yet again it demands business knowledge: you have to determine the schema for your addresses and the rules for how everything gets mapped and identified.

And then matching, the last part. You can do SQL joins if your data is clean enough. I do a lot of user-defined aggregation functions; I'm a big Spark user, so a little bit of joins, a little bit of business logic around the joins, back and forth until you get the right results. And then there are other fuzzy joins, for example locality-sensitive hashing, if you're willing to try things at scale and think you can boil the problem space down. I'd recommend a cascade approach, because there are a lot of addresses; for us, we've actually got billions of addresses in our address database, and if you're trying to match against billions of addresses, you want levels of difficulty. The easy stuff: if you can match on a pure SQL join, go ahead. If you can do it with a little bit of aggregation and you're still there, good. And the really, really hard stuff, because fuzzy joins take a lot of performance, would probably be a third pass. Generally, you have to think easiest to hardest to get the data.
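Here's a sketch of that cascade in miniature: an exact join first, then a fuzzy pass over only the leftovers. It's written with pandas and difflib for brevity; at billions of rows this would be Spark jobs, with something like locality-sensitive hashing replacing the brute-force fuzzy step. All data is invented.

```python
# Pass 1: exact join on all parsed components.
# Pass 2: fuzzy-match the leftovers' street names within the same zip.
import pandas as pd
from difflib import get_close_matches

reference = pd.DataFrame({
    "street_no": ["123", "989"],
    "street":    ["MAIN STREET", "6TH AVENUE"],
    "zip":       ["10001", "10018"],
})
incoming = pd.DataFrame({
    "street_no": ["123", "989"],
    "street":    ["MAIN STREET", "6TH AVNUE"],  # typo survives parsing
    "zip":       ["10001", "10018"],
})

exact = incoming.merge(reference, on=["street_no", "street", "zip"])

leftover = incoming[~incoming["street"].isin(exact["street"])]
for _, row in leftover.iterrows():
    candidates = reference.loc[reference["zip"] == row["zip"], "street"]
    best = get_close_matches(row["street"], list(candidates), n=1, cutoff=0.8)
    print(row["street"], "->", best)   # 6TH AVNUE -> ['6TH AVENUE']
```

The ordering matters for cost: the exact join catches the bulk of rows cheaply, so the expensive fuzzy logic only ever sees the hard residue.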
So what are some of the lessons we've learned doing this? I'm a big fan of understanding the data. Building one neural network that fits all: I just can't do it; I'm sure there are people smarter than I am who can. I prefer to understand commercial real estate data better than others, so I can make sure my models and my approaches fit it very well. Humans are very important in this conversation: we do a lot of manual review, and it has to be done at scale, because we're processing millions of addresses, so finding smart ways to sample the data is also very important; being smart about looking through it and determining your edge cases, what's working, what's not, and how a customer is going to view it. As you're evaluating, there's the question of whether you're evaluating your system for an internal user versus an external user, because an internal user might have a different interest in what counts as success than an external one, so you might need different sets of metrics to determine success. And with data science, you have to live with ambiguity: there are some addresses you're never going to match, but if you get most of the addresses right most of the time, you're doing better than most. Thank you. Any questions?

[Audience question, partly inaudible, about evaluation metrics] Well, we're looking mainly at coverage, obviously, so how many go in versus how many come out. There are a couple of different available solutions we benchmark against, but in the end it's human verification for accuracy: just because we said we matched it doesn't mean we're actually close, so you have to be smart about testing at scale.

[Audience question, partly inaudible, about the master address list] So what we're doing is: to be able to match everything, you have to have a set of clean addresses to match against. The thing we built out first was pulling in as many different clean address sources as we could and using those to build the set of addresses to match against. Did that answer your question?

[Audience question, partly inaudible, about the fuzzy matching implementation] At the moment, basically locality-sensitive hashing. I was using Spark a little, but it was giving me some weird issues, so I tried to cook up something myself.

[Audience question, partly inaudible, about deciding which components are allowed to differ] Well, I think part of it is that you've already broken everything into components, and this is where the business logic has to come in. You have to think about whether it's more likely that someone made a typo in the street number or the street name; or maybe they got the street name and the street number right but the zip code wrong. There's a hierarchy of what a human might get wrong, and what I have is orders of exactness: if I can match exactly on the unit and the full address, that's great; if not, I try to remove certain components and make the match less and less specific until I get something I'm at least confident about.

[Audience question, inaudible] Not at the moment, but it's a good point; it's something we should be thinking about.
[Audience question, partly inaudible, about who asks for true-owner data] So, specifically for ownership, it's brokers asking. Brokers want to be able to say: I found a building, I think it's beautiful, I've got a client that would love it, and I need a contact for someone in that building. Normally, unless they really know the building well, they're not going to find out who that is. So this at least gets them an introduction, so they can then work on helping their customers.
Info
Channel: Data Council
Views: 3,297
Rating: 5 out of 5
Keywords: machine learning, computer vision, AI, big data, technology, engineering, software engineering, software development
Id: Bp38pYrpdSY
Length: 35min 6sec (2106 seconds)
Published: Wed Nov 20 2019