Using LOAD CSV in the Real World | Nicole White, Data Scientist, Neo4j

Captions
So today's webinar is called "Using LOAD CSV in the Real World," and I thought the best way to do that would be to download a CSV file from the internet more or less at random and import it into Neo4j. I went to data.gov, clicked on their Data tab, and this is the first thing you see: the Consumer Complaint Database. If you want to follow along, you can pause this video and download it right here; that should only take a few seconds. I've already downloaded it and opened it up in Excel.

Today is going to consist of going through the CSV file and understanding the columns; then I'm going to draw out a data model, and then we're going to write the import query to bring it into Neo4j. This isn't a modeling webinar, so the drawing of the data model is just to help visualize what I want to do. There are, of course, a million ways you could model this data, and some might be better than others; I'm just going to show my plan for today so that everyone understands what my import query is doing.

When you open the CSV file, you will first see this Complaint ID column. Every row in this CSV is a unique complaint, so every row will have a different number in this column. Then we have the Product column, the product the complaint was about, and here we see repetition: there can be multiple complaints about the same product, debt collection for example. Then there's a Sub-product that the complaint was about, and an Issue that the consumer had with the product, "account opening, closing, or management" for instance. Clearly there's more repetition here, since an issue can be complained about more than once, and similarly for sub-issues. So the sub-products are subsets of the products, and the sub-issues fall under an issue category.

Then we have Date Received, the date the complaint was received. It's important to note that I do have some columns hidden; you can see them sitting there if you have a good eye. We're just not going to be using them today, but you'll see more columns in your CSV. This Date Received appears to be month/day/year where the year is a two-digit number, but if you click on a cell you'll see it's actually a four-digit year, so don't be fooled by how Excel displays things to you. We also have a Company column, the company the complaint was against; more repetition, since there can be more than one complaint per company — here are two for Flagstar Bank. Then there's the company's response, whether it's "In progress," "Closed with explanation," "Closed with non-monetary relief," and so on. And finally, over here, we have whether it was a timely response and whether the consumer disputed the company's response to their complaint.

With that, let's draw out a data model so we have an idea of what we want to do. Again, every row here is a unique complaint, and I want a node for each complaint with its ID. I also want the date of the complaint — the date it was received — on the complaint node, but Neo4j doesn't support date-typed properties, so I usually split a date into three separate integer properties: year, month, and day. I'll show later how I split that date column up to get these three properties on the complaint node, but it's something I find myself doing constantly: when I have a date, I store it as three separate numeric properties so I can filter on it later.

So every complaint will be a node, and every complaint has a company that it's against, so it will point to a Company node that has a name, Bank of America for example; we'll say the complaint is AGAINST the company. There's also a response to the complaint, so we're going to create a Response node with a name like "Closed" — and I think "In progress" is in there too. Each Response node is going to be unique by the name of the response, so there will be a uniqueness constraint on that name property. Similarly, the Complaint nodes will have a uniqueness constraint on the ID, the Company nodes will be unique by their name property, and that's going to be true for pretty much all the nodes today. As for the per-complaint part of the response — whether the response was timely or whether it was disputed — I'm going to store that information on the relationship itself, as boolean true/false properties. So the response, "In progress" or "Closed" or whatever, is TO this complaint, and on that TO relationship we record whether it was disputed and whether it was timely.

Then all the complaints have issues, so the Issue node will have a name property, which is just the issue itself, and I'm going to call the relationship WITH: a complaint WITH this issue. A complaint is also ABOUT a product, and the Product node will also have a name that it's unique by. So: a complaint AGAINST a company, WITH this issue, ABOUT this product. We also want to do the sub-products, with whatever their name is, and say the complaint is ABOUT that sub-product as well; similarly, the sub-issue will have a name property and the complaint is WITH that sub-issue.

Again, this isn't a modeling webinar; I'm just trying to visualize, after looking through the CSV file, how I want to model this. I'm also going to model an IN_CATEGORY relationship, so that if I want to ask "what are all the sub-products of some product?" I can just traverse a handful of relationships instead of having to scan through every complaint to get that hierarchy — that would be an expensive query for a simple question. And that's pretty much it; this covers all the columns I want to deal with today: a complaint is against a company, there's a response to the complaint that was disputed or timely or not, and the complaint is with an issue, with a sub-issue, about a product, and about a sub-product.

With this visualization we can start writing the import query. But before I write it, it's always good practice to ensure that what you think is in your CSV file is actually in your CSV file, because, like I said, Excel might display things differently, and who knows, there could be weird characters you weren't expecting. So it's good to load just the first row here in the browser and see what LOAD CSV is seeing — and you have to use LIMIT 1, otherwise you're going to have a bad time. Here's the first line from my CSV file, and we see that the year on the date is four digits. You can also see some of the columns I had hidden in my Excel spreadsheet — I had ZIP code and Submitted via hidden, but LOAD CSV displays them here. It's important to understand what LOAD CSV sees versus what you see. So here I'm getting an idea of what I'm dealing with.
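A quick sketch of that first-row sanity check — the file URL here is a placeholder for wherever your copy of the CSV lives:

```cypher
// Show exactly what LOAD CSV sees in the first row.
// Without LIMIT 1 this would stream every row of the file.
LOAD CSV WITH HEADERS FROM "file:///path/to/Consumer_Complaints.csv" AS line
RETURN line
LIMIT 1;
```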
Now I also want to troubleshoot how I'm going to deal with this Date Received. I know I want to split it into three separate things, and that's pretty easy; I'm just going to make Cypher do the work for me. There's a trade-off between cleaning things up in some programming language — cleaning the CSV in Python or R or whatever — versus letting Cypher do the work for you. This one is so simple, I'm just splitting on a forward slash, so I'll have Cypher do it.

I'm going to get that Date Received column out as line.`Date received` — and it's important to note that if you have a space in your column header, you need to surround the name with backticks. So: WITH line.`Date received` AS date, RETURN date — this just gives me a string. Now I'm going to split the date on the forward slash — this is something you'll do a lot if you want Cypher to do any work for you — and that turns it into a length-three collection whose elements we can access by index: the first one is date[0], the second date[1], and so on. With that it'll be easy to split it up: I'll set the month to date[0], the day to date[1], and the year to date[2]. So on this first iteration I'm going to have Cypher do a little bit of dirty work for me.

With that, I just want to show the workflow: whenever I'm going to have Cypher do something tricky for me, I try it out on the first line, figure out how to do that part, and then copy-paste it into the big query I end up running later. I've already written this part, so I'm going to use the Cypher syntax LOAD CSV WITH HEADERS FROM the file path AS line — "line" is the alias — and then WITH line and this split AS date. But let me back up a bit here, because I'm forgetting my golden rule already.
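Trying the date handling out on the first line, as described, might look like this sketch (file path is a placeholder):

```cypher
// Test the date handling on one row before using it in the big import query.
// Headers containing spaces must be wrapped in backticks.
LOAD CSV WITH HEADERS FROM "file:///path/to/Consumer_Complaints.csv" AS line
WITH split(line.`Date received`, "/") AS date
RETURN toInt(date[0]) AS month, toInt(date[1]) AS day, toInt(date[2]) AS year
LIMIT 1;
```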
The golden rule of LOAD CSV is to always make sure you have indexes on the properties you're going to be merging on, and to use USING PERIODIC COMMIT, because otherwise it will try to do everything in a single transaction, and that's no good. USING PERIODIC COMMIT defaults to 1000, and you can think of it as committing every 1,000 rows. You can tweak this depending on your machine and a million other things specific to your use case, but I'm going to go with the default. So: USING PERIODIC COMMIT, LOAD CSV WITH HEADERS FROM this file path AS line.

But before I even get into that, I need to do something really important. All of these nodes are going to have a uniqueness constraint on them: the complaints are unique by ID, the companies by their name, the responses by their name. By creating uniqueness constraints you also create indexes on those properties. So I'm going to create a constraint on the Complaint nodes and assert that the ID is unique.

For this first pass through the CSV file, I'm only going to model the top part of the diagram, to show how easy it is to model incrementally. I don't have to import everything I've drawn all at once; say I'm only interested in this part of my model right now, and later I want to go in and add the products and issues. I just want to show how easy it is to incrementally add data to a graph that already has data in it. So for this first part I'm only going to do the complaints, the companies, and the response nodes — and of course their respective relationships. I'll create the uniqueness constraints relevant right now: a constraint asserting that the company name is unique, and a constraint on the response nodes. With those in place, I can start writing my query. This first one is going to be for complaints, companies, and responses.
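The constraints described above, in Neo4j 2.x syntax (label and property names follow the model drawn earlier):

```cypher
// Uniqueness constraints also create the indexes that MERGE relies on.
CREATE CONSTRAINT ON (c:Complaint) ASSERT c.id IS UNIQUE;
CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (r:Response) ASSERT r.name IS UNIQUE;
```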
Comments in Cypher are double forward slashes, so I'll label this one "complaints, companies, and responses." USING PERIODIC COMMIT, LOAD CSV WITH HEADERS, WITH line and the date split AS date, and then I'm going to CREATE a Complaint node with an id property of line.`Complaint ID`. It's important to know that LOAD CSV treats everything as a string — literally everything. Even though this is clearly an integer, it will be imported as a string, so you need to make use of the toInt and toFloat functions in Cypher. Here I'll use toInt, because these are all integers: I'm creating a node with the label Complaint whose id is an integer version of the Complaint ID column — and again I'm using backticks because there's a space in the header.

So I've created the complaint node, and now I'm going to use the MERGE keyword, because I want to find-or-create a Company node with the name line.Company — no spaces there, so no backticks needed. I'm using MERGE because, as we saw earlier, companies show up more than once as we iterate through the complaints; people complain about the same company, of course. We don't want to create a new node each time a company is encountered; we just want to find it. On the first occurrence the Flagstar Bank node will be created, and on the next it will be found — that's what MERGE does: find or create.

I'm also going to uppercase the name. In fact, I'm going to uppercase everything, because I don't really know much about this CSV file; it could be that Flagstar Bank is capitalized differently elsewhere, and I don't want those to be treated as different companies. I want everything the same regardless of how it was capitalized. That's one of the trade-offs I talked about earlier: it would be trivial to uppercase everything in Excel or any programming language, but it's just as easy for me to use upper() within my import query, and that's what I've chosen to do. Of course you can clean this up yourself before the import and avoid using upper(), but you cannot avoid using toInt, because no matter what you do, LOAD CSV is going to read the value as a string.

So I merge on the company, and then I merge on the response node: line.`Company response`, uppercased too. So far we've created the complaint and merged the company and response; now we're going to create the relationships we talked about. We want the complaint to be AGAINST the company, and we want the response to be TO the complaint — that's the top part of the model: company, complaint, response. Recall that I also wanted to put the disputed and timely information as properties on this TO relationship, so let me put an identifier, r, on that relationship so I can set properties on it.

I'm going to set r.timely using a CASE statement, because in the Timely response? column we have the strings "Yes", "No", and blank, and the Consumer disputed? column has, I believe, "Yes", "No", and blank as well. I'm going to make the assumption that only "Yes" is true; blank or "No" will be false. So it's a simple CASE on the timely response column: when it is "Yes", set r.timely to true; otherwise set it to false. Similarly for r.disputed, another CASE statement on line.`Consumer disputed?`: when "Yes" then true, else false. And I think that should be good for my first iteration.

Another thing to keep in mind when using LOAD CSV is that you pretty much always want to use the shell. It gives you better error messages, especially if you have weird characters in your CSV file; it's pretty good at telling you which line has the problem, which has been helpful for me in the past. If you're on Mac or Linux it's pretty simple to start: you navigate to your Neo4j folder and run bin/neo4j-shell, and it should open up for you. If you're on Windows you can do it the hard way, or you can do it through the browser, which I think is easiest for most: click on the "i" icon, go down to the classic UI web admin, power through the pop-ups, and go to the old console; then you'll be able to talk to Neo4j through the shell. The only downside is that in my shell I can copy-paste a bunch of queries delimited by semicolons, whereas there you will have to run queries one at a time — but I think it's better than the current story for running the shell natively on Windows, which is a long process. Kind of a tangent, but if anyone is following along on Windows, just use that shell for now.

So let me go back. First of all, of course, I need to create these uniqueness constraints, because the golden rule with LOAD CSV is: don't use MERGE on a label/property pair unless an index exists on that label/property pair. Here I'm using MERGE on companies and responses, and you can check your currently existing indexes and constraints by running :schema in the browser. All the indexes I want are online, so that's good to go.

Then, if I have no syntax errors — oh my gosh, I almost forgot to set the date. I created the complaint node and I wanted to set the date properties on it. So under the CREATE I'm going to say complaint.year is an integer version of date[2] — it was the last one, remember; this is an American date, so it's all out of order — complaint.month is an integer version of the first element, and the day was in the middle. So I've created that complaint node and I'm setting the year, month, and day properties as integers; remember I split the date into a length-three collection up there. I'm just double-checking here, looking for anything terribly wrong, but usually it's pretty good at telling me what's wrong. And the second golden rule — do I have USING PERIODIC COMMIT? Yes, I do. So hopefully that works out.

This is the bulk of it, actually, so it will take a few seconds. While that's running, let's think about the next part. We just did company, complaint, and response; now say we want to add information about the products and issues. In this next part we're going to model the complaint being ABOUT a product and the complaint WITH an issue, so I'll go ahead and copy this part over.

The query finished in 42 seconds, and it created three hundred thousand-ish nodes and six hundred thousand-ish relationships, so that's good news for me. I'm going to copy this over, and I'm going to switch CREATE to MATCH, because for the second iteration the complaint nodes already exist in the database, right? We have complaint nodes now, and we have the companies and responses — I'm doing a bit of housekeeping here. We have the complaint AGAINST some company and the response TO that complaint, with the timely and disputed properties on it; this is a good way to go look and make sure your data made it into Neo4j. We have the company, and the year, month, and day are what I expected them to be, so this all looks good.
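For reference, the complete first-pass query just described might look roughly like this — a sketch reconstructed from the walkthrough; the file path, and the exact header names such as `Company response`, are assumptions from my copy of the file and may differ in yours:

```cypher
// First pass: complaints, companies, and responses.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/Consumer_Complaints.csv" AS line
WITH line, split(line.`Date received`, "/") AS date
CREATE (complaint:Complaint {id: toInt(line.`Complaint ID`)})
SET complaint.year  = toInt(date[2]),   // American m/d/y order
    complaint.month = toInt(date[0]),
    complaint.day   = toInt(date[1])
MERGE (company:Company {name: upper(line.Company)})
MERGE (response:Response {name: upper(line.`Company response`)})
CREATE (complaint)-[:AGAINST]->(company)
CREATE (response)-[r:TO]->(complaint)
SET r.timely   = CASE line.`Timely response?` WHEN 'Yes' THEN true ELSE false END,
    r.disputed = CASE line.`Consumer disputed?` WHEN 'Yes' THEN true ELSE false END;
```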
My point is that there's already data in the database now; the complaint nodes already exist, so now we want to MATCH on them, matching on the complaint ID. For this next part we want to create the product and issue nodes, and again, the products and issues are going to show up more than once, so we're going to use MERGE. As I'm typing the MERGE keyword, I'm thinking: okay, I'm using MERGE, so I definitely need to create an index on the property I'm about to type out. I want to find-or-create a Product node with this name — uppercased again — and I'm going to write this out now so I don't forget: create a constraint on Product nodes asserting that the name property is unique, so that MERGE has an okay time finding or creating those product nodes. Similarly for issues, assert that the name is unique, because I'm doing the same thing for Issue nodes: finding or creating an Issue node whose name property comes from whichever column we're on — the columns are Product and Issue, and I'm uppercasing both of them.

So I've matched on the complaint — gotten it out of the database — I'm finding or creating the product and issue nodes, and now I'm going to create the relationships: the complaint was ABOUT a product, and the complaint was WITH this issue. That's all I'm doing for this iteration; I just wanted to add on those two nodes, again to show how easy it is to iterate on your model. First I'll create these constraints — double-check my spelling — and, in case I want to be extra sure, I'll check my schema: Product and Issue are online with indexes. If you have a ton of stuff in your graph already and you add a constraint or index, it might take a couple of seconds to come online, so I always check to make sure it has.
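The second pass, as described, might be sketched like this (same placeholder file path; header names assumed from my copy of the file):

```cypher
// Constraints first, since we MERGE on these name properties:
CREATE CONSTRAINT ON (p:Product) ASSERT p.name IS UNIQUE;
CREATE CONSTRAINT ON (i:Issue) ASSERT i.name IS UNIQUE;

// The complaints already exist, so MATCH them instead of CREATE.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/Consumer_Complaints.csv" AS line
MATCH (complaint:Complaint {id: toInt(line.`Complaint ID`)})
MERGE (product:Product {name: upper(line.Product)})
MERGE (issue:Issue {name: upper(line.Issue)})
CREATE (complaint)-[:ABOUT]->(product)
CREATE (complaint)-[:WITH]->(issue);
```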
I don't need this part anymore — I don't need to split the date — so let me clean that up. Golden rules: do I have indexes on the properties of the nodes where I'm using MERGE? Looks like I do. Do I have USING PERIODIC COMMIT at the front of the query? I do. So, fingers crossed — let me double-check ABOUT, product, WITH, issue — this should be good to go.

Now, for this next part, while that runs, I need to show how I deal with empty strings. With an empty string, if you were to just naively go for it, you would end up pointing at nodes that have an empty string as a property, and you don't want that. And there's no sense of null property values in Neo4j — you can't have a node with a null value as a property. So I actually just want to skip those rows when modeling the sub-issues: when a complaint doesn't have a sub-issue, it's simply not going to have a relationship pointing to a sub-issue at all. In this next iteration, when I start adding sub-products and sub-issues, I want to skip the rows that don't have sub-issues and the rows that don't have sub-products, and I'm going to handle sub-products and sub-issues separately.

I already did this part, because I didn't want you to watch me do it for 45 seconds, and it's so easy I just did it in Excel: I turned on the filter, and from the drop-down I omitted blanks; then I copied the sub-product column, along with the Complaint ID column — so we can find which complaint it was — into a new CSV file. Then I cleared the filter and did the same thing for sub-issues: got rid of the blanks, then copied that column and the Complaint ID column, because again, that's how I'm going to find which complaint had this sub-issue. So now I have two new CSV files, one called sub_issues.csv and the other called sub_products.csv, and I'm going to go through both of them to model the sub-issues and sub-products.

I'll copy the query over; I now have a different file name, so I'll switch that. I'm going to do the sub-issues first, and I'll copy a few more things, because I need to match on the complaint that already exists in my database, and I'm also going to match on the issue that the complaint was WITH, since we just added that in the last query. Let's check on that, actually: it took 16 seconds, and only 88 nodes were created, but around 600,000 relationships — that's because there are only a handful of products and issues, so they're pretty dense nodes.

Right, so we just created the complaint WITH issue, and now I'm going to match on it, because I want to create the sub-issue IN_CATEGORY relationship. That's why in this next iteration I'm not only finding the complaint, I'm also finding the issue it was with, so that when I create the sub-issue node I can point it at the issue. The reason for that, like I said earlier, is a modeling thing: if I wanted to write a query like "find me all the sub-issues of this issue," I should just have to traverse a handful of relationships; I shouldn't have to scan through every complaint to get that information. But that's beyond the scope here.

So: I match complaints, I match the complaints' issues, and then I merge on the sub-issue node — and as I'm writing MERGE, I'm thinking: do I have a uniqueness constraint? No, I don't. So: create a constraint on SubIssue nodes asserting that the name is unique. Then I merge on that sub-issue node, because the sub-issue will of course show up more than once. The column is called Sub-issue, and I'm not entirely sure how Cypher would deal with that hyphen, so I'm going to use backticks just to be safe, and I'm uppercasing it. That will find-or-get the sub-issue node.

Then I want to create that the complaint was WITH that sub-issue — and I can use CREATE here, because every row is unique by complaint, so I know I won't be creating this relationship more than once. But on the next one I want to use MERGE, because you will encounter the same sub-issue and issue pair more than once in the CSV file, and MERGE will only create the relationship if it doesn't exist yet. So: I've matched on the complaint and the complaint's issue, I've merged in the sub-issue node, I'm creating that the complaint is WITH that sub-issue, and then I'm merging this category-hierarchy relationship that I want to keep track of.

Let me double-check everything: I have the index online now, I have USING PERIODIC COMMIT at the front, LOAD CSV WITH HEADERS FROM — hopefully the correct file path; I'm just checking for typos at this point — upper, line, Sub-issue, IN_CATEGORY, issue. That looks good to me, so let's run it. That took six seconds, so that's pretty good — but of course it's a subset. It looks like only 84,000-ish complaints had sub-issues, because that's roughly how many relationships were created — actually there will be a few more for the IN_CATEGORY relationship, but around 84,000 had sub-issues. So that was pretty quick.

Then I'm going to copy-paste some more. For the sub-products we're going to match the complaint again, of course, and we're going to match that the complaint is ABOUT some product, which we've already added to the graph, so we're finding it in our database.
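For reference, the sub-issue pass just described might look like this sketch (file name and header names are assumptions):

```cypher
CREATE CONSTRAINT ON (s:SubIssue) ASSERT s.name IS UNIQUE;

// Rows with blank sub-issues were filtered out into this smaller CSV beforehand.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/sub_issues.csv" AS line
MATCH (complaint:Complaint {id: toInt(line.`Complaint ID`)})
MATCH (complaint)-[:WITH]->(issue:Issue)
MERGE (subIssue:SubIssue {name: upper(line.`Sub-issue`)})
CREATE (complaint)-[:WITH]->(subIssue)
MERGE (subIssue)-[:IN_CATEGORY]->(issue);
```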
We're going to do pretty much the same exact thing here: merge the SubProduct node on its name property. And again — I'm not going to stop saying it — you need to create an index any time you use MERGE. So: create a constraint on SubProduct nodes asserting that the name is unique. Then I can use MERGE to find-or-create a SubProduct node whose name is the uppercased version of the Sub-product column from this CSV file. Then I create that the complaint is ABOUT that sub-product, and merge that the sub-product is IN_CATEGORY of its product, and that should tie it all together. This is pretty much the same as before, so I'll just get it going: add the constraint, double-check that I have the right file path, double-check that I don't have anything silly — upper, line, Sub-product, complaint, ABOUT — okay. This should be similarly quick, I think, and with that we have everything we initially drew out over here.

So again, there's a trade-off between doing your own manual cleaning in Excel or Python or whatever you use, versus letting Cypher do some of the work for you. It's whatever workflow you're comfortable with. I'm very comfortable with Cypher, so I let it uppercase things for me and split on forward slashes and things like that. But when you get into things that are a bit more complex, you'll find pretty soon that it may be easier — if you're not as comfortable with Cypher — to clean up or manipulate your CSV file as you see fit in a programming language of your choice, to avoid doing anything really sticky in Cypher. As you saw, it was pretty trivial for me to create those two new CSV files so I could avoid the empty strings that were in the original CSV file. It's just however you want to go about it, really.

So that should be done, and of course you want to ensure your data was uploaded as you expected.
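And the analogous sub-product pass might be sketched as (again, file name and header names assumed):

```cypher
CREATE CONSTRAINT ON (s:SubProduct) ASSERT s.name IS UNIQUE;

// Rows with blank sub-products were filtered out into this smaller CSV beforehand.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/sub_products.csv" AS line
MATCH (complaint:Complaint {id: toInt(line.`Complaint ID`)})
MATCH (complaint)-[:ABOUT]->(product:Product)
MERGE (subProduct:SubProduct {name: upper(line.`Sub-product`)})
CREATE (complaint)-[:ABOUT]->(subProduct)
MERGE (subProduct)-[:IN_CATEGORY]->(product);
```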
As a final check, I'm going to find how many complaints there are: it looks like 310,933 got uploaded. I was going to run another check, but that might blow up my computer, so never mind — and that is how many rows were in the CSV file, because I already looked.

Well, that's all I have. You can find the code for all of this, if you want to reproduce it, on my GitHub: go to github.com/nicolewhite/consumer_complaints and you'll be able to recreate this yourself if you want. Thank you.
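That final sanity check could be as simple as:

```cypher
// The number returned should match the row count of the original CSV.
MATCH (c:Complaint)
RETURN count(c) AS complaints;
```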
Info
Channel: Neo4j
Views: 54,778
Rating: 4.95 out of 5
Keywords: neo4j, graph databases, graphs, Database (Software Genre), NoSQL (Software Genre), Comma-Separated Values (File Format), data loading, ETL, connected data, load csv, graph algorithm, data import
Id: Eh_79goBRUk
Length: 43min 54sec (2634 seconds)
Published: Wed Mar 25 2015