Advanced Munging in H2O with Matt Dowle

Video Statistics and Information

Captions
Okay, great, we'll make a start then. I'm just going to pick one example from the munging capabilities of H2O to demonstrate, and that is ordered joins. So what is an H2O frame? It's columnar, and has been from the start, since H2O was first designed. It's highly compressed, it's in memory only, and one column of an H2O frame in Java can be greater than one node, so there isn't any restriction like that. It's parallel and it's distributed; there's a parallel, distributed data loader called h2o.importFile, and the APIs, as you know, are from Python, R, Java and REST.

R's data.table works in a different way to other databases and other packages: there's no hash table at all. To join, it orders the left join columns, then it orders the right join columns, and then it binary-merges the two sorted indexes. Sorting is really fast in data.table because it's a forwards radix sort. Once we've got a sorted index we can do advanced munging joins, which are ordered joins; they're possible to do in SQL but harder there. For example, we can join to the prevailing observation: give me the last observation for a particular stock or a particular person before three o'clock in the afternoon, provided it was within one hour of the time requested. That's something I needed when I was working as a quant in cash equity research at investment banks like Salomon Smith Barney at Citigroup, so that's the kind of logic I'm used to. The website for the data.table package is there on the link.

But there's a problem with data.table: it's single-threaded, it's single-node, and it's limited to two billion rows. So I joined H2O a year ago to parallelize and distribute that logic and to see how far we could take it. We've now created an open benchmark; the link is here, and all the code is fully reproducible. I recently hired Jan Gorecki, so he's working with me now on benchmarking.
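The join strategy just described (order the left key column, order the right key column, then binary-merge the two sorted indexes) can be sketched in a few lines of Python. Python's built-in sort stands in for the radix sort, and frames are reduced to key lists; this is an illustration of the technique, not the data.table or H2O code:

```python
def ordered_inner_join(left_keys, right_keys):
    """Inner join via the sort-then-binary-merge strategy.

    Returns (key, left_row, right_row) triples with keys in sorted
    order, as an ordered join produces. Duplicate keys on either side
    yield the full cross-product of matching rows.
    """
    # Build a sorted index over each side (data.table uses a forwards
    # radix sort here; Python's stable timsort stands in for it).
    li = sorted(range(len(left_keys)), key=left_keys.__getitem__)
    ri = sorted(range(len(right_keys)), key=right_keys.__getitem__)
    out, i, j = [], 0, 0
    while i < len(li) and j < len(ri):
        lk, rk = left_keys[li[i]], right_keys[ri[j]]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather the run of equal keys on each side, emit all pairs.
            i2 = i
            while i2 < len(li) and left_keys[li[i2]] == lk:
                i2 += 1
            j2 = j
            while j2 < len(ri) and right_keys[ri[j2]] == rk:
                j2 += 1
            for a in li[i:i2]:
                for b in ri[j:j2]:
                    out.append((lk, a, b))
            i, j = i2, j2
    return out
```

Because the underlying sort is stable, rows sharing a key come back in their original relative order, which is the "stable, suitable for time series" property mentioned later in the talk.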
We've benchmarked Spark and Impala and compared them to H2O, and we're looking for feedback from yourselves and from other software providers; pull requests are welcome. We've just started with one very tough test, which is a high-cardinality big join, which I'm going to demo, and I'll explain exactly what this test is.

So, the results on 1 billion rows on nine nodes: data.table is slow, as you'd imagine, because it's single-node and single-threaded, and Spark is next. What these blobs are is the time of the first run, the time of the second run and the time of the third run. We don't want to report the minimum of a set of runs, because the first run might cache the data and the minimum would then hide the time of the first run. Of course, when you're a data scientist using these databases in production or in research, you often only do the task once, so it's the time of the first run that is most often the one you care about. That's the red dot, and H2O is currently the fastest on this tough test, about 20 to 30 seconds, and we've used the latest versions of all the software.

So what exactly are we testing here? It's high cardinality, so it's not just the number of rows but what's in the rows. 500 stock tickers is what I would call low cardinality; millions of people, medium cardinality; billions of devices, high cardinality. You can also have high cardinality with combinations of low-cardinality columns, like ID and date, or ID, date and time, or more columns. So what does the data look like in this test? This is the 10 billion row example. We have two input tables, on the left and the right; they're both ten billion rows long with two columns, 200 gigabytes each, so each just fits on one node, but obviously both of them won't fit on one node, so we distribute them across nine nodes. The key column is up to a ten-digit integer, sampled from the uniform distribution between 1 and 10 billion, so it's high cardinality.
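The key-generation scheme just described can be imitated at toy scale to see why it yields mostly-unique keys with a few duplicates. A hedged Python sketch, with the sample size and key range scaled down by several orders of magnitude from the benchmark's 10 billion:

```python
import random

def sample_keys(n, key_range, seed):
    """Draw n keys uniformly from 1..key_range, mimicking (at toy
    scale) the benchmark's uniform key sampling."""
    rng = random.Random(seed)
    return [rng.randint(1, key_range) for _ in range(n)]

keys = sample_keys(1000, 10_000, seed=42)
unique = len(set(keys))
# With the key range 10x the sample size, most keys are unique but a
# few duplicates appear: the "high cardinality with a few duplicates"
# shape that makes the benchmark join many-to-one.
print(unique, 1000 - unique)
```

The seed, sizes and function name are illustrative choices, not the benchmark's actual generator.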
We have a few duplicates: those are the blue duplicates on the right there and the orange duplicates on the left, so we have a many-to-one or a one-to-many join here. It's just an inner join: if it's one-to-one the rows come back, and if there's no match nothing comes back in the result. So the result looks like this: we've got the orange key, and the value in the Y table is repeated once. It's a standard SQL inner join; we've done outer join as well, but I'm just displaying the inner join here. The H2O join is ordered, built into the algorithm, so it brings the keys back ordered in the result, and it's stable, so it keeps the original order of the rows maintained, which makes it suitable for time series.

The H2O commands to do this are pretty simple. I'm just showing the R package here, but the pandas/Python side is very similar. library(h2o): it's a CRAN package, it's open source, as you know. You initialize a connection to the H2O back end, which is much like connecting to a database. Then you import the left and the right table with h2o.importFile, which is parallel and distributed, and once the data is in memory we compute many, many statistics on it over and over again and work with it. The merge currently uses the standard R merge syntax, just with h2o. on the front and with method = "radix", and then we time the result.

To see this in action, I started it running just before. Here I'm connected to our 10 servers back in Mountain View; each of these boxes I'm clicking is one of the servers. Blue means a CPU that isn't active at the moment; you can see my pointer moving as well. All these lines, these red and green CPUs, are htop running in each of these windows, and we've got 10 of those running so that we can see the network traffic as the join starts. We should see the CPUs light up in green as it starts.
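The "prevailing observation within one hour" query mentioned earlier is exactly what an ordered index buys you: one binary search instead of a scan. A minimal, self-contained Python illustration (not H2O's implementation; the tick times and prices are made up):

```python
import bisect
from datetime import datetime, timedelta

def prevailing(times, values, asof, tolerance):
    """Return the last observation at or before `asof`, provided it
    falls within `tolerance` of the requested time; else None.
    `times` must be sorted ascending: the ordered index turns this
    rolling join into a single binary search."""
    i = bisect.bisect_right(times, asof) - 1
    if i >= 0 and asof - times[i] <= tolerance:
        return values[i]
    return None

# Hypothetical ticks for one stock during the afternoon:
ticks = [datetime(2016, 8, 11, 14, 0), datetime(2016, 8, 11, 14, 40)]
prices = [100.0, 101.5]
# Last trade on or before 15:00, provided it is within one hour:
print(prevailing(ticks, prices, datetime(2016, 8, 11, 15, 0),
                 timedelta(hours=1)))
```

At 15:00 the 14:40 tick qualifies; ask at 16:00 with the same one-hour window and nothing qualifies, so the join produces no match for that row.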
These columns are the memory usage on each of the boxes; I'm just using a simple command-line tool to monitor the memory there. So if we start the join going... that's lazy, so I have to print the result. And it's off, and it's in parallel, as you can see. If we connect to a node and have a look at the log file, we can see what it's doing as it runs. At the moment it's building the left index, and you'll notice it's using the network straight away. This is unlike other algorithms, which sort the pieces and then merge the pieces together; instead, we split up using the forwards radix, so we send the data early, using the network very early in the algorithm, and send those parts of the index to the right node. Imagine node one getting all the A's and node two getting all the B's. So it's now sorting all the A's within the first node and all the B's within the second node; it's finished those and is now on node nine.

So it's built the left index, and now we've got an ordered index which we can use later on: we can do faster grouping if we're grouping in the same order as the index, and because it's ordered we can do those advanced joins, like joining to the prevailing observation within a window. Now it's building the right index for the right table, doing the same thing again, using the network and sending the data around, and now it's starting to sort those ordered pieces on each node. Come on, nearly finished. This is what I spend my days doing: watching these blue and green bars, making it all go green and making all the black go white, and getting frustrated that this one's been left behind. Can we optimize it more so that it looks more like this? So now it's got the left and the right index on each of the nodes, and it's joining locally the A's in the left table with the A's in the right table, which is again a fairly straightforward task on each node.
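The "node one gets all the A's, node two gets all the B's" step can be mimicked in miniature: route each key to the node that owns its range first (this is the early network traffic in the log), sort locally on each node, and the concatenation is already globally ordered with no merge phase. A single-process Python sketch where the "nodes" are just lists:

```python
def distributed_sort(keys, n_nodes, key_max):
    """Sort keys in [0, key_max] by a forwards (most-significant-first)
    radix split across n_nodes, then a local sort per node."""
    # Phase 1: split early. Each key is routed to the node owning its
    # range, so node 0 gets the smallest keys, the last node the largest.
    # On the real cluster this is where the network lights up first.
    width = (key_max // n_nodes) + 1
    nodes = [[] for _ in range(n_nodes)]
    for k in keys:
        nodes[k // width].append(k)
    # Phase 2: each node sorts its own range independently (in parallel
    # on the real cluster). Concatenating the node-local results is
    # already globally sorted, with no merge step needed.
    for bucket in nodes:
        bucket.sort()
    return [k for bucket in nodes for k in bucket]
```

This contrasts with sort-then-merge designs, which keep the network idle until the end; the split-first approach overlaps network and CPU, which is the behaviour visible in the demo.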
As soon as it's finished each piece, it will start grabbing the data from the other nodes. At the moment it's not really using all the CPUs we can see on all the other machines, but as soon as the first piece completes it'll start to communicate with the others, and it should start to fill in and be all green. There we go: we've got the first node communicating, so we're trying to use the network throughout the algorithm. This is going to sit here for another few minutes; the whole join is going to take about six and a half minutes. If we compare to the earlier comparison with Spark and Impala, where I showed H2O taking 20 or 30 seconds, that was for 1 billion rows, so at ten times the data it's a little bit more than linear scaling, but it's not too far off. Certainly 6 minutes is quite an achievement, I think, for H2O. It's using the compressed column storage and compressed data transfer between the nodes, and it's utilizing the network and the CPUs at the same time. So, any questions as we watch this thing run? There's no Spark in here at all, by the way; this is just one JVM process running on each of the nodes. Any more questions?

[Audience question] No, it's all thoroughly random. In fact, the first time we did this, the ten billion row join came back with fifty billion rows, and I scratched my head for several months thinking there was a problem in the join, in the way the many-to-many was working. It turned out that the randomness wasn't good enough and we didn't have enough unique keys, so the Cartesian product was coming back, with the duplicates joining to the duplicates. Now we use the PCG random number generator rather than the Mersenne Twister, and that generates enough unique keys and the right result comes back.

[Audience question] The first step is just sending around the key columns; it's not sending around all the data. It compresses together the combination of the columns in the key, so although I'm just showing one column here (it's just a simple test), it does work for three, four or five columns.
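The fifty-billion-row surprise follows directly from inner-join semantics: a key duplicated m times on the left and n times on the right contributes m times n result rows, so too few unique keys makes the result balloon. A tiny plain-Python check of that arithmetic (not the H2O code):

```python
from collections import Counter

def inner_join_size(left_keys, right_keys):
    """Number of rows an inner join returns: for each key, the count
    on the left times the count on the right. Duplicates on both
    sides multiply, which is how a 10-billion-row join can come back
    with 50 billion rows when the generator hands out too few
    unique keys."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    return sum(n * rc[k] for k, n in lc.items())

print(inner_join_size([1, 2, 2, 3], [2, 2, 2, 4]))  # key 2: 2 x 3 = 6 rows
```

Keys 1, 3 and 4 appear on only one side, so only key 2 contributes; with better randomness almost every key matches at most once and the result stays near the input size.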
It sends around the key with the original row location, which is not that much data; then once it has done the join on the key and the row locations, it goes and fetches the data. So it's a single fetch, and it's all queued internally inside H2O: there's a pool, and the threads are allocated the next piece, so it doesn't naively just go and start up thousands of threads. Other questions?

[Audience question] What are the numbers in the middle? Oh, it's purple. I think Tom, who you saw presenting earlier, wrote it, or at least strongly modified it; there's a repo on GitHub with Tom K's name on it. I think it came from Azure originally, because Azure was one of the first companies to have over a hundred cores, so they had this very, very wide bar, and then they made it into rows. It's quite good: it's an average of the CPU over some number of milliseconds, so it kind of pulses, but you have to imagine that underneath the CPUs are really bursting. It's better than htop because you can see these 320 cores in the middle. You can't see them anymore because it finished.

[Audience question] You can do multiple columns in a key, yes. And if one side has three columns, you can join to, say, two columns on the other side; you don't have to use all the columns in the key.

So, I'm looking at the log file, and the log file hasn't told me it's completed... if I go over here, then it has completed, and it's returned just under 10 billion rows. We have a look at the head, which comes back instantly, and you can see the smallest keys. This was random data between minus five billion and plus five billion, so the minus five billions have come back; we need to change the formatting so that it prints out as a bigint rather than rounding a numeric. That's the minimum key. Any more questions? Otherwise, that's all I had.
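Joining on a subset of the key columns, as just described, can be sketched as follows. Note the caveat: this uses a plain dictionary index for brevity, whereas H2O's join is radix-sort based with no hash table, so treat it purely as an illustration of the semantics; the column names are made up:

```python
def join_on_subset(left_rows, right_rows, left_key, right_key):
    """Equi-join where each side names its own key columns. The right
    side may carry extra columns (say id, date and time) while the
    join uses only the columns named in right_key (say id and date)."""
    # Index the right side by its key tuple. (Illustration only:
    # a dict, not the radix-based index the talk describes.)
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[c] for c in right_key), []).append(row)
    # Probe with each left row's key tuple and emit matched pairs.
    out = []
    for row in left_rows:
        for match in index.get(tuple(row[c] for c in left_key), []):
            out.append((row, match))
    return out
```

Here the key is a tuple of column values, which echoes the talk's point that the combination of key columns is compressed together and shipped as one unit.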
Info
Channel: H2O.ai
Views: 1,943
Rating: 4.75 out of 5
Keywords: H2O Open Tour NYC
Id: 5X7h1rZGVs0
Length: 16min 46sec (1006 seconds)
Published: Thu Aug 11 2016