Talk by Matt Dowle, Main Author of the data.table package in R

Captions
Thank you. I don't have any slides prepared; I'm going to drive this from the data.table home page. That's r-datatable.com, and it takes you to the home page here. Let me make that a bit bigger. How's that? You can see it at the back, I hope. I'm going to give you a tour of the website: how to discover the package, how to see its history, and how to read it.

The first thing is that data.table was on US TV, so let's click this. It takes us to the PBS NewsHour piece. It starts with a review written by David Smith at Microsoft, a really nice review that mentions lots of R packages being used by the city of Chicago, and one of them is data.table, and it links to the PBS NewsHour site. This is just playing; thankfully it's working over Wi-Fi. So, the city of Chicago is concerned about restaurant food safety. Their officials go and inspect restaurants, and the way they did that in the past was on a rigid schedule, visiting each restaurant once a year. Now what they do is follow Twitter: they look for Twitter reports where people have left a restaurant and are ill the next day. If two or three people have reported being ill and mentioned a particular restaurant, the city of Chicago picks that up and sends its inspectors in sooner than it otherwise would. They're using R to do that, with several packages, and that is what the article was about. I won't play all of it, but at about six minutes forty-five they go into the data science; they're looking at the data. It's all open source, it's on GitHub, this is their website, and they've open-sourced the data, which I think is great. A few seconds later there's this GUI to view the city of Chicago and all these restaurants, and then there's the data.table code. So: on US TV.

Let's look in a little more detail at that syntax, because it's a rare example of real data.table code out in the wild. This person managed to find out how it works; they read the documentation, which does exist. It's not that hard. The first thing to notice is what isn't there: there's no SQL here, just a few lines of R code, one line per query. There's no SELECT or WHERE or GROUP BY, because those appear over and over again. I've been in organizations where there's a million lines of SQL code; there's just so much code. Yes, SQL is a bit easier to learn to start with, but once you've got used to data.table and you're using it day in, day out, it's a real pain to see SELECT and GROUP BY and WHERE over and over again.

You do need to know a few things. First, the query goes inside square brackets. So here we have `dat[...]` with a variable inside the brackets: `dat` is a data.table, and that variable could be row numbers, which is the simpler case. It looks up those row numbers in the `dat` table; if the variable contains, say, 3 and 4 in a vector, it looks up rows 3 and 4. Or it might be another data.table, in which case it's doing a join: it looks up the rows in `dat` using the rows in that variable. There's no left and right; I always get confused between left and right in SQL or SQL-like implementations, and I prefer to think of it as "look up these rows in `dat`" (hopefully you can see my pointer moving), and that's the way round it is. Given that lookup, it then runs an expression. This is just an R expression; it can use any R package, and there are 10,000 packages you can call in `j`. This is a column name, `inspection_date`; the people who wrote this gave it a nice name so we can guess what it is. And `range` is just the minimum to the maximum, which is returned. So it's the range of this column, for this subset of rows in `dat`.
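The bracket form being described can be sketched with a tiny made-up table (illustrative data and variable names, not the Chicago code itself):

```r
library(data.table)

# Made-up stand-in for the inspections table discussed in the talk
dat <- data.table(id = c("a", "b", "c", "d"),
                  inspection_date = as.Date(c("2015-01-02", "2015-03-15",
                                              "2015-06-30", "2015-09-01")))

# i as row numbers: look up rows 3 and 4
rows <- c(3L, 4L)
dat[rows]

# j as an ordinary R expression on a column: min and max of the subset
dat[rows, range(inspection_date)]

# i as another data.table: a join, looking up rows of dat by these ids
lookup <- data.table(id = c("b", "d"))
dat[lookup, on = "id"]
```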
Then `.N`: that's just the count. In SQL you have to pick a column to count, or do a COUNT(*); why do we have to select something just to count? `.N` is built in: it's simply the count of this subset. It doesn't select all the rows and then count them; it does the join, or the subset, and just counts, and that's optimized within the square brackets. These results are being output to the console, and the console output is presumably being used in a particular way; if you want to save the output you just assign it with `<-` or `=`.

Let's scroll down a bit more for a few more pieces of data.table in the wild. `:=`: that just updates a column by reference. If the name on the left is a column that already exists, this is an update, and it updates that column. There's no WHERE clause here, just a comma, so it updates the whole column. If the column doesn't exist, it's added as a new column, by reference, to the existing table, which you can't really do in SQL very well, because you have to drop the table and recreate it and all that. That's one reason we're using R and Python: we can add columns by reference very easily. Then `keyby`: that's just like `by`, but it keys the result afterwards, in one step. In `by` you can have a comma-separated list of expressions, any R expression on any column, on the fly, and your `by` columns always come back in the result. So unlike SQL, where you have to re-select the columns you're grouping by to get them in your answer, the grouping columns are put into your result automatically.

OK, that's a quick review of what I mean by data.table being seen on US TV. Back to the home page. Over on the right... oh, we can't quite get to the right, because there's no scroll bar in Chrome, but I can just about see it.
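A minimal sketch of `.N`, `:=` and `keyby` on made-up data:

```r
library(data.table)

dt <- data.table(grp = c("x", "x", "y"), score = c(1, 2, 3))

# .N is the built-in row count for the current subset or group
dt[grp == "x", .N]              # counts just the subset, no column needed

# := updates (or adds) a column by reference; no copy of the table is made
dt[, score2 := score * 10]      # adds a new column in place
dt[grp == "y", score := 99]     # updates only the matching rows

# keyby groups and then keys the result in one step;
# the grouping columns come back in the result automatically
res <- dt[, .(n = .N, total = sum(score)), keyby = grp]
res
```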
So, "Articles": that's the second link on the sidebar on the right. Every time there's an article that either mentions or is solely about data.table, by anybody we're aware of, we put it on this page, whether it's positive or negative. And this is a wiki. I like this website; it looks a bit messy, and it's not as pretty as other home pages, but it is a wiki, which means anybody can add their own articles here. There are no access controls on any part of the site I'm showing you. The most recent one is from yesterday, from Markus Gesmann, a good friend of mine as it happens, in London, who was using data.table at Lloyd's of London; he has some great packages and his own blog on R. And it goes back: we can scroll down through all these other people writing about data.table, back to 2011, six years ago. You can scroll through the titles to get a feel for what it's being used for; recently somebody put out a series of "data.table by example" posts, and these go onto R-bloggers and so on.

There's just one I wanted to focus on, by David Robinson, who does a great service for the R community. He works at Stack Overflow, I think, and he analyses which packages people are using. He wrote this article on the impressive recent growth of R, because he'd first written one on how pandas had been growing, and people in the R community said, well, what about R? So he did this one afterwards. Pandas is growing more, and everyone loves that more, but R is growing too. Anyway, there are lots of statistics in it, and the one chart I wanted to point out is this: the most-mentioned R packages on Stack Overflow within the r tag. dplyr is the most mentioned, then ggplot2, and data.table is third, with a big gap to the next package, which is shiny.

Now, in the past, people in the tidyverse and RStudio communities used to say data.table was at the top because it was harder to use, harder to learn, more tricky, and all that. Now that dplyr is at the top they've had to change their story, and they say it's actually a measure of usage. So there we go. Then I was disappointed to see one tweet saying, oh, but all the data.table questions are old questions. There's quite a lot of tribalism in these communities, and it seems to be driven by the odd comment here and there, and jokes, which is a shame; it's not always driven on merits.

So let's look at those Stack Overflow questions. We go back to the home page, r-datatable.com, and scroll down a little; let me make it smaller so I can see the toolbar on the right. There's the data.table tag; we click that. There are 6,400 questions in the data.table tag, but that's fewer than David Robinson found with his 10,000-odd, because he looked at data.table mentions on Stack Overflow. The query for a tag is to put square brackets around it: `[data.table]`. It's really important to know how to search Stack Overflow when you're looking for answers, because the tags, and how questions have been tagged, matter a lot. If you search the r tag for "data.table" without the square brackets, and add `-[data.table]` to exclude the tag, so it's not tagged data.table but it is a question, you get about 2,600 results, and we can scroll through and see what people are asking. I don't know why they don't tag it, but if you look at the number of answers, I think you get far fewer answers when you don't tag appropriately. And then if you change `is:question` to `is:answer`, you get
12,000 results, which are answers containing the string "data.table". You need to look through and check whether each really is data.table, but if you do, you'll see most of them really are; you can see a `library(data.table)` call around them and so on. It takes a long time to work out how to search properly.

Going back to the tag: what's the latest active question? Fourteen minutes ago. I didn't set this up; it's that active. There was another one two hours ago, with three answers. So it's very, very active, lots of people answering, and it's this community of answerers that has really made the package great. Let's look at the top users instead. In the last seven days: not too bad. Of all time, Arun, my co-author on the project, who I'll talk more about later, has answered the most, and myself as well; I was known for supporting my own package, though in the last few years I don't spend so much time on Stack Overflow. And then other people as well: people like eddi, Frank, David Arenburg, Roland. They're all really nice people, all hanging out here; they enjoy answering questions, and along the way they learn how to use the package better themselves. It's this community that really drives the open-source project.

If we go back to the tag and rank by votes, what's the most-voted question in this tag? It's "How does data.table compare to dplyr?". There's a great answer by Arun, the accepted answer, and the question has 64,000 views. It compares them along several dimensions: speed, memory usage, and a little syntax. Square brackets again, easy to read: from the data.table `DT` you select all the rows where the `x` column is greater than or equal to one, and you update, by reference, the `y` column to be `NA`. Very short, and very fast, because it's by reference; there's no copying. The other thing about it, a difference from base R, is that the `NA` on the right-hand side is of logical type, but it won't coerce the whole column to logical: if the column contains noughts and ones it leaves the column as its original type. There are other convenience features like that.

I'll keep going for a few more minutes, but hopefully we'll leave time for questions at the end. The other question here, by votes, is one I really hate: "Why are pandas merges in Python faster than data.table merges in R?". The reason I hate it is that once you've seen that question, you believe it, don't you? We all think it must be true; someone's written it on the internet, and no matter what I say, you're still going to believe it, aren't you? Wes stood up at a conference years ago, which I knew nothing about beforehand, and presented this benchmark, and you still see it; it hasn't been updated, and it makes it look like pandas is faster. I was working at Winton Capital at the time, and somebody came up to me in the office and said, why is data.table so bad, why is it so slow? And I'm trying to do my day job, working on data.table in the evenings; I wrote the original fread on the train, sitting on the floor of the London Tube, while trying to keep my wife and kids happy at the same time. And then I get this. So I really can't stand this question. We did our best: I answered it, and I have the accepted answer, though Wes has more votes; the Python community loves that one more. And Szilard in LA, that's great, because he has actually compared all the different packages nicely.
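The snippet being discussed can be tried directly; a small sketch with made-up data showing the column type being preserved:

```r
library(data.table)

DT <- data.table(x = c(0L, 1L, 2L, 3L), y = c(0L, 1L, 0L, 1L))

# Update by reference: set y to NA where x >= 1, without copying the table
DT[x >= 1, y := NA]

# data.table special-cases NA on the right-hand side: the column keeps its
# original integer type rather than being coerced to logical
class(DT$y)
```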
We need more of this in the open-source communities: independent comparisons of the different packages. This one doesn't compare along enough dimensions; it's just one aggregation and one join and so on. But we need to be able to talk about comparisons in a balanced and varied way; the trouble is that producing them takes so long. Anyway, in this particular benchmark of his, a join with data.table was much faster than the others; it's competitive even compared to the MPP databases.

OK, back to the home page: the next thing is the videos and slides page. Every time we present, including this one, the video goes up and we add it to this page, so you can go back through the history and see all the videos and slides. There are a few back in 2012 and 2013 that haven't been filled in, but since about 2014 we've been quite good at updating it. One I want to show, from July 2015, was just after I joined H2O (H2O is who's behind data.table; I'll come to that link). I went to the useR! conference in Denmark and gave this presentation on data.table's sorting, and proposed that it be promoted to base R. So it's H2O who paid for me to travel, and for the time to do this and propose it: proposing to promote it into base R itself. This was the presentation; you can see it for yourself. I mentioned Tom Short, who worked with me on data.table in the early days and first told me about radix sorting in R. We couldn't beat R's radix sort where it applied, for a small range, but for a large range we improved it from 60 seconds down to 1.5 seconds on that example, and proposed that be promoted to base R. So, 22 minutes down to 2 minutes on a single run: those are the sort of benchmarks we like to see. We don't like to see 500 repetitions of something that takes 50 milliseconds; a single run that takes more than a minute, with two or three repeats, is what we're interested in speeding up.

That was 2015. If we go to R's NEWS file and search for my surname, there are a couple of entries I didn't even remember, like one about registered routines; I looked at it yesterday and have no idea what it is. But this one was in R 3.3.0: the radix sort algorithm from data.table was promoted into base R. So if you're using R, or any R package that sorts, you're now using code that Arun and I wrote in data.table. It took some time to get it into base R, fixing the bugs and working with Michael Lawrence at Genentech, who was great, very helpful, and had lots of ideas for improvements. That's work I could spend time on thanks to H2O.

Back to the home page and the presentations, continuing on: 2016-07 was a proposal, but that was still single-threaded, better sorting in base R, which is already a good improvement; we can do better. I presented at Stanford, at the DSC after last year's useR!, about parallel sorting. This is in the data.table package on CRAN: you can do it yourself. If you install it and run `fsort`, you can get this performance; this is a parallel sort. Since then I've gone through all the nuances of sorting; it's not just sorting, it's ordering, stability of the sort, high cardinality, missingness, types, all that complexity. We scaled it up to test it out, and, again, this is on CRAN, so you can try it yourself: random doubles, `runif`, 100 billion of them. It works on vectors longer than two billion items, because R now has long vectors. Using an EC2 X1 instance with 2 terabytes of RAM, you can allocate a vector in R of 800 GB, which is 100 billion random doubles, and then we compare it to Intel's Threading Building Blocks.
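Both pieces can be tried from an ordinary R session, on a far smaller vector than the talk's 100-billion-element benchmark:

```r
library(data.table)

set.seed(1)
x <- runif(1e6)   # a small stand-in for the huge benchmark vectors

# Since R 3.3.0, base R's sort can use the radix method contributed
# from data.table (R also chooses it automatically in many cases)
y1 <- sort(x, method = "radix")

# fsort is data.table's parallel (OpenMP) most-significant-byte radix
# sort for double vectors, as presented at the DSC
y2 <- fsort(x)

identical(y1, y2)
```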
We're about four times faster on this measure. Now, I hesitate a bit, because I complained earlier in the presentation about Wes putting up a benchmark against data.table, and now I'm doing exactly the same thing with Intel. It's very hard, because doing it properly would take a long time; that's partly why I haven't done slides, so that you have to listen to this presentation. The important caveat is that this doesn't cope with skew: if you include skew in the input, the picture reverses and Intel's is much faster. The Intel sort is more general-purpose and currently works for more use cases out in the world. But I wanted to show it anyway, to show what we're working on and how we're thinking. And the X1 on EC2 really has changed the equation on RAM: 2 terabytes in one machine, and with it the question of whether we go for lots of distributed machines working together or one machine with lots of RAM. The other reason I wanted to show this chart is that it is released, it is on CRAN, and you can run it yourself; it's not in dev. It's been there for over a year, and I never got to finish it off, which is a shame. Again, it's a single run taking a real amount of time, not milliseconds being sped up: 40 minutes down to 10 minutes, so you save half an hour on a single run. And of course the reason we care about sorting is that sorting and ordering are how the indexes in data.table work; we don't use hashing at all. Our `unique` is faster because we're not hashing, for high cardinality.

Scrolling a bit further up the history, more recently, is parallel fread. In April I announced that fread was parallel in the dev version and asked people to use it, and they really have, to my surprise, so it's had a lot of really thorough testing. It's still not released to CRAN, but we do keep a status page, which I'll show you. Going through the benchmark there: it shows you how to get the package, which is very easy to download, and it's multi-core, here on a machine with over a hundred CPUs working away. People on Twitter have been posting real-world examples; to pick out one: 2 minutes 42 seconds down to 24 seconds. And it's not new: H2O first did this three years ago, Spark has had a parallel file reader for two years, and there's ParaText in Python as well, so I'm careful to acknowledge the prior art. It's not just about speed but convenience: if you have 12,000 columns, as this person has, you don't want to be setting column types for 12,000 columns, or discovering that one of those 12,000 changes type 90% of the way through the file. So it jumps into the file: the column-type guessing uses a hundred jump points and tests a hundred rows at each jump point, and to guess the number of rows in the file it uses the mean and standard deviation of the sample to get a very good estimate of the output rows. So it doesn't have to reallocate, or copy all the little bits into a new object; it's memory-efficient as well as fast. Sometimes it's not just speed but whether it works at all: say you've got an 8 GB file and 16 GB of RAM. If your file reader copies the data once, it's not going to work, but it will work here because it's not copying at all. There's a lot of work that's gone in there, and the presentations are all on the page.

How are we doing for time? OK. In terms of the setup of the project: there's code coverage; it's about 280,000 downloads a month; there's the Depsy rank, which is a great website that goes through papers referencing data.table and pulls the citations together. And if we look at recent activity: open pull requests, five at the moment.
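A small illustration of the reader being described (the file is generated on the fly; `verbose = TRUE` would print the sampling details):

```r
library(data.table)

# Write a sample CSV, then read it back with fread.
# fread samples rows at many jump points in the file to guess column types
# and to estimate the row count, so the result can be allocated once.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1e6, val = rnorm(1e6)), tmp)

DT <- fread(tmp)
dim(DT)
```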
There's one I wanted to focus on, which is this "stack imbalance on Windows" bug; I've basically been tearing my hair out for two weeks on this one bug fix. The project doesn't need many people to run it, because it's all automated with AppVeyor and Travis CI: this is a branch, and on every push the whole test suite runs automatically on Travis and AppVeyor. It's only failing here because the code coverage doesn't work, and that's because the nature of this bug is in the progress meter for large files, and the fix isn't exercised: the progress meter doesn't run in the tests, because they're all kept small so they run in a reasonable time. So that's all quite nice. I tried lots of things, all very detailed. Changing the C standard, compiling with C99 instead of gnu99, revealed I was using `alloca`, so I removed it; the message was "stack imbalance", so I thought it might be `alloca`, that being the stack, obviously. But it was nothing to do with that; nothing to do with the `\r` from the progress meter on stderr; none of those things. Until I finally worked out that the progress meter was printing at the same time as some threads were adding to R's global character cache, and when the global character cache already had those strings, there was a thread race. It had to be a fresh run in a new session of RStudio for RStudio to crash, but it was nothing to do with RStudio in the end, even though users could only reproduce it in RStudio. A lot of the time goes into working out where exactly the problem is.

So let's look at who's behind it. We go to Insights, then Pulse (it would be nice if GitHub's Pulse showed more than the last month; please, GitHub). If we look at the chart: it's quite an old package. 2008 is when it was re-released to CRAN, and it has mainly been me and Arun working on it, up until maybe a year or two ago; for example the sort, which made it into R and which I think is probably the package's best achievement so far. Recently (let me make this smaller; if we look at this range, yeah, that'll do), in the last year it's myself, an H2O employee; Pasha, an H2O employee; and Jan Gorecki, who was an H2O employee and decided to take a sabbatical. He's really good; we want him back. Please, Jan. He's not working for anybody, just travelling the world at the moment, and we'd really like him back. So three of the top four contributors in the last year have been funded by H2O. And Michael Chirico is a user of data.table who used it as part of a team to win a competition; they won $140,000, shared within the team, and I think they may have jointly shared it with another team as joint winners. It's not a Kaggle competition; it was run by the NIJ, and they haven't written it up yet because, being the winners, they haven't had time. They're mentioned on the winners page as the Kernel Glitches team.

So I think the project is written by you. I was a user: I was using R, and I built the project to do my own work in banking. Arun was a user, using it for his own work in genomics. We're all users, and we all contributed because we wanted to use it ourselves, unlike some other software projects, which are produced by software developers for other people to use. I think that's part of the success of the package: we know what we want, because we want to use it ourselves.

So what are Pasha and I doing this year? I'm probably way over time; how are we doing? OK, questions soon, but I want to show one thing. We've separated fread and fwrite out, to make them agnostic and usable from Python, so I'll finish with a demo of the Python datatable, which I
know some of you are looking forward to. This is a port of some of the data.table techniques to Python. It's out-of-memory: this is a 16 GB laptop loading 32 GB of data. The shape is 157 million rows by 36 columns. We didn't have to fread it in, or feather it in; it's just memory-mapped, and we're looking at it instantly, scrolling around it. This is a live recording of actually using it from a cold Python session. Then, to run a filter: the syntax is similar to data.table in that you put your query in the first argument, but there's no lazy evaluation in Python, so you pass a lambda. That filter is compiled from the Python down to LLVM, and it just ran on the out-of-memory data. This is largely done by Pasha and by Nishant, two H2O employees. You can do combined queries, and that looks quite simple, doesn't it? The syntax is very simple: you add the comma, then which columns you want on that query, and it comes straight back. It uses the operating system's memory-mapped files, which saves lots and lots of coding, because once a memory-mapped column is in memory it's simply faster the next time, automatically; the operating system does that for you. The idea is that it will just be single-machine, but RAM is getting bigger, of course (we've got the X1 with 2 terabytes of RAM), and the syntax people seem to be asking for is data.table-like syntax, in a Python environment. That was all I had. Questions? [Applause]

Oh, I'm seeing somebody type on my screen; what's happening? Oh, that's not right. "I see a crossover of your data.table work and H2O. Will you continue to develop data.table? Will data.table syntax be emulated in..." Well, I'm dependent on you, on people out there, to speak to Sri, and I do what Sri tells me to. I don't think people realize how much H2O is investing in the data.table ethos, which is why I wanted to show just how much work there is at a detailed level. We need you to either pay for it, or find a way to fund H2O to continue, and we'll do what you want us to do.

"Can you define skew in the context of sorting?" Right. It's a forwards radix sort, most significant byte first. For the first significant byte you count the bins for the 256 possible values of that byte. If the distribution over that first byte is flat and uniform, then fsort works very well, which is the case for uniform random doubles. But as soon as you get, say, 90% of the count in one value, that's skew, and the parallelism doesn't descend into the second level to deal with it; just one thread ends up doing the biggest bin from that first significant byte. So it depends on the layout of the data.

"Why is pandas so popular when its syntax is so clunky and non-intuitive?" I don't know, but thank you for your opinion that it's clunky and counterintuitive; I haven't really used it very much.

"Does the new fread require a parallel back-end?" No, it just uses OpenMP. It works on Windows, Mac and Linux, out of the box, with OpenMP, which is what a lot of people out there have already been using.

OK, thank you. [Applause]
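The skew behaviour described in the answer above can be tried on any multi-core machine; a small sketch with made-up sizes:

```r
library(data.table)

set.seed(42)
uniform <- runif(1e6)                    # flat across the most significant byte
skewed  <- c(rep(0.5, 9e5), runif(1e5))  # ~90% of values land in one MSB bin

# With a flat first byte the 256 bins are of similar size and parallelise well;
# with skew, a single thread ends up sorting the one biggest bin
system.time(fsort(uniform))
system.time(fsort(skewed))
```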
Info
Channel: H2O.ai
Views: 3,279
Rating: 4.757576 out of 5
Keywords: data science, machine learning, data.table, H2O World 2017, Product
Id: GHrebwrqZ-c
Length: 38min 26sec (2306 seconds)
Published: Mon Dec 11 2017