Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023

Captions
So we are going to be talking about pandas 2, Polars, and DuckDB, which is a question I'm getting a lot. If you want to follow along, there aren't slides per se; everything I'm going to show is in a notebook. The link I shared in the resources goes to a GitHub page — let me bump up the font — and this is the notebook I'm presenting from today, in my GitHub. One of the cool things about GitHub is that if you click the green Code button, there's a Codespaces tab, and you can hit plus to create a codespace. That fires off a virtual machine on GitHub's infrastructure and launches VS Code, and here is the notebook. Don't click on the notebook right away if you launched the codespace — give it a moment, because it's installing all the dependencies first. You can see it installing the requirements; that takes a moment, and when it's done you can click the notebook and it will open everything in Jupyter, and you should be able to follow along from there without installing anything on your end. Alternatively, this is a Python conference and I assume a lot of you are familiar with installing packages: there's a requirements file, and if you pip install from it you should be able to follow along locally as well.

Awesome — we've got people from Italy, Greece, the UK, Colorado. Oh hi, Mina, you're right around the corner from me. And Dorset, United Kingdom — awesome. Like I said, I'm excited to be here and excited to answer questions or concerns you might have. I'll try to monitor the chat every now and then, so if you have a question, feel free to use the chat and let me answer it.

Okay, back to GitHub. This has finished installing, so I'll double-click so you can see how this would work if you're running it at home and want to follow along. Here's the notebook for today; we don't need to run the pip install cell, but we can run this next one, and it looks like that ran, so we're good to go. I'm actually not going to use this — I'm going to run locally on my machine — but it's awesome that you can run it that way if you want to.

So again, I'm going to be talking about pandas 2, Polars, and DuckDB. A little bit about me: a lot of you might know me as the author of this book, Effective Pandas. To those who have read it, thank you, I appreciate your support. Just FYI, Effective Pandas 2 is basically done; it's waiting for a final once-over by a pandas core developer, and I'm actually in the middle of Effective Polars, so that should be fun.

Why am I talking about this? When I graduated from college I started working with Python doing data science, back in 1999. We were doing natural language processing — this was before things like spaCy and NLTK existed. In 2006 I built a Python OLAP engine at a company, and later I heard about pandas, and a lot of what pandas was doing was very similar to what I had been doing.
So I started using pandas professionally and started writing books about it. I wrote Learning the Pandas Library in 2016, did some work with Spark in 2019, wrote the second edition of the Pandas Cookbook in 2020, and wrote Effective Pandas in 2021. By the time I wrote Effective Pandas I had been teaching and consulting with pandas for a good chunk of time, and I had some pretty strong opinions about what pandas usage should look like; if you've read the book you're probably familiar with a lot of those, and I'll go over some of them today. Last year I did quite a bit with cuDF — if you're not familiar with it, cuDF is an NVIDIA tool, probably the fastest pandas implementation, probably the fastest DataFrame implementation, period; it basically implements the pandas API on top of GPUs. I did a lot with Modin, was an adviser for a company called Ponder, which took the Modin technology and made it so you could run your pandas code on top of Snowflake, and did some work with Polars as well. So that's my background. Most of my time these days is spent doing corporate training and consulting. I like to joke that I sell snake oil and teach people to tell lies with data, but really I go into big companies, help them learn Python, and then help them use the Python data stack, so to speak: a lot of pandas, a lot of visualization, things like XGBoost.

Okay. pandas 2, for those who aren't aware, came out a while ago. If you're using pandas, say so in the chat so I know who has familiarity with it — and maybe say what domain you use it in, and, if you're so motivated, your favorite feature of pandas and the thing you dislike most about it.

I'm sure most people here know that Python is a slow language. We're at the slow-language conference, which is kind of weird, because Python is also the language most used for data science — which is all about processing data. So what's the deal? How does that even happen? I blame this conference on NumPy. I'm probably preaching to the choir, but NumPy gives us an escape hatch from the slow language of Python by encapsulating an array of numbers. In Python everything is an object, so when you make an integer you get an integer object with something like 30 bytes of overhead on top. Instead of paying that overhead per value, NumPy says: I'll charge you overhead, but for a whole block of memory. So as long as you stay in the playground of NumPy, you get basically C speeds and close to C memory usage, without paying that Python penalty.

Awesome — we've got people doing financial modeling, very cool. Ten years of pandas — you should probably be giving this talk, Brian. Messing with time-series data — pandas has its background in finance, so it's great for time series. Rescue boat operations, that's cool. Earth science. Replacing Excel. Awesome.
So basically NumPy was that escape hatch. NumPy is great for arrays of data, but not necessarily for tabular or structured data like we see in Excel, spreadsheets, or SQL. pandas is a way to represent heterogeneous, columnar data, and pandas 1 — which I'll refer to as pandas 1 or legacy pandas in this talk — builds on top of NumPy.

pandas 2 came out a couple of months ago with a few new features. The main ones I think people should be aware of are PyArrow as an optional backend replacing NumPy — and I stress that this is optional right now; you have to tell pandas you want to use PyArrow — and copy-on-write.

What does PyArrow get us, at a high level? I imagine most of you are familiar with it, but Arrow is a standardized in-memory representation of tabular data. The idea is you can say "here is a blob that represents some tabular data", and tools that support the Arrow format can theoretically load that blob with a zero-copy read and access the data from another tool. I think that's cool, but for my use cases, do I jump ship because of an in-memory representation? Not necessarily. What it also gives us is that the PyArrow implementation ships a bunch of tooling for pandas- and DataFrame-like operations, and pandas 2 can leverage that. Because the Arrow specification is built from the ground up for tabular data, it can do a lot of things much faster.

Namely, what you get from Arrow as a pandas user is a better string type. In legacy pandas, because NumPy doesn't really have a string type to stick in an array, the column backing your strings was a NumPy array whose values were basically pointers to Python string objects. At a high level that's not much more efficient than a Python list with strings in it — that's essentially what it was, except the list was a NumPy array — so when you did string operations, you were working with Python strings and paying Python's overhead. PyArrow gives us a trimmed-down representation, so you'll probably see 50 to 70% lower memory usage if you use the PyArrow string type.

Then there's copy-on-write. pandas 2 fixed a lot of copy issues; the idea is being smart about whether an operation needs to copy data or can pass the original object along and only copy what changed. For folks doing a lot of chaining, copy-on-write is probably important — and if you've read my Effective Pandas book, you know I'm a pretty big fan of chaining.

Again, if you're following along at home with that notebook from GitHub you won't need to do this, but otherwise the main thing is: when you pip install or upgrade pandas you'll get pandas 2, but if you want to take advantage of PyArrow you need to explicitly install PyArrow as a dependency as well, since it's an optional backend for pandas.
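To make that concrete, here's a minimal sketch — the column values are made up — comparing a NumPy-backed object string column with a PyArrow-backed one, with copy-on-write switched on; it assumes pandas 2.x and pyarrow are both installed:

import pandas as pd

pd.set_option("mode.copy_on_write", True)  # opt in to copy-on-write behavior

words = ["spam", "eggs", "bacon"] * 10_000
legacy = pd.Series(words)                          # object dtype: pointers to Python str objects
arrow = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed string column

# deep=True also counts the Python string objects behind the object column
print(legacy.memory_usage(deep=True), arrow.memory_usage(deep=True))

On data like this, the Arrow column typically reports a fraction of the object column's footprint, in line with the savings mentioned above.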
So I'm going to run this piece of code here. This makes a DataFrame of some financial data: a cancel date for some product, when it started, when it ended, and the revenue and payments. We're using our friend pd.read_csv, and the key part for pandas 2 is saying dtype_backend="pyarrow". What does that do? It tells pandas to turn on PyArrow as the engine pandas uses behind the scenes. Once I've got that, the result looks like a pandas DataFrame; however, if I inspect the dtypes, we can see int64[pyarrow]. Note that it's using datetime64 for the dates, which, to be pedantic, is not a PyArrow type — timestamp[ns][pyarrow] is the PyArrow type. And it wasn't able to convert this cancel date: I told it to parse these columns and it made a best effort, but sometimes it falls over. So I'll go through another step and force it: convert the column to datetime after replacing the "NA" strings. Run that, and it works.

On that note, a couple of other things about the PyArrow backend. It gives us strings, which is probably the big thing, but also: a lot of people were annoyed that NumPy didn't support missing integer values, and Arrow supports missing data from the ground up. So you can have an int column with missing values in it — say in this revenue or payments column. In this case they don't have missing values, but if they did, the column could still be represented as an integer type. Here I'm jumping through some hoops to convert this to the PyArrow type, and you can see it is the PyArrow type.

VF says they're having audio-crackling issues — oh, that's not good. It's okay on our end, and generally my network is pretty good, so hopefully it's okay.

Okay, so again, I can change these types to explicit PyArrow types. Here I might do something like this, where I make a little function called tweak, using that chaining I was just talking about: I've got my DataFrame, I make this column, convert it to this type, and then convert these other columns to their types — a chain of operations. Generally I'll do this for most datasets I come across. I like to write my pandas code like this: put parentheses around it to escape the newline rules and write everything as one method after another. Do you have to? No, but it makes your code look like a recipe; I claim it makes your code easier to read and use. And once you get used to it, it will feel very natural to Polars people, because in Polars chaining is even more strongly encouraged: as we'll see, Polars can do query optimization across a chained query, which pandas, because of its eager execution, can't really do.

Okay, so this is my basic dataset — not particularly interesting; it shows five rows of data here.
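A minimal sketch of what that load-and-tweak chain can look like. The file name and the column names here are assumptions standing in for the billing data shown in the talk:

import pandas as pd

def tweak_bill(df):
    # chained "recipe": fix the stray 'NA' strings, then move to explicit Arrow types
    return (df
        .assign(cancel_date=lambda d: pd.to_datetime(d.cancel_date.replace("NA", None)))
        .astype({"revenue": "int64[pyarrow]", "sum_payments": "int64[pyarrow]"})
    )

bill = tweak_bill(
    pd.read_csv("bill.csv",                 # assumed file name
                dtype_backend="pyarrow",    # back the columns with PyArrow types
                parse_dates=["start_date", "end_date"])
)
print(bill.dtypes)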
Now I'm going to show a common operation. Say we're dealing with a company that offers a monthly subscription service; customers can cancel at any time and are billed at the beginning of the month. In this scenario we need to calculate the unbilled receivables at the end of each month: revenue the company has earned but not yet billed to the customer.

One of my pet peeves is those pandas articles you see on the internet — "top 20 things you can do with pandas". They're listicles: do this method, do this method — and one of the things they'll tell you about is apply. Apply is cool, they say, because you can pass in a function. So people think: okay, I'll just use apply for everything. You can, but the problem is — remember, Python is a slow language — the key to pandas is that we have this NumPy or PyArrow layer with the pandas API on top of it, and when we use apply we punch back through that layer into Python, which slows things down.

So you see people write something like this. It's actually a somewhat complicated operation — I've done more complicated ones — doing a bunch of work on each row in plain Python: if the cancel date is not missing and is less than the period, return NaN; if the start date is less than the period, the end date is greater than the period, and the revenue is greater than the sum of payments, return the revenue minus the sum of payments; otherwise return zero. Let's run it and make a new column called unbilled receivables — and it looks like it worked: a NaN up top, one at the bottom, and calculations in between that look right.

Like I said, a lot of these articles show this, and then I see people in the wild putting apply on everything, which is quite slow. What we really want is something like this: the vectorized version. It does some calculations up here, and then another calculation that is basically the translation of the row function. The awkward part is this where, which is kind of weird — it's not an if-then-else, it's more like an else clause, but you state it on what you want to keep. So we stick in what we want to keep, which is the "if" branch, and the other argument is the "else". You have to rejigger your logic, and — I'm fine saying things I don't like about pandas — I think where is a little awkward. But it works and gives the same result.

People ask: why do I care? So let's micro-benchmark the apply version against the vectorized version. The vectorized version is something like 200 times faster on my machine — a MacBook Pro with an M1 chip. The apply version takes about 600 milliseconds, roughly half a second; the vectorized one takes two milliseconds. Personally I can barely detect 600 milliseconds.
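Here's a rough sketch of the two styles — the row-at-a-time apply versus its vectorized np.where translation. The column names and the period variable are assumptions; the branching mirrors the rule described above:

import numpy as np
import pandas as pd

period = pd.Timestamp("2023-06-01")  # assumed billing period

def unbilled_row(row):
    # slow path: plain Python, executed once per row via df.apply(unbilled_row, axis=1)
    if pd.notna(row.cancel_date) and row.cancel_date < period:
        return np.nan
    if row.start_date < period and row.end_date > period and row.revenue > row.sum_payments:
        return row.revenue - row.sum_payments
    return 0

def unbilled_vec(df):
    # fast path: whole-column arithmetic; np.where takes the "keep" branch first
    cancelled = df.cancel_date.notna() & (df.cancel_date < period)
    earned = (df.start_date < period) & (df.end_date > period) & (df.revenue > df.sum_payments)
    return np.where(cancelled, np.nan,
                    np.where(earned, df.revenue - df.sum_payments, 0))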
But this is 100,000 or 150,000 rows; if you had a few more million rows, you'd get to where this actually makes a big difference. My general benchmarking advice, when people ask what they should use: use what is fast on your machine, and benchmark on the sizes of data you use in production, because for a lot of these things, changing the numbers — by which I mean the size of the data — changes the performance. I've also got a Numba implementation of this; for the sake of time I'll skip it, but it speeds things up a little more. And Angelo says on Codespaces it's four milliseconds versus 1.3 seconds — not quite sure what's going on with Codespaces there.

If you've got questions about pandas 2, throw them into the chat, and then I'll start talking about Polars and comparing and contrasting. I just wanted to give you an overview of what's cool about pandas 2 from my point of view: you get PyArrow — and while PyArrow itself, the idea of passing data around, is (sample size of one) not particularly interesting for my purposes per se — the speedups on string operations and the memory savings for strings certainly are.

Brian asks: any backwards-compatibility concerns or gotchas? Great question. At this point the PyArrow backend is optional; you have to enable it. Your legacy pandas code should mostly just run. Some things are a little more strict in pandas 2: if you do an aggregation and you've got string columns in it — say you group by something and take the mean — it will complain that it can't take the mean, and you have to add numeric_only=True to take the mean of only the numeric columns. Other than that, your legacy pandas code should work on pandas 2, and some of it should be faster without even touching the PyArrow types. Then you'll want to throw in that dtype_backend, which, if you have a lot of string data, will probably save memory and improve your run times.

As for other gotchas: theoretically a lot of things should just work; in practice most things worked, I'll say, but there are some rough edges. Like I said, I'm finishing Effective Pandas 2, and it has what I consider more real-world examples rather than "20 things you can do" listicles — I don't care about 20 individual things; I care about "here's my data and I want to clean it up", and it's not one thing I do but a bunch of things chained together to clean or analyze it. Do those chains work with pandas 2? For the most part, yes. There are gotchas here and there — for example, with a PyArrow timestamp in the index, exporting can fall over. There are some weird things; I've filed a bunch of bugs with the pandas folks.
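To make that stricter-aggregation gotcha concrete, a toy sketch (made-up data):

import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"],
                   "name": ["x", "y", "z"],   # a string column caught up in the aggregation
                   "score": [1, 2, 3]})

# df.groupby("team").mean()                        # pandas 2 raises: can't take the mean of 'name'
print(df.groupby("team").mean(numeric_only=True))  # aggregate only the numeric columns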
And a lot of them have been addressed — it's kind of cool that the pandas developers are open to fixing those things up. So at a high level, most things should work, with maybe some gotchas. If you are a heavy pandas user, it's probably worth trying your code out to see if you get memory savings or speed improvements — you probably should, especially if you have copy-on-write and chaining going on.

Brandon asks: any reason not to use PyArrow? Again, there might be some regressions. Ninety-plus percent of your code should work, but there may be corner cases. My experience is that the pandas developers have been pretty swift to address them — they don't want there to be issues — but it's a big change to pull off, and people do a lot of different things with pandas, so coverage might not be complete.

Great questions. "Sorry Matt, by the way, there's another question in the Q&A section." Oh, awesome, let me address that: "Since I use pandas a lot, I wanted to try Polars for lazy evaluation, the query optimizer, and the native backend, but all my logic relies a lot on the pandas index. How can I migrate to Polars or handle the missing index?" — that's from Luca. Great question. At a high level, Luca: it's easy to go from pandas to Polars and back, and if you're using PyArrow in pandas 2, theoretically you get zero-copy conversion. So you can do the index-based things in pandas and stick the data into Polars where you want. It would help if you told me what specifically you're using the index for; then I can possibly give you a better answer — put it in the chat if you want, Luca.

Okay, let's talk about Polars. The creator of pandas wrote a blog post many years ago about things he hates about pandas, and I'm going to contrast that with what the creator of Polars says. Ritchie Vink, the creator of Polars — I heard him speak recently, and I think he described Polars as a query engine with a DataFrame API on top of it (it has a SQL API on top as well). I think that's fundamentally different from legacy pandas, and even pandas 2. pandas 2 has a DataFrame API, but does it have a query engine? Not really — it's an eager engine: we represent this data, and when you do something, we do the operation. Polars, by contrast, says: we've got 50 years of database research — query optimization, predicate pushdown, things like that — and if you squint, the operations we do to DataFrames are the same operations people have been doing to databases for many years, so why not leverage that technology? And going back to that "10 things I hate about pandas" post, Ritchie claims Polars has addressed all ten of those drawbacks of pandas.
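On Luca's question, a small sketch of round-tripping between the libraries. With Arrow-backed columns on the pandas side the conversion can often avoid copying, and reset_index keeps the pandas index around as an ordinary column (names here are made up):

import pandas as pd
import polars as pl

pdf = (pd.DataFrame({"price": [10.0, 12.5]},
                    index=pd.Index(["a", "b"], name="ticker"))
         .convert_dtypes(dtype_backend="pyarrow"))

ldf = pl.from_pandas(pdf.reset_index())      # the index becomes a regular 'ticker' column
# ... do the heavy lifting (group-bys etc.) in Polars ...
back = ldf.to_pandas().set_index("ticker")   # restore the index on the way back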
I kind of view it as: pandas was written as "here are things I want to do to data, and here's an API that lets me do that". Coming back at it later, you could re-engineer that from the ground up and do things a little better.

Polars also leverages Arrow, but it does not use PyArrow; it uses a Rust implementation of Arrow. That's one thing to be aware of: where pandas has NumPy or PyArrow under the covers, Polars has Rust under the covers. Most of the implementation is in Rust, with a Python wrapper around it — but as you'll see, there's nothing particularly Pythonic about the wrapper, and you could imagine adapting it to any language, because for better or worse it doesn't really use features of Python.

A lot of libraries have said "we're going to adopt the pandas API", because for better or worse pandas is the tool people use in data science, and its API is kind of the API of tabular data science. Polars threw that out: the API as-is has some of these issues, so they discarded it, and Polars is pretty different from pandas. It has this notion of expressions and contexts. You can think of an expression as representing a column, or operations on a column or columns, and the context is where you execute those operations. So you can have an expression that does a comparison: if I use it in a select, it selects the column as booleans, but if I use it in a filter, it returns the rows where the booleans are true — I'll show some examples of these expressions below. The API being completely different means that for people asking "can I use Polars instead of pandas?" — yes, you can, but you've got to port the code. It's not like cuDF, or Spark (Spark implements the pandas API), or Modin, which aims to implement even pandas' bugs because it wants 100% API coverage. That's not the case at all with Polars.
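A tiny sketch of that expression-and-context idea — the same expression behaves differently depending on the context it runs in (toy data):

import polars as pl

df = pl.DataFrame({"name": ["ann", "bob", "cat"], "revenue": [10, 0, 25]})

expr = pl.col("revenue") > 0   # an expression: no data attached yet

print(df.select(expr))         # select context -> a single boolean column
print(df.filter(expr))         # filter context -> only the rows where it is true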
Polars does have lazy evaluation, so it can run in a lazy mode; it also runs eagerly by default, depending on how you create a DataFrame. There is a notion of a DataFrame — a class called DataFrame — and a class called Series too, though it turns out we basically never use Series in Polars; we use a DataFrame or a LazyFrame. The LazyFrame is one of the key value-adds of Polars. The way I think of it, a DataFrame is an eager, materialized piece of data — it has all the data in it — whereas a LazyFrame logically represents data but doesn't hold rows and columns as such. It's a directed acyclic graph of operations over a logical representation of the data, and when you tell it to materialize or execute, it walks that graph, does some query optimization, and then executes and hands you the data.

Because Polars is written in Rust and designed from the ground up to be smart about this, it knows how to max out your cores. Like I said, it's described as a query engine: it has worker queues for every thread, and it can do smart things. Say we're in a group-by operation — that parallelizes pretty easily, so for each CPU there's a worker queue, the operations get thrown onto those queues and run through, and if one queue finishes early, work can be moved over from the others to keep the CPUs maxed out. That's the query-engine side, which pandas doesn't really have at all, and it's one of the big differences. People ask: with Python dropping the GIL in the future, or with sub-interpreters, could Python do that with pandas? To some extent, but you'd be using Python to manage those threads, whereas in Polars this is all done in Rust, so you don't have that overhead at all; you get fast times for doing that.

Okay, I'm going to go back to my same example. Let's see — there's a question from Ade: does Polars run in a distributed fashion? Great question; let me address it at the end with a table, but basically no, it's not distributed. I'll go over that when I summarize.

So let's load our data. This actually looks like pandas — it isn't, but it looks like it: we have the Polars library and we say read_csv. And the result looks like a DataFrame; this is actually eager, just a normal DataFrame. A couple of things to note: it shows you the type of each column — these are Arrow strings — and there is no index on the left-hand side. One of the things the Polars folks decided is that you don't really need an index. I know a lot of people use an index for various things, but Polars says: databases do have indexes, but they're fundamentally different from the notion of a pandas index. You can think of a pandas index not as a database index but as a row label, and Polars says you don't need a row label — if you want one, make a column that holds that information. So again, to Luca's question: if you give me some insight into what you're using the index for, I can probably answer a little better.

As I just ran into, the API is different, though there is some consistency. For example, we can ask what the dtypes are, and this looks different from pandas — in pandas you'd get back a Series; generally we don't see Series very often in Polars, and here you get a plain list. We can see these columns are utf8, not dates, so next I'll convert them to dates — and how that looks is completely different from pandas.
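A minimal sketch of that eager load, using the file name assumed earlier:

import polars as pl

bill = pl.read_csv("bill.csv")   # eager: returns a materialized DataFrame
print(bill.head())               # dtypes are printed under the column names; note: no index column
print(bill.dtypes)               # a plain Python list of dtypes, not a Series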
So we have this with_columns, which is a context, and inside with_columns we run expressions. You can think of with_columns like the assign method in pandas. I'm saying: let's represent a column, or a group of columns — we use this pl.col call to create an expression. Here I'm naming a bunch of columns, and the expression engine has an attribute called str, kind of like pandas, so we can parse a datetime with this format and convert to a date. When we run it — this is eager mode, so it just executes — you can see the columns are converted into dates now.

An interesting thing to note: this says null. One thing that's a little different, and maybe confusing, in Polars is that it has both null and NaN. That can confuse pandas users, because pandas has isnull and isna as synonyms (I kind of wish pandas didn't have isnull at all, since nowhere else in pandas do we refer to "null" things). In Polars, NaN and null are different: NaN is the floating-point value you get from a calculation like 0 divided by 0 — what NumPy calls NaN — while null means there is no value there. And there are separate operations, is_nan and is_null, which is a little different.

Now let me show the operation we did with apply, done in Polars. Polars doesn't have an if-then-else, but it has essentially the same thing: when, then, and otherwise, which you can chain together. I really wish pandas had syntax like this — it doesn't; there's a feature request, but they haven't done it so far. I can read this as "when this is the case, then...". Note that I'm not passing in the DataFrame: I just say pl.col, meaning there is some logical thing that is a column named cancel_date. It isn't pointing at the data at all; it's an expression. We build up operations on these expressions and then tell the query engine to figure out how to tie the expression to the actual data and run it. So: when this is true, we want None as the result; when this is true, we want revenue minus sum of payments; otherwise we want zero.

Down here is what I did to convert my columns, and then I say with_columns again — think of it as assign — call this function, and make a new column from it. Now, you might wonder: the documentation for with_columns shows you can pass multiple expressions into one with_columns — this is just one expression, but you could have several — so why can't I put all of this in the same with_columns? The issue is that the query engine tries to max things out, and with_columns says "I will create all of these columns at the same time". Because the unbilled calculation depends on the columns already being actual dates rather than strings, we need a mutex of sorts: the date conversion has to execute before the next column creation runs, so we can't stick that expression in the same with_columns call. But looking at the output, we get the same results.
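A condensed sketch of those two steps — date parsing in one with_columns, then the when/then/otherwise rule in a second one, since the rule depends on the parsed dates. The column names, the values, and the period are assumptions:

import polars as pl

period = pl.date(2023, 6, 1)   # assumed billing period

bill = pl.DataFrame({"cancel_date": ["2023-01-15", "NA"],
                     "revenue": [100, 200],
                     "sum_payments": [100, 50]})

out = (bill
    # step 1: strings -> dates ('NA' becomes null with strict=False)
    .with_columns(pl.col("cancel_date").str.to_date("%Y-%m-%d", strict=False))
    # step 2: the unbilled-receivables rule, chained when/then/otherwise
    .with_columns(
        pl.when(pl.col("cancel_date") < period)
          .then(pl.lit(None))
          .when(pl.col("revenue") > pl.col("sum_payments"))
          .then(pl.col("revenue") - pl.col("sum_payments"))
          .otherwise(0)
          .alias("unbilled")
    ))
print(out)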
Now, converting from pandas is really easy — I just say from_pandas — and then I'll run my timeit. Interesting to note: this takes 5.7 milliseconds on my machine, and I think we were at about two milliseconds above. So this is actually slower in Polars than in pandas, and I'm not quite sure why. You see a lot of claims that Polars is fast — and it's not that Polars is slow; Polars is fast, it's a more intelligent query engine and can run things on multiple cores — but in this case it wasn't faster. One of the pandas core developers wrote a blog post kind of ranting about some of the benchmarks you see — "Polars is 5 to 10x faster than pandas on standard benchmarks" — and when they went through the code: well, these people were doing things in a dumb way with pandas, and if you speed that up, you get the same order of magnitude as Polars. Generally Polars is faster, and where it really shines is group-by operations: a group-by can happen on multiple cores, and pandas isn't smart about that, so that's where Polars speeds things up. So, to Luca's point: can we take advantage of Polars' speed? Yes. Because both use Arrow as a backend, you can get these nearly copy-free conversions from one to the other, do your operations in Polars where pandas is slow, and push back into pandas as needed for the index operations. But the biggest speedups will be in things like group-bys.

I haven't really covered laziness yet, so let me give an example of being lazy. I'm going to write this data out to disk, and here I have basically the same tweak function from above — note that I follow the same pattern in Polars as in pandas: a little function to clean things up. But instead of read_csv I use scan_csv, and scan_csv does not return an eager, materialized DataFrame; it returns a LazyFrame. When I run this — the same code — it runs, but instead of giving me results it gives me a query plan. You can see what it's doing. This one isn't particularly interesting because it isn't doing much, but imagine you have a filter, you select certain columns, maybe you do a group-by. At this point, running this hasn't even loaded the CSV file — it's just "this is what I would do". If you have various filters and selects, it's smart enough to know that when it reads the CSV it only needs to read certain rows and certain columns. Based on those operations it does things like predicate pushdown, and you get that specifically because there's a query engine on the back end, because it's lazy, and because you're chaining: you chain the operations together, and it can optimize across all of them. With pandas you can chain, but does that give you optimization? No — you have to do it manually, like our core-developer friend did when they looked at those benchmarks.
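A sketch of the lazy version, same assumed file; explain() prints the optimized plan so you can see the pushdowns before anything is read:

import polars as pl

lazy = (pl.scan_csv("bill.csv")                 # LazyFrame: the file has not been read yet
          .filter(pl.col("revenue") > 0)        # becomes a predicate pushed into the CSV reader
          .select(["cancel_date", "revenue"]))  # projection pushdown: only these columns get read

print(lazy.explain())   # the optimized query plan
df = lazy.collect()     # now the engine actually reads and materializes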
But if you're smart about how pandas does things, for a lot of operations you can get very close to Polars' speed.

Okay, so, to the question "does Polars run in a distributed fashion": no, it does not. I distinguish pandas the library from pandas the API, because there are things that scale the pandas API out in a distributed fashion — Modin, Dask, or that company I worked with, Ponder, which took the API and let you run it on things like BigQuery or Snowflake. So the pandas API has an answer from small data up to medium or large data. My definitions, without getting into the weeds: small data fits into RAM; medium data maybe doesn't fit into RAM but fits on a single machine; big data doesn't fit on a single machine. The pandas API kind of has an answer for all of those scenarios, whereas Polars has the small-data answer, and it does have some spillover — there's streaming functionality to stream from disk — so it covers small and medium data, but it doesn't really have a big-data answer. There is a company behind Polars, and I imagine Ritchie is probably working on something there, but as of right now, when you work with Polars, you're working with small or medium data.

"Matt, there's another question; I'll show it on screen." Got it — Nishan, thanks: is there an advantage to using Polars over PySpark? So far Polars looks very similar to the PySpark engine — lazy evaluation, etc. Great question, Nishan. We could push this to the DuckDB part, but I'll answer it here, in the context of DuckDB instead of Polars — you can adapt it. Can you run PySpark on a single node? Yes, but it won't be as efficient as Polars. Long story short: DuckDB has actually implemented a PySpark interface on top of DuckDB. That might seem weird, but their thesis is that a lot of people jump the gun on going to big data — "we need Hadoop" back in the big-data days, then Spark came out and "we need Spark, because that's what the cool people are doing" — and it turns out a lot of people have data that fits on a single machine, or can rent a big single machine in the cloud and fit the data there, and never deal with the distributed problem and the slowdown of running over a network. The DuckDB folks' thesis is that many people run Spark who don't need to, with a lot of overhead and a lot of code; wouldn't it be nice to take your Spark code and just run it on a single machine, but run it fast? They're doing the reverse of Modin and Ponder: those took the pandas API to big data, while DuckDB takes this big-data API to small data. And there was an experiment — I think they benchmarked a Spark query running on Polars (again, the usual caveats about benchmarks) — where, to replicate on a cluster the speed and performance they got on a single machine, they needed 30 machines.
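The DuckDB Spark compatibility layer is experimental; here's a minimal sketch along the lines of DuckDB's own announcement, assuming a recent duckdb release that ships the duckdb.experimental.spark module:

import pandas as pd
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()   # backed by DuckDB in-process, not a JVM cluster

df = spark.createDataFrame(pd.DataFrame({"age": [34, 45], "name": ["Joan", "Ben"]}))
df = df.withColumn("city", lit("Seattle"))   # familiar PySpark-style API
print(df.select(col("name"), col("city")).collect())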
So again, this just demonstrates that if you can keep things small or medium — and DuckDB also has intelligent spillover to disk, so it handles what I call medium data — there are pretty big performance gains from not having to manage a cluster, its infrastructure, and its cost, and from just running things at a quicker clip. I think the same applies to Polars: you can run Polars code on a single machine, though again the Polars API is different. Interestingly, Spark itself has a pandas API for compatibility, moving toward that pandas API layer — and again, pandas isn't really built for laziness. I kind of hope pandas gets there, but I'm not sure it will ever get to that lazy query-engine side.

Okay, great questions. Paulo asks: what are the advantages of Polars versus Daft? I'm not familiar with Daft — it looks like Daft is another DataFrame engine. I feel like I know a lot of these DataFrame engines, but I can't really comment on this one; I'll have to check it out, thanks for the question. You'd think someone who makes their money telling people about DataFrames would know it.

Okay, for the final portion I want to talk about DuckDB, which might seem weird — why talk about a database in a talk about DataFrames and tabular data? But think about what I just said about Polars: Polars is a query engine with a DataFrame API on top. Well, a database is a query engine, and I'd say one of the most advanced analytical query engines right now is actually DuckDB. For those unfamiliar with it: where SQLite is an in-process transactional database, DuckDB is an in-process OLAP — analytical, columnar — database. Basically, you can run SQL in-process. Why would you want to? Well, if you can represent these tabular operations in SQL, you can run them in SQL, and people have been using SQL for tabular manipulation and querying for many years. It turns out there's a powerful engine with years of research and work behind it, and it's relatively easy to use. And actually, Polars ships a SQL API, so Polars can be a bit like SQL, while DuckDB, as I said, has a Spark API on top. So there's a crossover: a lot of people doing DataFrame work like DataFrame syntax, a lot of people like SQL — can we get the best of both worlds?

A little background: DuckDB is a C++ implementation and a single binary, so you don't have to install a lot of dependencies — in Python you just say pip install duckdb — and you don't have to start up a server and manage a connection; you sort of get it for free, which is nice.
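A minimal sketch of how little ceremony "in-process" means in practice, using the plain Python API rather than notebook magics:

import duckdb

con = duckdb.connect()                  # in-memory database: no server, no connection setup
con.sql("SELECT 42 AS answer").show()   # run SQL and pretty-print the result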
The other thing is the operations themselves. If you go through my Effective Pandas book, there are a lot of group-bys and melts and pivots and things like that, which traditionally have been painful in SQL. How do I know? Because, like I said, back around 2006 I wrote an OLAP engine on top of MonetDB — I'm a contributor to SQLAlchemy because I wrote the backend for MonetDB, which I'd call the spiritual lineage of DuckDB — and a lot of operations that are trivial in DataFrames are traditionally a pain in SQL. DuckDB has an extended SQL syntax that makes things like window functions and pivoting really easy. So you can do the operations you'd do with pandas or Polars in DuckDB, with a query engine that is probably faster and more optimized than the Polars query engine. Another thing: until recently Polars had a few developers working on it, whereas DuckDB has a lot of developers, and DuckDB got some absurd amount of funding — which may or may not be good, depending on your view of VC — but there's a lot of investment into improving DuckDB, more, I would say, than the manpower behind pandas and Polars combined. Which is kind of cool.

For our purposes, you can pip install duckdb. Since I'm in Jupyter, I'm also going to use the JupySQL plugin, which is nice. I'll configure that and import duckdb, and once I have the %sql magic I can connect — like I would with SQLite, I point it at an in-memory DuckDB — and just say SELECT 42 with the line magic. That gives me what looks like a DataFrame, because I have the autopandas option on, which converts results to pandas DataFrames. There's a cell magic in addition to the line magic, and I can also capture a result into a variable; here's my df variable from running the query.

Let's do something more involved: drop the bill table if it exists, then use read_csv_auto. DuckDB gives us a bunch of functions that make traditionally painful operations easy. Run the read_csv_auto, and the result is a DataFrame that says success. At this point there should be a table called bill, so let's query it — SELECT * FROM bill LIMIT 5 — and there we go: it looks like a pandas DataFrame, again because results come out as pandas, but this is our data, the dataset we've been looking at. You might want to double-check things — this actually says NaT, not-a-time, so we're probably okay — and we could inspect the pandas dtypes and capture them, but we really want to see what's going on inside the database. It turns out there's an information schema: DuckDB eats its own dog food, I guess — a lot of databases use databases to implement the database — so the metadata is stored in the database itself.
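Outside of the notebook magics, the same load works in plain Python; file name assumed as before:

import duckdb

con = duckdb.connect()
con.sql("DROP TABLE IF EXISTS bill")
con.sql("CREATE TABLE bill AS SELECT * FROM read_csv_auto('bill.csv')")  # sniffs types and headers
print(con.sql("SELECT * FROM bill LIMIT 5").df())   # .df() converts the result to pandas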
So there's a tables table, which is kind of interesting. And there's an extended SQL syntax here: if you want to SELECT * from a table, you can just say FROM table — you don't have to write SELECT * — which is nice, and you can put WHERE clauses and so on on it. So this just says FROM, effectively SELECT *, and it shows we have a table called bill — the one I just loaded. There's another table called columns, so I say FROM columns, and we can see: here are our columns, they look good, and here are the types — and the types are dates. That's actually nice: Polars and pandas struggled with those dates somewhat, more or less, and DuckDB just slurped them up without much of a problem.

Dominique asks: what exactly do you mean by medium-sized datasets? Here's my very basic description, Dominique: small data fits into RAM on a single machine; medium data is larger than RAM but fits on disk on a single machine; big data doesn't fit on that. Tools like Polars and DuckDB have intelligent spillover to disk, hence the support for medium data, but neither really has the scale-out for big data — a cluster of DuckDB or a cluster of Polars — if that makes sense. Dominique, let me know if that answers your question.

Okay, let's do our complicated query — the unbilled receivables — with our friend the CASE statement. This actually looks pretty readable compared to the pandas where version: when this is true, return this; when that is true, return revenue minus sum of payments; otherwise return zero; and otherwise return NULL. Run it, and there are our unbilled receivables, computed over our 100,000 rows, and it looks like it's doing the right thing.

DuckDB also has Arrow integration. One of the cool things about DuckDB — it has a ton of features — is that in addition to being a query engine, it can pull in different things to query against: it knows how to query directly against Polars or pandas, so you don't even have to copy the data in; its query logic runs against them directly, which is cool. So here I read our bill data into a Polars DataFrame, and then, instead of querying the DuckDB connection we made, I query this Polars DataFrame — SELECT * — and note the .pl() at the end, which converts the result to a Polars DataFrame. Just for fun I'll drop the SELECT * since I don't need it; we can just say FROM bill. So we can take Polars into DuckDB and get Polars back out. And we can capture the result and convert it to pandas as well, by saying .df() at the end.

Addie asks: where can I get a copy of the temp bill file? I wrote that out earlier — you just need to run the cell. Let me find it for you: it's right here, in the lazy-evaluation-and-query-optimization part of the Polars section. If you run that cell, it writes the temp file for you.

Okay, so that's kind of cool.
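A sketch of those ideas together — DuckDB resolving an in-scope Polars DataFrame by name (a "replacement scan"), the CASE rule, and .pl() to get Polars back out. Toy data, assumed column names:

import duckdb
import polars as pl

bill = pl.DataFrame({"revenue": [100, 200, 300], "sum_payments": [100, 50, 300]})

# duckdb.sql can resolve 'bill' to the local Polars DataFrame without copying it in
out = duckdb.sql("""
    SELECT *,
           CASE WHEN revenue > sum_payments
                THEN revenue - sum_payments
                ELSE 0 END AS unbilled
    FROM bill
""").pl()        # materialize the result as a Polars DataFrame (.df() would give pandas)
print(out)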
At this point, if you have your data in Polars or pandas and your people like SQL, they can write SQL directly against it. And if we need to, we can stick results back into pandas — for example, to view them. One thing neither Polars nor DuckDB does, but pandas does, is visualization: pandas has Matplotlib integration, which I think is awesome. I actually prefer to create Matplotlib plots through pandas rather than using Matplotlib directly, because I think the pandas interface is easier and more intuitive, and I often have my data in pandas DataFrames already, so why not? I've found bugs in code just from visualizing it, so I think visualization is awesome. Polars and DuckDB don't have visualization, so if I want to visualize something, I stick it into pandas with one line of code — .df() — and then use pandas' .plot, very easily.

So that's my DuckDB piece. Again, DuckDB also has that Spark interface, so I think this is going to be an interesting space to watch. What we've got is pandas as the lingua franca — and, to spitball, this is looking like "no one got fired for buying IBM" from a few years back: pandas is established technology that a lot of people use. There are companies with millions of lines of pandas code; they can't go directly to Polars or DuckDB without engineering work to rewrite it, and generally people don't rewrite things just to rewrite them if they're working — that's not an effective use of engineering resources. We want to build new and better things, and if something works on an older technology, leave it be and let it run. So I don't see folks migrating to Polars just for fun; there has to be a compelling use case.

Another thing to realize: pandas has had a 1.0 release, while Polars' release cadence is on the order of one or two releases a month, and I have had cases where my Polars code broke because the API changed. There hasn't been a 1.0 release of Polars — not that I expect huge breaking changes, but this thing is moving pretty fast. pandas releases come every now and then; it's not the cadence or clip Polars is going at. That's not to say the pandas developers aren't responsive — they certainly are; like I said, for Effective Pandas 2 I filed multiple bugs, and most were addressed very swiftly. So if you do have issues, I highly recommend doing your good turn as an open-source user and filing those bugs.

Let's see — Giles says there's some API weirdness with pandas and PyArrow, he thinks due to an incomplete implementation: the .dt accessor, and PyArrow having durations rather than timedeltas; hopefully that situation will improve. Yeah, some of those PyArrow pieces are a little rough. My advice there, Giles, is to open up those lines of communication with the pandas core developers.
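Going back to that one-line hand-off for plotting, a sketch; the query just manufactures toy data, and matplotlib needs to be installed for .plot to work:

import duckdb

pdf = duckdb.sql("SELECT range AS x, range * range AS y FROM range(10)").df()
ax = pdf.plot(x="x", y="y")   # pandas' Matplotlib integration draws the chart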
Okay, so that is my DuckDB overview. Again, DuckDB also has that Spark interface, so I think this is going to be an interesting space to watch. What we've got is pandas as the lingua franca, and, to spitball a bit, this is kind of looking like "no one got fired for buying IBM" from a few years ago: pandas is established technology that a lot of people are using. There are companies that have millions of lines of pandas code; they can't really go directly to Polars or DuckDB without doing some engineering to rewrite it, and generally people don't like to rewrite things just to rewrite them if they're working. That's generally not an effective use of engineering resources: we want to create new and better things, and if things are working on an older technology, just leave it be and let it run. So I don't see folks migrating to Polars just for fun; there has to be a compelling use case.

Another thing to realize is that pandas has had a 1.0 release, while the cadence of Polars releases is generally on the order of one or two a month, and I have had cases where my Polars code broke because the API changed. There hasn't been a 1.0 release of Polars yet. Not that I think there will be huge breaking changes, but this thing is moving pretty fast. Pandas releases come every now and then, nothing like the clip Polars is going at. That's not to say the pandas developers aren't responsive; certainly they are. Like I said, for Effective Pandas 2 I filed multiple bugs, and most of them were addressed very swiftly. So if you do have issues, I highly recommend doing your good turn as an open-source user and filing those bugs.

Let's see, Giles says there's some API weirdness with pandas and PyArrow, he thinks due to an incomplete implementation: the .dt accessor, and PyArrow having durations rather than timedeltas; hopefully that situation will improve. Yeah, some of these PyArrow things are a little rough. My advice there, Giles, is to open up those lines of communication with the pandas core developers. And if you need to, it is certainly possible to mix and match: you can have a pandas DataFrame that has some PyArrow columns and some NumPy columns and use both of them. So if the PyArrow data isn't working for you, you can do the NumPy stuff and be relatively fine, depending on what you're trying to do. The date handling in NumPy is pretty fast and optimized, so I don't think you'll necessarily lose performance there. Where the pain might come is if you're one of those folks who needs to move the data into something else, like Polars or DuckDB; that might be a little more tricky, or you might have an actual copy to do if it's NumPy instead of PyArrow under the scenes.
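A minimal sketch of that mixing and matching, assuming pandas 2.x with pyarrow installed (the data here is made up):

```python
import pandas as pd

# One DataFrame, two backends: a PyArrow-backed column sitting next to a
# default NumPy-backed one. Both work side by side.
df = pd.DataFrame({
    "city": pd.array(["Oslo", "Lima", "Kyiv"], dtype="string[pyarrow]"),
    "temp": [12.5, 19.0, 8.25],  # plain NumPy float64
})
print(df.dtypes)               # shows the per-column backing
print(df["temp"].mean())       # NumPy-backed math works as usual
print(df["city"].str.upper())  # string ops on the Arrow-backed column
```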
Angelo says, probably tangential, but: any advice on how you deal with variables read as strings in DuckDB? He needs to mass-rename a variable that his language server cannot take care of when it appears inside strings. I'm not really parsing what your issue is, Angelo; maybe if you could describe it differently, I might be able to help.

GMA says: as a beginner, it seems like there are a lot of ways to do similar things; for small and medium data sets, is it best to start with one tool and learn it well, and if so, are there any you would recommend initially? Okay, that's a great question. I'm going to scroll up to this little chart that I made, which I hope is useful. At the bottom we have scale, from small to medium to big, and on the left-hand side we have API. Thinking about it, it should be a little different: this blue box represents the pandas API, not the pandas library itself; Spark is the pink one; and SQL is the orange one at the bottom. Because Spark actually implements some of the pandas API, there should be some overlap between the pink and the blue. You can see that cuDF, Modin, and Dask go outside the box a little: cuDF and Modin aim for close to 100% API compatibility, while Dask has some features that are different because it has some laziness in it. You see that pandas and cuDF are small-data tools, and cuDF needs to run on your GPU. Modin will actually run on small data, but because it carries overhead for scaling data out, that overhead slows Modin down relative to pandas on really small data; when you start doing a bunch of group-bys, though, Modin sees improvements over pandas. So depending on your workload, Modin might be an improvement on small data, whereas Dask, yes, you can use it on smaller data, but you're probably going to want to push Dask onto big data.

Then Spark here, my face is kind of covering it on the chart, is sort of big data in SQL: you can do SQL in Spark, but you can also use the Spark API, which actually has some pandas compatibility for big data. BigQuery, Google's big SQL offering, actually recently announced that they're getting a pandas API as well, so you can run your pandas code on BigQuery with that. And then Snowflake: they bought the company Ponder, which is basically running pandas on top of Snowflake. Snowflake hasn't officially released that, so I don't speak for them and I'm not sure exactly what their plans are, but the technology exists to run pandas on top of Snowflake as well. Then Polars does have SQL, Polars works for small and medium data, and Polars has its own API: it's not the Spark API, it's not the pandas API, it's its own API. And DuckDB has SQL, it's great at SQL, it has the Spark API, and DuckDB actually has another API, which it calls the relational API, which again is neither pandas nor Polars but another API as well.
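Here's a minimal sketch of what that relational API looks like (hypothetical data; method chaining on a relation instead of one SQL string):

```python
import duckdb
import pandas as pd

scores = pd.DataFrame({"team": ["a", "a", "b", "b"],
                       "pts": [3, 1, 2, 5]})

# DuckDB's relational API: build the query by chaining methods,
# passing SQL fragments as strings.
rel = (
    duckdb.from_df(scores)                         # wrap the DataFrame as a relation
    .filter("pts > 1")                             # WHERE pts > 1
    .aggregate("team, sum(pts) AS total", "team")  # GROUP BY team
    .order("team")                                 # ORDER BY team
)
print(rel.df())  # materialize back into pandas
```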
So I think those are two axes for looking at this space. Are they the only axes? There are other things you might want to look at, like the longevity of the product, how established it is, whether you have commercial support, or the available developers; those might be questions you want to ask. If I'm giving advice to someone like GMA who is a beginner wanting to break into the market, I think you're going to want to learn pandas to start off with, just because there are going to be a lot more opportunities with pandas, and you should probably have some SQL as well. That's where I would start.

David asks whether Effective Pandas 2 uses the same data sets as the original edition or all new examples. Yeah, David, Effective Pandas 2 does have some new chapters, but for the most part it takes the existing code and runs it in pandas 2, which has some differences. So if you've got Effective Pandas 1, some of the content will be similar, but I've had to go through every piece of code and make sure that it works with PyArrow as the backend. I think that for those who are heavy pandas users, seeing here's what the legacy pandas 1 code looked like and here's what it looks like in pandas 2 is actually useful, particularly for professional pandas users looking to port their code. And it has additional chapters on testing, debugging, refactoring, and interoperation.

[Moderator: I just wanted to give you a heads up that you have about five minutes remaining.] Thank you. Okay, so you're going to kick me off in five minutes; folks, you've got five minutes left to ask any questions you might have.

Angelo says: sorry, I meant the following: a DuckDB query string like "FROM df LIMIT 5" references a Polars or pandas variable, but it's encased in a string, so you lose the semantic representation of it as a variable, and you need to mass-rename it. Oh, so you need to do some search-and-replace, and when you rename that variable, your language server in VS Code doesn't rename it inside the SQL query. Yeah, it is what it is. I guess if you're concerned about that, you could do something where, before you run your DuckDB query, you make an alias pointing at that variable and then always run your DuckDB query against the alias you created the line before. That way you can be consistent and get the refactoring, because you're extracting the variable name by just making another alias to it. I don't know if that helps or not, Angelo.
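As a sketch of that aliasing workaround (the DataFrame and names here are hypothetical): DuckDB's replacement scan finds any in-scope variable by name, so the SQL string only ever mentions the alias, and renames happen on a single Python line.

```python
import duckdb
import pandas as pd

quarterly_sales = pd.DataFrame({"amount": [10, 20, 30]})  # hypothetical data

# Alias the DataFrame on the line before the query and reference only the
# alias inside the SQL string. Renaming quarterly_sales now means touching
# this one assignment; the SQL text never changes.
tbl = quarterly_sales
result = duckdb.sql("SELECT sum(amount) AS total FROM tbl").df()
print(result)
```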
Okay, let me know what other questions you have. You know, folks say, "Which of these technologies should I use?" and the answer is an unsatisfying "it depends." If you've got a million lines of pandas, it's probably not going to make sense to migrate to another system; you probably want to stay in pandas land. If you're an established big company looking to hire a bunch of developers, Polars might not be an option, because there probably aren't a lot of Polars developers out there; SQL might be an option, since there are a lot of SQL people, and pandas might be an option as well. If you're a small startup and you want to do things quickly and have things that run fast, Polars might be an option. Or maybe you say, "I want to use DuckDB, and then I can use the Spark interface if I need to scale out." Or you could say, "I'm going to run it on cuDF; cuDF is going to be faster than DuckDB or Spark or Polars, so we'll stick with the pandas API and use cuDF to get really fast speedups." I didn't demonstrate cuDF here, but again, that's NVIDIA's implementation of basically the pandas API, running your pandas code on top of the GPU. If you're doing something like renaming string columns or doing replacements on them, that's a slow operation on a CPU; on a GPU with 600 cores, each of those cores can change a different value, so it can run 600 times faster on the GPU.

Okay, thanks everyone so much. Again, you can follow me on social media; I'm on LinkedIn, I'm on Twitter. If your team needs training with pandas or Polars or DuckDB, please reach out: Matt at MetaSnake. Again, Effective Pandas 2 is coming out soon, and Effective Polars is in the works as well. Thanks so much for your questions, have a great rest of your conference, and thanks for letting me be on here. [Host: My pleasure, yeah, thanks everyone.]

Info
Channel: PyData
Views: 4,896
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, learn, software, python 3, Julia, coding, learn to code, how to program, scientific programming
Id: vy8VrhaYR2M
Length: 82min 15sec (4935 seconds)
Published: Wed Jan 31 2024