Juan Luis - Expressive and fast dataframes in Python with polars | PyData NYC 2022

Captions
Thank you all for coming, and thank you for the introduction. Welcome, everybody, to "Expressive and fast dataframes in Python with polars". I tend to talk a little bit too much, and I do want you to ask questions and have some discussion, so I will try to be quick.

This is the outline of the talk: a bit of an introduction first of all, then a very brief look at pandas, its success but also some of its limitations, then a quick mention of some of the alternatives that are available these days, and then an actual analysis carried out using Polars, before drawing some conclusions.

Just to reintroduce myself: I'm a former developer advocate at Read the Docs, and I'm currently working as a data science advocate at Orchest, a company creating an open source pipeline orchestrator that we're going to see very briefly during the demo. I'm also looking into ways to accelerate the solidarity economy through technology, in case anyone is interested. I'm quitting Twitter, so feel free to join Mastodon and follow me there; I'm cross-posting in both places, and I'm happy to connect with folks on GitHub and LinkedIn as well.

First of all, the context of why we are giving this talk. At Orchest, our mission is to empower data scientists and data engineers by developing a nice, easy-to-use pipeline orchestrator. We identified a few use cases and asked ourselves whether our users were using the best tools available, in particular for manipulating tabular data. So we started focusing on dataframe libraries, and we wrote a series of blog posts which evolved into this talk today.

First of all, pandas. Who has used pandas already? Lots of people. It's very widely known, and that totally reflects the plot that we see here. For those of you who don't know it, this visualization comes from a Stack Overflow blog post from 2017 analyzing how Python skyrocketed from an ordinary language at the beginning of the 2010s to the language with the largest number of questions on Stack Overflow. If you break that down into different projects and tags, you see that Django and Flask have had a steady increase (I would be super happy to see this plot again with FastAPI or some of the newer frameworks), but the most important contributing factor to this increase has clearly been the PyData and Scientific Python ecosystem, and in particular pandas. It's everywhere, many people use it, and it's been a super successful project.

However, it has some limitations. There's a blog post called "10 Things I Hate About pandas", written by the pandas creator, that analyzes some of the bottlenecks pandas has nowadays. I don't want to go on too long, because we happen to have Jeff Reback, a longtime pandas maintainer, talking about the past, present and future of pandas, so I highly encourage you to attend his talk. But basically we can summarize these bottlenecks in two categories. On one hand, most pandas operations don't take advantage of multiple cores. This has been evolving, and many people have been re-implementing or refactoring pandas so that more and more operations use multi-core, but it's not complete by any means. And even then, pandas is eagerly evaluated: when you have a chain of operations, imagine the typical filtering, group-by, aggregation and so forth, each of these intermediate objects has to be stored in memory, and therefore there is not much room for ahead-of-time query optimization.
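As a concrete illustration of that eager-evaluation point (a toy example of my own, not from the talk):

```python
import pandas as pd

# Toy data, purely for illustration; column names are made up.
df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "price": [120.0, 80.0, 150.0]})

# Each step below runs immediately and materializes its own result in
# memory before the next step starts, so pandas never sees the chain
# as a whole and cannot reorder or fuse the operations.
filtered = df[df["price"] > 100]                    # intermediate DataFrame
by_city = filtered.groupby("city")["price"].mean()  # final aggregate
print(by_city)
```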
Then, on the other hand, some of the design decisions of pandas have affected how memory is managed. For example, the way missing data is implemented in pandas looked good at the time, but people have since argued that a masked-array approach would have been better: basically a mask that says which values are valid, plus the values themselves. That needs more memory, but it probably makes much better use of the CPU. And there are several other things, like how strings are implemented. But again, I refer you to Jeff Reback's talk, which I'm sure is going to be super interesting. I would also like to clarify that I'm not throwing shade at pandas, because I love it: it's a wonderful project, and it's not going anywhere.
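To illustrate the masked approach (my example, not the speaker's): Arrow stores a validity bitmap next to the values, which among other things means missing integers don't force a cast to floating point the way classic pandas dtypes do:

```python
import pandas as pd
import pyarrow as pa

# Classic pandas dtypes: a missing integer forces the whole column to
# float64, because NaN is the only available missing-value sentinel.
s = pd.Series([1, None, 3])
print(s.dtype)  # float64

# Arrow keeps the values buffer plus a separate validity mask, so the
# column stays int64 and the null is tracked independently.
a = pa.array([1, None, 3])
print(a.type)        # int64
print(a.null_count)  # 1
```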
But I would like to talk a little bit about the alternatives. If you are like me six months ago, you have probably heard about many different projects that are kind of like pandas, but not exactly, or that deviate in some ways. The first thing I did was try to map all these alternatives, and, inspired by the famous Gartner infographics, I created the "dataframe charming quadrangle", released under a Creative Commons license for all of you.

So we basically have this landscape. In the bottom left corner we have pandas itself, and RAPIDS, which is sort of a pandas implementation on GPUs: it's still for smallish dataframes, or dataframes that fit in RAM, but some of the operations are much, much faster thanks to leveraging the GPU. In the top left corner we have Dask (again, we have Matthew Rocklin talking about deploying Dask later in this conference, so if you're interested, check it out), which takes pandas to the distributed world: it allows us to use essentially the same pandas APIs, but with dataframes that are much larger than RAM. Then you have projects like Modin, for example, that try to make the process of switching to a distributed setting much easier; the promise of Modin is that if you change one line of code, just the import statement, then everything works. In the top right corner I have a category that I call "recovering Hadoopers", because I don't really have any experience with Apache Spark, so I don't have anything meaningful to say. It's kind of weird, because we have all these projects trying to accelerate Python for large-scale data processing, while PySpark has been there for a very long time; I think there's just a historical divide, the communities are kind of different and they overlap only a little bit, but Spark is definitely in use by lots of entities already and it's worth checking out as well. Finally, in the bottom right corner, we have newer projects that don't necessarily try to solve all your needs for huge data processing, but that deviate from the pandas API in some interesting and very meaningful ways. If you want to know a little bit more in depth about Arrow itself or Vaex, I recommend you check out another talk that I gave earlier this year, which is recorded. For the remainder of the presentation we're going to focus on Polars, which for me is one of the most interesting ones.

First, a little bit of background. What is Arrow? Who knows Arrow here? Okay, so most people. And who is aware of using Arrow at the moment? Okay, not many people using it consciously, at least. Arrow is a language-independent in-memory format, and it has bindings for many different languages: we have PyArrow in Python, which is based on the C++ implementation, but there's a Rust implementation as well that I'm going to mention in a moment, and many others. It strives to be a cross-language foundation, and the interesting thing is that by having this common in-memory representation you can transfer data between languages very easily, or stream it over a socket or something like that. It's really, really powerful. I wouldn't say it's a replacement for a dataframe library, because it's quite low level, but it's no doubt becoming the foundation for the next generation of dataframe libraries.
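As a tiny sketch (with made-up data) of what that common representation buys you: a table built with PyArrow can be handed to Polars without converting the underlying buffers:

```python
import pyarrow as pa
import polars as pl

# Build a table in one library...
arrow_table = pa.table({"Id": [1, 2, 3], "Score": [5, 0, 2]})

# ...and hand it to another: Polars wraps the same Arrow buffers
# rather than copying the data, which is what makes Arrow a viable
# cross-language (and cross-library) foundation.
df = pl.from_arrow(arrow_table)
print(df)
```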
Then there's Polars. Who has used Polars already, before this talk? A handful of people, okay, so you're in the right place. Polars describes itself as a lightning-fast in-memory dataframe library for Rust and Python. Most of the business logic is written in Rust, on top of the Rust Arrow implementation, but it has a Python layer too, and both are developed hand in hand. The latest version was released three days ago; they release lots of versions very often because the project is quite young, so I highly recommend you check the changelogs. The interesting thing, which we're going to see in the demo, is that it has a lazy expression system that decouples the computation from the data itself: you can express what operations you want to perform on the columns, and then apply those to a particular dataframe. The fact that the expressions are lazy allows Polars to optimize the query ahead of time and to use the resources of the computer as much as possible.

So that was a bit of an introduction to Polars, and I want to mention that if you're in this talk, you will probably enjoy some of the other talks happening these couple of days. As I said, there's a recording of an earlier version of this talk where I talk about Arrow and Vaex. You might have heard of DuckDB, which is an in-memory SQL database for analytics that is super interesting; if you're a SQL person, I recommend you check out a blog post that I wrote that combines DuckDB and Polars in interesting ways. Then there's a tutorial on Friday about Fugue (I don't know if the Fugue people are in the room), so make sure to check that out: it's not a dataframe library per se, but a unified interface on top of Spark, Dask and Ray. And there's Ibis too: right after this talk, in the other room, there's a talk called "Ibis: expressive analytics in Python at any scale". Make sure to check all of those out.

Okay, that was 15 minutes, so I'm on time and we have plenty of time for the demo. I'm going to verify that everything works. There we go, you can see the screen, right? Yes, okay. So I set up a tiny Orchest project to download a dataset from Kaggle that contains, I think, a one percent sample of all the Stack Overflow questions and their tags, as we're going to see. My objective is to analyze this dataset and try to figure out what the top Python questions are, meaning the questions that have Python among the tags. I'm going to make sure that all the steps are behaving correctly, which they should, hopefully. There we go: we have the data.

To not spoil the talk I removed all the outputs, but I prepared the code in advance, so this is not a super risky live demo, because the live demo gods are not always kind to me. The import is `import polars as pl`. As I said, we have a couple of CSV files. The tags one is quite manageable, but the one with the questions contains the body of each question, the title and so forth, so it's quite big: 1.8 gigabytes. The machine I'm doing this on is not particularly powerful; in fact, if I tried to load the dataset twice it would completely crash the server, and I'm not going to try that, so trust me. But this is just to prove that I don't need a super powerful machine to crunch two gigabytes of data.

Polars has a `read_csv` function, with the same name as the one we know and love from pandas, and it returns a Polars DataFrame object that has convenience methods just as in pandas. It has a `.head()` method we can use, which displays the first five rows of the dataset, and we can check that the type of `df` is a Polars DataFrame. The first thing we notice is that this looks like the representation of a dataframe that we would expect, but it shows the dtypes of each of the columns, and we're going to see later on, when we manipulate this a little bit more, that the types available in Polars are a little more sophisticated than what we have right now in pandas. I'm also going to load the tags CSV; as you can see, it only has the ID of the question and the tag itself. Naturally there can be more than one tag per question, so the ID might be repeated. In total we have 1.2 million rows for the questions and 3.7 million tags. Not too crazy, which also keeps the processing times reasonable for this demo, but it needs to be handled with care. The estimated size of this in memory is 1.8 gigabytes.

One silly thing that I really like about Polars is that you get an ASCII representation of the dataframe if you use `print`, which is super good for copy-pasting in certain cases. There's a `describe` method that gives us some information, and I can even do some simple processing, like checking which tags are most widely used globally: `value_counts` does the same as in pandas, and then I'm sorting by counts in descending order and displaying only the first one. I get that JavaScript, of course, is the tag that is most widely used, then Java, C#, PHP, Android; python is probably somewhere below. So far, so good.
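In code, this first part of the demo looks roughly like the following sketch. The file names follow the Kaggle dataset as described in the talk, and details such as the `value_counts`/`sort` keywords reflect 2022-era Polars (newer releases renamed some of them, e.g. `descending=` instead of `reverse=`):

```python
import polars as pl

# Eagerly read both files into memory.
questions = pl.read_csv("Questions.csv")
tags = pl.read_csv("Tags.csv")

questions.head()            # first five rows, dtypes shown in the header
print(questions)            # plain-text table, handy for copy-pasting
questions.estimated_size()  # approximate memory usage, in bytes
questions.describe()        # summary statistics per column

# The most widely used tag overall.
tags["Tag"].value_counts().sort("counts", reverse=True).head(1)
```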
Then, the interesting thing about Polars, as I said, is the expression system, so we're going to check that out. The idea is that we have this `pl.col` function that represents a column of a dataframe, without actually using the dataframe: as you can see here, my dataframe is `df`, but in this cell I'm not using `df` anywhere. So this is a generic representation of, say, the mean of the score, and I can then apply it to any dataframe that I have. The way I do that is with the `select` method of the dataframe itself: I pass the expression to `select` and it gives me the result. The mean score of all the questions is 1.78, which is not unexpected: most of the questions have like one upvote because nobody saw them, some of them have two, and there's a long tail of scores.

Interestingly, when I pass a list of expressions, all of them are computed in parallel. Here I have, for example, the number of unique IDs and the mean score again, but also the maximum length of all the titles. Since all these expressions are independent from each other, because I don't need the length of the titles to compute the number of unique IDs, Polars is going to leverage all the cores of the machine to make this as fast as possible. I can even do something like "select all the columns of type UTF-8, compute the string lengths, and give me the maximum of that", so I get the maximum length for every UTF-8 column, which is quite powerful.
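Those expression examples look roughly like this (a sketch, not the speaker's exact notebook; `.str.lengths` has since been renamed in newer Polars versions):

```python
import polars as pl

questions = pl.read_csv("Questions.csv")

# An expression: a description of a computation, not tied to any
# particular DataFrame yet.
mean_score = pl.col("Score").mean()

# Applying it to a concrete frame with select.
questions.select(mean_score)

# A list of independent expressions is evaluated in parallel across
# the available cores.
questions.select(
    [
        pl.col("Id").n_unique(),
        pl.col("Score").mean(),
        pl.col("Title").str.lengths().max(),
    ]
)

# Expressions can also target columns by dtype: the maximum string
# length of every UTF-8 column.
questions.select(pl.col(pl.Utf8).str.lengths().max())
```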
Sorry, I was a little bit too fast. Any questions or comments so far? Yes? [A question about whether Polars has stricter column types.] Exactly, this is what I'm going to show in a moment. The idea is that, as you say, pandas' `object` dtype is quite generic: most of the time it contains a string, but it could be a list of numbers or virtually any other object, so it's sometimes not really useful. Thanks to Arrow, Polars can identify, for example, that a given column is a list of strings, or even a struct, because it has the notion of objects with different properties within a given column. So we definitely have more granularity, and we're going to see an example of that in a moment. Thank you very much.

Yes? Can you do windowed operations? I think so, yes. I don't remember that by heart, but okay, this is worth checking: there's a wonderful Polars book, sort of a long-form user guide, and it has a section on window functions that is quite useful: it contains lots of examples and walks you through several things you might want to check out with Polars. It has a reference guide as well that contains all the methods, all the functions and so forth; that one is a little bit difficult to navigate, and they live on different websites, by the way. So, to answer your question: yes, we have window functions, windowed aggregations and so forth. Anything else?

Okay, so let me continue. This is where I'm going to start showing some more interesting things, because, and I don't know if this happens to you, in pandas, especially when you're reading JSON data, for example, or doing certain aggregations and groupings, you can end up with a column that contains lists of elements, and it's not very handy to work with that in pandas; you have to use the `.str` methods, which is not very intuitive either. It turns out that in Polars there's a dedicated namespace for arrays and lists of objects: you have the `.str` methods, as I was showing with the lengths, but you also have `.arr` methods, like `.arr.lengths`, which you can use on a column that happens to contain lists.

But before I get there, let's talk a little bit about lazy evaluation, because the questions dataframe is quite large, as I showed. It turns out that if I try to join the questions dataset with the tags, to know which tags correspond to each question, I'm going to run out of RAM. I'm not going to try it, because it would ruin my presentation forever, but you can try this at home if you have some spare time. So what would be the way to do this? The answer is that, by default, Polars uses eager evaluation, so every intermediate step returns a Polars object, but it also has a lazy evaluation mode that is quite easy to use: on a dataframe that is already in memory, you call the `.lazy()` method, and all the operations you do after that become lazy, so they're not evaluated just yet. In this case I'm joining with a lazy version of the tags dataframe on the Id column, and I'm doing a semi join, which is a magical thing that I didn't see in pandas: it does something like an inner join but discards the extra columns. I still don't fully understand how it works, but it feels great. I'm filtering only those rows where the lowercase tag contains the word "python", so I'm keeping only those questions, and sorting by Id ascending. To go from a lazy chain of operations to the actual dataframe, you call `.collect()` at the end, which is quite similar to how one would do things in Dask. And you end up with this: I do have the questions here, and since I'm doing a semi join, the tag column doesn't actually appear, but I'm keeping only the questions that have a matching tag.

I left a note to myself to restart the kernel here, because otherwise I will run out of memory. So I'm restarting, and instead of reading the whole questions dataframe into memory, I'm going to do something else. Rather than `read_csv` followed by `.lazy()`, which lets me write a lazy chain of operations but on a dataframe that already exists in memory, I'm going to use `scan_csv`. `scan_csv` does not read the whole dataframe into memory: it just points to it, and returns a lazy dataframe instead of an in-memory one. So it's better than calling `read_csv().lazy()`, because I'm not actually reading the file completely. I'm essentially doing the same operations, but storing the chain of operations in a separate variable, and the cool thing about that variable is that you can also visualize (when this is done, after a few seconds) how Polars is optimizing the chain of operations internally. You end up with plots like this. I'm not sure why the Polars authors wrote all these mathematical symbols at the beginning; I guess it means the number of partitions or something like that, I haven't done a lot of research on it. But at least we see that we're joining two dataframes, then filtering, then sorting and so forth, so if our chain of operations becomes too complex, Polars will hopefully be able to merge some of those operations and optimize them somehow.
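A rough reconstruction of that lazy pipeline (my sketch, not the actual notebook code: in particular I filter the tags before the semi join, since after a semi join the tag column is no longer available):

```python
import polars as pl

# scan_csv only inspects the files and returns LazyFrames; nothing is
# loaded into memory yet.
questions = pl.scan_csv("Questions.csv")
tags = pl.scan_csv("Tags.csv")

python_questions = (
    questions.join(
        # Keep the tags mentioning python, then semi-join: questions
        # with at least one matching tag survive, but no tag columns
        # are added to the result.
        tags.filter(pl.col("Tag").str.to_lowercase().str.contains("python")),
        on="Id",
        how="semi",
    )
    .sort("Id")
)

# Inspect the optimized query plan (needs graphviz installed)...
# python_questions.show_graph()

# ...and only now execute the whole pipeline.
df = python_questions.collect()
```

Because nothing executes until `collect()`, Polars is free to push the filter down and prune unused columns across the whole plan, which is precisely the ahead-of-time optimization that eager execution rules out.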
What time is it? Okay, we have 10 minutes left. So now, to the original question that I asked myself: what are the top Python questions? Well, it turns out I have more tags than questions, so I had to join this somehow, and what I'm going to do is turn the rows of the tags into a single row per question, where I have the list of tags, like this. The way I'm doing that is with `pl.col("Tag").list()`: I'm grouping by Id, and for each Id group I'm turning the tags into an actual list. So rather than aggregating, I'm concatenating all these tags to come up with a tag list column. And as I was saying a few minutes ago, you can see that the dtype of that column is list of strings: Polars knows that this is not just an object, it's a list of strings, and it has its own set of operations.

After I have the list of tags, and I was kind of amazed that I could do this, I have this `.arr.eval` that takes an expression and evaluates it for each of the elements of the list, for each of the rows. For every element I'm asking whether its lowercase version contains "python", and finally whether any of those is true. So I'm saying: for every row, tell me whether it has a tag that contains the substring "python", because there could be python-3 or python-something-something, so it's not enough to check for equality. And it turns out that it works: these ones don't have python in their list of tags, but "python bash multiline" does have it, and so forth. Then I shared this with Ritchie, the author of Polars, on the Discord server, and he told me: no, that's too complex, you can do it in two lines, without the tag list intermediate step. I left the other code in anyway, because I think it's beautiful that you can do that, but yes: you can have the tag list, ask for every tag whether it contains the substring "python", and turn that into another column, with the same group-by and aggregation we were using before.

And I think this is almost the end, yes. Now I'm repeating the `scan_csv` again, joining with the tag list that tells me whether a question has a tag containing the word python, filtering on that, then sorting by score and limiting to the first thousand. At the very end, when this is done, I select all the columns except the body, like I'm doing here, and display the most highly voted Python questions. Let's give it a few more seconds, hoping that the machine doesn't die... there we go. We have all the questions containing python here, and I'm going to increase the display width a little bit with a config setting. "How do I randomly select an item from a list?" happens to be the most highly voted Python question, then "Manually raising (throwing) an exception in Python". These are quite old, 2008, 2010, which makes total sense: the older questions have had more time to accumulate upvotes. But yeah, that's it. I'm happy that nothing broke.
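Putting those last steps together, roughly (again a sketch against 2022-era Polars; `groupby`, `.list()` and `reverse=` have all been renamed in later releases, and the exact `.arr` chain is my best reconstruction of what was shown):

```python
import polars as pl

tags = pl.scan_csv("Tags.csv")

# One row per question, with all of its tags gathered into a single
# list column (dtype List[Utf8]).
tag_lists = tags.groupby("Id").agg(pl.col("Tag").list().alias("TagList"))

# The "long" version shown first in the talk: evaluate an expression
# against every element of each list, then check whether any matched.
has_python = tag_lists.with_column(
    pl.col("TagList")
    .arr.eval(pl.element().str.to_lowercase().str.contains("python"))
    .arr.contains(True)
    .alias("HasPython")
)

# The shorter version suggested by the Polars author: compute the
# flag directly inside the aggregation instead.
has_python = tags.groupby("Id").agg(
    [
        pl.col("Tag").list().alias("TagList"),
        pl.col("Tag").str.to_lowercase().str.contains("python").any().alias("HasPython"),
    ]
)

# Finally: join back to the questions, keep the python ones, and take
# the 1000 most highly voted, dropping the (heavy) Body column.
top_python = (
    pl.scan_csv("Questions.csv")
    .join(has_python, on="Id")
    .filter(pl.col("HasPython"))
    .sort("Score", reverse=True)
    .limit(1000)
    .select(pl.exclude("Body"))
    .collect()
)
```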
Is there any follow-up question before I move to the conclusions? Yes: why use lazy evaluation here? To not run out of memory, basically. Reading the whole questions dataframe into memory is possible, I have slightly more than two gigabytes of RAM, but when joining it with something else it was crashing on me, so I'm using lazy evaluation to go beyond what the RAM of my machine would allow. And one follow-up to that: it turns out that about a month ago Ritchie implemented what he calls streaming support, and managed to process something like 80 gigabytes of CSV on a machine with 16 gigabytes of RAM, which is impressive. I didn't want to test that, because I think it would be too dangerous for a live demo, but apparently it worked, and I'm looking forward to using it.

Yes: what about Dask? Maybe it's closer than you think. This is a screenshot that I took today: nine days ago, in an issue that someone opened asking "what about Dask?", Matthew Rocklin said, well, we have a prototype of Dask on top of Polars, and he's basically sourcing feedback from the community to understand the use cases and so forth. The key point here is that Polars started as an in-memory dataframe library, very similar to pandas; by now it has gone a little bit beyond that, so we can do out-of-core computation and therefore load dataframes that are larger than RAM. But to take this to a huge distributed setting, you will probably need some orchestration and things like that, and I would say that Dask plus Polars is going to be the answer we're looking for. Hope that answers the question, kind of.

Yes: could you work on a random sample instead? For some aggregate statistics, like the median score or something like that, it would be perfectly fine to sample in a smart way, though I'm not a statistician, so I would probably do it wrong. But in this case, since I'm trying to get the most highly voted questions, I had to use the whole dataframe anyway, unless I'm missing something. So for the use cases where you actually need all the data, resorting to an alternative that is a little bit faster can go a long way.

Yes: when would you use the eager API? When you're sure that the dataframes you're using are small. For example, I was doing a small analysis at home of some data that I scraped from my website, and it was a bunch of rows, hundreds of rows or something like that, and I didn't bother doing the `.lazy()` / `.collect()` dance, because for small data eager is also slightly faster, and probably easier to debug as well; I expect the errors to be less cryptic. Apart from that, when you know the dataset is going to be huge, I would probably use the lazy interface every time.

[On debugging:] Say the first three commands are successful and then the fourth one fails. If you're using lazy evaluation, I don't know exactly how that would surface; if you're using the eager one, you can sort of bisect where your problems are coming from. There's also a way to tell Polars not to optimize the query plan, just in case there happens to be some bottleneck or mistake in the optimization. So there are a couple of escape hatches, but I haven't used this in production yet.

Yes: is there a SQL interface? I don't remember whether there's a direct SQL interface, but you can connect to several databases using the ConnectorX library or psycopg2. And if you want to run SQL on a file, I definitely recommend you check out DuckDB; that's exactly what you're looking for, I think.

I think I've run out of time already, so, real quick: what about the index? There's no index anywhere. James Powell is upset, and I can understand. Does it matter? I'm not sure, but it's worth discussing anyway. So: pandas is okay, but Polars rocks. Thank you very much, and happy to connect, as I said. Thank you everybody, and see you around.
Info
Channel: PyData
Views: 176
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, learn, software, python 3, Julia, coding, learn to code, how to program, scientific programming
Id: LGAHTp4DYZY
Length: 38min 15sec (2295 seconds)
Published: Tue Jan 24 2023