Polars: The Next Big Python Data Science Library... written in RUST?

Video Statistics and Information

Captions
If you've watched any of my videos before, you probably know I'm a big fan of pandas, which is a Python package for working with and manipulating data. At the same time, I'm always interested in learning about the newest technology out there in data science. So what if I was to tell you there's a Python package out there that uses DataFrames just like pandas, but it's orders of magnitude faster? I mean, we've got to check it out, right? Now, in a previous video I did a review of some pandas alternatives, and the results were mixed at best, but there were a few people in the comments who mentioned Polars. So in this video we're going to talk about Polars, the Python package. We'll talk about what it is and how it was designed, and then we'll talk a bit about the syntax of Polars and how you'd write Polars code. And make sure you stick around to the end of the video, where I'll talk about how I feel Polars stacks up to something like pandas, and whether it's a game changer.

So what exactly is Polars, and what makes it different from something like pandas? One of the key things is that Polars is completely written in Rust, but you don't need to write Rust code to interact with it, because it has a Python package that offers a way to work with your data that is very similar to pandas. They say the goal of Polars is to provide a lightning-fast DataFrame library that utilizes all available cores on your machine, unlike tools such as Dask, which try to parallelize existing single-threaded libraries like NumPy and pandas. So basically what they're saying is that pandas is a very important data science library, but one of its biggest downsides is that it doesn't natively parallelize processing across the cores on your computer. Then there are tools like Dask that take existing libraries like pandas and try to make them parallelizable. Instead of being built on top of an existing library, Polars is built from the ground up in Rust.

Polars is lazy and semi-lazy: it allows you to do most of your work eagerly, similar to pandas, but it also provides a powerful expression syntax that will be optimized and executed within the query engine. In other words, Polars has two different APIs, an eager API and a lazy API. Eager execution is similar to pandas, where the code is run as soon as it is encountered and the results are returned immediately. Lazy execution, on the other hand, is not run until you actually need the result. Lazy execution can be more efficient because it avoids running unnecessary code, which can lead to better performance.

To get started with Polars, you just need to pip install it, and I found that it installed pretty quickly and easily. Now that we have Polars installed, I'm going to import polars as pl, and let's go ahead and print the Polars version too. So let's get an idea of how to read in a DataFrame with Polars. A lot of this is similar to pandas, so we'll use pl.read_csv. Similar to pandas, if we show this DataFrame in our Jupyter notebook we see a rendered version of it with all the columns and data, and notice that it also displays the data type of each column. Now let's say we wanted to filter this DataFrame. We call df.filter, which is similar to pandas' query, and when we filter we have to identify the column with pl.col. So let's take the sepal length and filter to where it's greater than five, and now you can see we've filtered down to only the rows with sepal lengths greater than five. But let's do some more: let's do a group by on the species, with maintain_order set to true, and then aggregate using pl.all with a sum aggregation, and then print this filtered DataFrame. Just for comparison, here's what the same code would look like written in pandas, and you can see the results are exactly the same. So the syntax is a little different with Polars, but you still get the same result.
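To make that eager workflow concrete, here is a minimal sketch, assuming the classic iris file and column names (iris.csv, sepal_length, species) that the video appears to use, and the group_by spelling of recent Polars releases (older releases spell it groupby):

```python
import polars as pl

print(pl.__version__)  # check which Polars release is installed

# Assumed file and column names: the classic iris dataset
df = pl.read_csv("iris.csv")

# Eager filtering: pl.col() refers to a column inside an expression
filtered = df.filter(pl.col("sepal_length") > 5)

# Group by species and sum every remaining column
out = (
    filtered
    .group_by("species", maintain_order=True)  # spelled groupby in older releases
    .agg(pl.all().sum())
)
print(out)
```

The pandas version of the same thing is roughly df[df["sepal_length"] > 5].groupby("species").sum(), which is the side-by-side comparison the video draws.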
Next, let's talk about lazy execution. We start by reading the CSV, but then we call .lazy() on it to make sure we're doing lazy execution. We do the same filter, and finally the aggregation. Now if I run this, we don't actually see the result; what we see instead is a graph showing how the execution is planned to take place. We just need to run .collect() at the end of the query, and then it shows us the result.

The building blocks of Polars are very similar to what we see in pandas: we have the Polars Series and DataFrame. You can see here a random Polars Series, or you can create a DataFrame object directly from data you provide. More commonly, though, you'll be reading from and writing to files, and Polars supports all the major formats, many of them the same ones you'd see with pandas, including Feather and Parquet files. Here's an example where I'm using read_parquet to read a Parquet file with some flight data. Once we have a DataFrame like this, we can run commands like head, tail, and describe, which shows information about each column, and we can also sample it to get some random rows.

Where Polars is a little different, as you saw before, is that if we want to select a specific column, we need to use the select method on the DataFrame and give it the column names we want. We can use a star to select everything, or, say, select just the origin column. If we want multiple columns, we provide a list of columns to select, and if we want everything except certain columns, we can use pl.exclude to exclude them.

Filtering is how we narrow the rows of our DataFrame down to a subset of the data. So let's take our flights DataFrame and apply a filter: we take the flight date column and run is_between, giving it two dates, and you can see the DataFrame is now filtered to the rows whose dates fall between the two dates we provided. Or say we want to filter on a value in a column: let's take the departure delay column and filter to where it's greater than 15 and where the origin column is equal to San Francisco. That shows you how the filtering syntax looks.

Creating a new column in Polars is a little different from what you might be used to with pandas. We use the with_columns method on the DataFrame to create new columns; this is similar to assign in pandas. Inside it we pass a list of expressions over existing columns, for example taking the average delay and aliasing it as the average delay, or checking whether the delay is greater than 15 and calling that a long departure delay. If we run head on the result, we can see the new columns: one has been assigned the average value of the delay column, and the other is true or false depending on whether the delay is greater than 15.
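Here is a hedged sketch that pulls those selection, filtering, and column-creation pieces together, plus a lazy version of a query. The file path (flights.parquet) and the column names (FlightDate, DepDelayMinutes, Origin) are assumptions standing in for whatever the real flight dataset uses, and the date filter assumes FlightDate has been parsed as a Date column:

```python
from datetime import date

import polars as pl

# Assumed path and column names for the flight dataset
flights = pl.read_parquet("flights.parquet")

flights.head()      # first rows
flights.describe()  # summary statistics per column
flights.sample(5)   # five random rows

# Column selection goes through .select()
flights.select(pl.col("*"))                    # everything
flights.select("Origin")                       # a single column
flights.select(["Origin", "DepDelayMinutes"])  # several columns
flights.select(pl.exclude("Origin"))           # everything except Origin

# Row filtering with expressions
flights.filter(pl.col("FlightDate").is_between(date(2022, 1, 1), date(2022, 1, 7)))
flights.filter((pl.col("DepDelayMinutes") > 15) & (pl.col("Origin") == "SFO"))

# New columns via with_columns (roughly pandas' assign)
flights = flights.with_columns([
    pl.col("DepDelayMinutes").mean().alias("avg_dep_delay"),   # broadcast scalar mean
    (pl.col("DepDelayMinutes") > 15).alias("long_dep_delay"),  # boolean flag per row
])

# The same style of query can be written lazily: nothing runs until .collect()
lazy_out = (
    pl.scan_parquet("flights.parquet")
    .filter(pl.col("DepDelayMinutes") > 15)
    .group_by("Origin")  # spelled groupby in older releases
    .agg(pl.col("DepDelayMinutes").mean().alias("avg_dep_delay"))
    .collect()
)
print(lazy_out)
```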
We can also do group-by aggregations, using the group_by method. So let's group by the airline and then call agg to aggregate: let's take the departure delay minutes, get the average, and alias it as the average departure delay, and let's also take the arrival time and get its max. You can see we now have every airline with its average departure delay and max arrival time. I've found that sometimes the print command is a little better than the rendered version for looking at these results.

We can also combine DataFrames, either by stacking them or by doing some sort of merge. In pandas you would do something like pd.concat or pd.merge; let's see how we do it in Polars. Here I have two sample DataFrames taken from their tutorial. If I want to merge, or join, these two DataFrames, I just use the join method on the first DataFrame with the second one, and similar to pandas we can say left_on "a" and right_on "x", and there we have our merged DataFrame. This method also lets you give it a join strategy like left or inner. And if we want to stack these DataFrames side by side, we use Polars' concat. This is just like pandas, but we provide how as "horizontal", which is similar to the axis you would concatenate on in pandas.

Now, there are various Polars expressions we're not going to go into, many of which overlap with what pandas has. You can see here we could take the airline column and run unique on it, or get the number of unique values. Polars also has value_counts, which is very handy. And the thing about these expressions is that you can pipe them together, which speeds things up. This is a bit like how you can chain your commands in pandas; Polars does that similarly, and with lazy execution it can be a lot faster.
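A compact sketch of the grouping, joining, and expression examples just described. The flight column names are the same assumptions as before, and the two small frames are made up to stand in for the tutorial's sample data:

```python
import polars as pl

# Group-by aggregation with aliased results (assumed flight-style column names)
flights = pl.read_parquet("flights.parquet")
summary = flights.group_by("Airline").agg([
    pl.col("DepDelayMinutes").mean().alias("avg_dep_delay"),
    pl.col("ArrTime").max().alias("max_arr_time"),
])
print(summary)  # print() is sometimes easier to scan than the rendered table

# Joining and horizontally concatenating two small made-up frames
df1 = pl.DataFrame({"a": [1, 2, 3], "b": ["one", "two", "three"]})
df2 = pl.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

joined = df1.join(df2, left_on="a", right_on="x", how="inner")
stacked = pl.concat([df1, df2], how="horizontal")  # side by side, like axis=1 in pandas

# A few everyday expressions
flights.select(pl.col("Airline").unique())    # distinct values
flights.select(pl.col("Airline").n_unique())  # count of distinct values
flights["Airline"].value_counts()             # frequency table
```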
Lastly, let's touch on why Polars can be so much faster than pandas. It comes down to the way it processes tabular data, and specifically how it uses the split-apply-combine approach to parallelize the work your operations require. In the example they show in their tutorial, if you're doing a group-by aggregation on one of the columns in your data set, the data is split into groups and your aggregation is applied to each group, and that apply step is embarrassingly parallel, meaning the pieces can be computed completely independently; with more threads available, it can be made a lot faster. If you want to read more about how Polars achieves this, their documentation goes into detail, but essentially it illustrates how they use a smarter approach to multi-threading this processing.

So this is all great, but the main reason you'd want to switch from something like pandas to Polars is speed, so let's test out a speed comparison. I'm going to read in the flight data set we were using before in pandas, do a group by on the airline, and take the mean, min, and max of the departure delay column and the arrival delay column. The results look like this. We'll add the timeit magic to the top of the cell; it executes the cell a few times and gives us some metrics on how long it took. Here you can see it ran seven times and averaged 2.7 seconds each time. Now here's the same aggregation written in Polars. You can see that we had to select the columns and then alias them to new column names, but the aggregated results should look pretty much the same. So let's run timeit on this cell and see how fast it is. There's no doubt that the aggregation in Polars is much faster: a little over half a second on average, compared to the almost three seconds it took pandas. And you don't have to take my word for it: this is a benchmark comparison of a bunch of different queries they ran, both reading from a file and operating on data already in memory. Lower is better, and Polars is, in almost every case, much faster than the alternatives.

So there you have a very basic overview of Polars. As someone who uses pandas in my daily work, I've become very dependent on it for data manipulation and data wrangling, and of all the pandas alternatives I've looked at so far, I must say that Polars is the most impressive. It's optimized so that it can run much faster than pandas code in memory when you're working with multiple CPUs. The downside of switching to Polars is that you'd have to learn the new syntax, and to get the full speed advantage you do need to understand how lazy execution works and write your code in a way that takes advantage of it. Also, Polars is mainly for building data pipelines; it doesn't have a lot of the functionality that pandas has for data exploration, like plotting, which is a game changer for me. But I'd say it's definitely worth learning, and if you have some really intense data processing to write, maybe consider using Polars instead of pandas if speed is your main priority. Thanks for watching this video. Let me know what you think of Polars in the comments below; it helps out the algorithm. Also, like and subscribe; it's completely free and it helps me out a lot, so I'd really appreciate it. I'll see you all in the next video.
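As a footnote to the speed comparison described in the captions, here is a rough sketch of what that benchmark might look like outside a notebook, using a single-run timer in place of the %%timeit magic. The file path and column names are assumptions, pandas needs a Parquet engine such as pyarrow installed, and the absolute timings will vary by machine and dataset size:

```python
import time

import pandas as pd
import polars as pl

PATH = "flights.parquet"  # assumed location of the flight data

# pandas version: group by airline, aggregate two delay columns
pdf = pd.read_parquet(PATH)
start = time.perf_counter()
pdf.groupby("Airline")[["DepDelayMinutes", "ArrDelayMinutes"]].agg(["mean", "min", "max"])
print(f"pandas: {time.perf_counter() - start:.2f} s")

# Polars version of the same aggregation (group_by is spelled groupby in older releases)
pldf = pl.read_parquet(PATH)
start = time.perf_counter()
pldf.group_by("Airline").agg([
    pl.col("DepDelayMinutes").mean().alias("dep_mean"),
    pl.col("DepDelayMinutes").min().alias("dep_min"),
    pl.col("DepDelayMinutes").max().alias("dep_max"),
    pl.col("ArrDelayMinutes").mean().alias("arr_mean"),
    pl.col("ArrDelayMinutes").min().alias("arr_min"),
    pl.col("ArrDelayMinutes").max().alias("arr_max"),
])
print(f"polars: {time.perf_counter() - start:.2f} s")
```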
Info
Channel: Rob Mulla
Views: 166,496
Keywords: polars python, polars, polars dataframe, pandas python, python data, dataframe rust, polars rust, polars data, what is polars, coding in polars, lazy evaluation, faster pandas, speed up pandas code, speed up python code, python data science project, python data science course, python data science, fast data pipelines, data pipeline, rob mulla, speed up python code cython, speed up python, speed up pandas, coding with polars, lazy evaluation python, python dataclass
Id: VHqn7ufiilE
Length: 14min 12sec (852 seconds)
Published: Thu Dec 29 2022