The BEST library for building Data Pipelines...

Video Statistics and Information

Captions
If you're a data scientist or data engineer, or just someone looking to improve your skill set for working with data in Python, now more than ever there are more options for working with that data. You have pandas, the tried and true data analysis software; Spark, which is extremely popular for working with large data sets, especially at big companies; and newer packages like Polars, which lets you do similar things but is optimized for speed. So with all these options, which one's the best package to use? Well, the answer isn't so simple, because just like with many things, it's not about picking what's best, it's about picking what's best for the job. Think of an expert carpenter: they have specific tools for each job, everything from a small screwdriver to humongous power tools. So in this video we're going to do some comparison of pandas, Spark, and Polars. Not only are we going to look at the differences in the syntax and the way we use these libraries, but we'll also talk a little bit about what use cases we'd want to use each in. Okay, I hope you're excited. Let's go ahead and take a look at our data set.

So here I am in a VS Code Jupyter instance where I've imported pandas, Polars, and PySpark. The great thing about all of these packages is that they were super easy to install; I just had to pip install them into my Python environment. I also have this flights file, which is a parquet file with a fairly large data set. Let's take a look at it: we can see that on my hard disk this file is 1.1 gigabytes in size. That's a pretty big file, but not too huge, and keep in mind parquet compresses the data. If I use read_parquet from pandas and read our flights file, we can see it already takes a while just to load this data into memory. Now if I run df.info() on this, we can see some information about the DataFrame, like that it's almost 30 million rows, and note that even though the file was one gigabyte on my hard drive, now that it's opened up in memory it's about 13 gigabytes. I'll also quickly run a head command so you can see some of the data fields, and we're going to be using these delay columns to calculate some aggregations of both the arrival and the departure delays for each airline.

Because this data is so large and we're going to do some time comparisons, I'm going to be restarting the kernel a good bit. So let's start out by reading this in again with pandas and doing that transformation. After restarting my kernel, what I want to do is read in this file, do some aggregations, and call the result df_agg: we're going to group by the airline and the year, take the two delay columns, the departure delay and the arrival delay, and calculate some aggregations, specifically the mean, the sum, and the max. Let's run this, and we can see that the aggregation has given us the mean, sum, and max of the departure and arrival delays. One other thing I want to do is reset the index, and then we'll save this result as a parquet file as our aggregation file. One thing we need to keep in mind about pandas is that the data must be small enough to fit into the computer's memory; otherwise pandas is not a good option. So that's always something to keep in mind, regardless of speed: the size of your data and the resources you have on your machine work together to help you decide which tool you're going to use.
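Here is a minimal sketch of the kind of pandas pipeline described above. The exact schema isn't shown in the transcript, so the file name and the column names (Airline, Year, DepDelay, ArrDelay) are assumed placeholders:

```python
import pandas as pd

# Load the full flights file into memory -- pandas is eager, so the ~1 GB
# compressed parquet file expands to several GB of RAM once decompressed.
df = pd.read_parquet("flights.parquet")  # assumed file name

# Group by airline and year, then take mean/sum/max of both delay columns.
df_agg = df.groupby(["Airline", "Year"])[["DepDelay", "ArrDelay"]].agg(
    ["mean", "sum", "max"]
)

# Flatten the resulting MultiIndex columns so the frame writes cleanly to
# parquet, then reset the index and save the (much smaller) result.
df_agg.columns = ["_".join(col) for col in df_agg.columns]
df_agg = df_agg.reset_index()
df_agg.to_parquet("pandas_agg.parquet")
```

To time it the way the video does, the whole cell can be run under Jupyter's `%%timeit` cell magic.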
Now that that's done running, let's read the parquet file back in just to make sure we have the results we'd want, and we do. When we run this using the other packages, the output should be the exact same size and we should get these same numbers. But before we get too far with that, I'm going to restart my kernel one more time and run %%timeit on this entire cell; this will execute these commands a few times and then show us the average time it took. Our timeit is done running, and you can see it took seven runs and averaged about 16.7 seconds each time. We can also ls the output file just to see what the size looks like; the result is pretty small, but it took a lot of computation to get there.

Okay, let's try Polars next. Polars is very similar to pandas in its ability to do these aggregations and manipulations of data, but it's written mainly for speed. Instead of read_parquet, let's use scan_parquet for Polars. This is one of the benefits of Polars: scan_parquet creates a lazy DataFrame, which won't actually do the computation until the end, when we run collect, and because of that it can optimize the computation it's doing. We give it our flights file, then we do our group-by on airline and year, and then we aggregate from there. In the aggregation we actually provide a list, and each aggregation looks like this: we call pl.col on each column, then specify the aggregation type, average here, and we have to rename it to a new column name, so we use .alias to do that. Here it is all put together: I've done all the different aggregations, renamed them, and we call collect at the end. I've executed that cell, and we can see the result; these values should be pretty much the same as what we saw when we ran it in pandas. I'm going to restart the Python session one more time, add a write_parquet to the end, and run %%timeit on this to see how long it takes. This pipeline ran in 1.67 seconds, which is a lot faster than the pandas version, and since everything fits into memory, Polars can handle it just like pandas. If we ls the output file, we can see it's slightly smaller, and I think that might be because of the data types of each column after aggregation.
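A comparable Polars version written as a lazy query might look like the sketch below. Again, the file and column names are assumptions, and note that older Polars releases (around the time of this video) spell the method `groupby` rather than `group_by`:

```python
import polars as pl

# scan_parquet builds a LazyFrame -- no data is read and nothing is computed
# until .collect(), which lets Polars optimize the whole query plan.
lazy_df = pl.scan_parquet("flights.parquet")  # assumed file name

df_agg = (
    lazy_df.group_by(["Airline", "Year"])  # `groupby` in older Polars versions
    .agg(
        [
            pl.col("DepDelay").mean().alias("dep_delay_mean"),
            pl.col("DepDelay").sum().alias("dep_delay_sum"),
            pl.col("DepDelay").max().alias("dep_delay_max"),
            pl.col("ArrDelay").mean().alias("arr_delay_mean"),
            pl.col("ArrDelay").sum().alias("arr_delay_sum"),
            pl.col("ArrDelay").max().alias("arr_delay_max"),
        ]
    )
    .collect()  # the optimized pipeline actually runs here
)

df_agg.write_parquet("polars_agg.parquet")
```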
All right, next we're going to try out PySpark. For PySpark we have to import SparkSession from pyspark.sql; this will create a PySpark instance running in the background, and we create it using SparkSession.builder, giving it an app name of "airline_example". Now that we've created the Spark session, it's essentially our entry point for interacting with Spark from Python. The interesting thing about Spark is that it can work across multiple nodes, and this is where it becomes really powerful: working on one machine may not be faster, but the fact that it can scale up to many, many machines makes running on huge data sets possible, and that's something you can't get with pandas or Polars, at least not very easily. Another thing I want to show you: now that I have a Spark session running, I can go to localhost port 4040 on this machine and see a UI that shows all the jobs running on this Spark session, how long they took, and a bunch of other stats, which can be really helpful for large data pipelines.

You'll also see that from pyspark.sql.functions I've imported our aggregation types: avg, max, and sum. That's because we have to use these PySpark functions in order to execute the aggregation. PySpark executes lazily, similar to Polars' lazy DataFrame, so this won't actually run until our final action, which here is writing to a parquet file. You can see the syntax has some similarities to Polars in that we have to aggregate and then rename with an alias, but it's important to note that avg, sum, and max were imported from PySpark's functions; these aren't the built-in Python functions. I'm going to execute this cell, and in the user interface we can see the stages of the job being run; it also lists the completed jobs once they're done. Now if we ls this output, we'll see that Spark saves it as a folder; what this basically is is a Hadoop-style partitioned parquet file, and we can read it as if it were just a file. Just to be fair, I'm going to put everything in a %%timeit cell and restart the session. Note that I had to add the overwrite mode when writing the parquet file so it can be overwritten each time timeit runs it. While that was running, you can look at the UI and see all the different jobs running, and now that it's done, you can see it ran in about five seconds on average, which was faster than pandas for sure, but not as fast as Polars. Remember that we're working with data that can fit into one computer's memory, and in those cases Spark is usually not the best approach.

Now, one of the coolest things about using Spark is that you can run this whole pipeline without writing any Spark code at all, and instead write normal SQL like you would on a relational database. To show this as an example, I'm using our Spark session to create a temporary view called flights, which essentially points to the flights parquet file we had been reading from before. Now I can write normal SQL against this flights table to create our results, and it looks like this; if you're familiar with writing SQL, these kinds of GROUP BY statements are second nature. All I have to do is take the SQL I wrote, write it as a Python string, and pass it into spark.sql, and then I can chain my commands like before, writing the output as this temp Spark SQL file; we can see the files on disk as a partitioned parquet file like we had before. So one last time I'm going to restart my kernel and add the %%timeit. I realized I didn't add the year to my GROUP BY, so I'm rerunning it with that added back. Over in the Spark UI we can see all the executions running, and it took about 4.75 seconds per run.
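A rough PySpark sketch of both approaches, the DataFrame API and the SQL version, under the same assumed file and column names:

```python
from pyspark.sql import SparkSession
# avg/sum/max here are PySpark column functions, not the Python built-ins;
# aliasing max/sum on import avoids shadowing the built-ins of the same name.
from pyspark.sql.functions import avg, max as smax, sum as ssum

# Entry point for talking to Spark; the web UI shows up at localhost:4040.
spark = SparkSession.builder.appName("airline_example").getOrCreate()

flights = spark.read.parquet("flights.parquet")  # assumed file name

# DataFrame API version -- nothing runs until the write action below.
agg_df = flights.groupBy("Airline", "Year").agg(
    avg("DepDelay").alias("dep_delay_mean"),
    ssum("DepDelay").alias("dep_delay_sum"),
    smax("DepDelay").alias("dep_delay_max"),
    avg("ArrDelay").alias("arr_delay_mean"),
    ssum("ArrDelay").alias("arr_delay_sum"),
    smax("ArrDelay").alias("arr_delay_max"),
)
# Spark writes a partitioned parquet *directory*; overwrite lets timeit rerun it.
agg_df.write.mode("overwrite").parquet("spark_agg.parquet")

# SQL version -- register a temp view and express the same pipeline as a query.
flights.createOrReplaceTempView("flights")
spark.sql(
    """
    SELECT Airline, Year,
           AVG(DepDelay) AS dep_delay_mean,
           SUM(DepDelay) AS dep_delay_sum,
           MAX(DepDelay) AS dep_delay_max,
           AVG(ArrDelay) AS arr_delay_mean,
           SUM(ArrDelay) AS arr_delay_sum,
           MAX(ArrDelay) AS arr_delay_max
    FROM flights
    GROUP BY Airline, Year
    """
).write.mode("overwrite").parquet("spark_sql_agg.parquet")
```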
Finally, just to make sure all the results are the same, I'm going to read each of them in using pandas. Reading in all the results, we can see they're all the exact same shape, and if I run a head command on each, there's the pandas version and here's the Polars version. It looks like it was sorted a little differently, so let's make sure we sort them all the same way, by airline and year, and it does look like the results are exactly the same for the Polars and pandas versions. The column order is a little different for Spark, but you can see the values are the same, and the Spark SQL version looks like it saved all the column names as lowercase, but the results are the same.

So this video gave you a side-by-side comparison of pandas, Polars, and Spark. You can see how they're all very similar in some ways and different in others, and the tool you want to use is going to depend on the size of your data and the size of the machine you're running on, or whether you're running on a cluster of machines. I hope you learned a lot. If you enjoyed it, make sure you like, subscribe, and write a comment down below letting me know what you'd like to see in a future video. See you all next time.
Info
Channel: Rob Mulla
Views: 71,626
Keywords: data science, data pipeline, big data, how to build a data pipeline, data pipelines, data analytics, apache spark, spark sql, pandas, pandas python, polars, polars data science, data science pipelines, data processing, spark vs polars, polars vs spark, data engineering, rob mulla, data engineering pipelines, data engineering tutorials, data pipeline architecture, data warehouse, big data engineer, data pipeline using spark
Id: mi9f9zOaqM8
Length: 11min 32sec (692 seconds)
Published: Tue Feb 14 2023