Data Wrangling with PySpark for Data Scientists Who Know Pandas - Andrew Ray

Captions
All right, thank you. My name is Andrew. I'm a small-time Spark contributor; I've worked on Spark SQL, GraphX, and most importantly for today, PySpark. I'm a principal data engineer at Silicon Valley Data Science. Previously I was a data scientist at Walmart, and I got my PhD in mathematics from the University of Nebraska. Really happy to be here today.

The company I work for is named for the Silicon Valley approach to doing data science, and we happen to be based there. We work in cross-functional teams of software engineers, architects, data scientists, strategists, and designers who specialize in data-driven product development and experimentation. We work in blended teams trained in holistic, strategic, and agile approaches, so we offer data strategy, architecture, engineering, and agile data science. Also, later this afternoon I'm giving another talk on graph algorithms; if you're interested, come see us. And if you want copies of these slides, or any of the other talks we've done at Spark Summit, go to our website, put in your contact information, and we'll send them out to you right away.

Okay, so here's what we're going to do today: we'll talk about why PySpark, give a quick primer on Spark, cover how to get set up and running on your laptop, compare it to pandas, and then talk about some best practices. This is very much an introductory talk, so if that's not what you're looking for, this is your chance to duck out.

So, why PySpark? Do you already know Python and pandas, love data frames, and want to work with big data? PySpark is the answer; it's what you're looking for. What do you gain? You gain the ability to work with big data, very good native SQL support, and decent documentation. I say decent because what you lose, on the other hand, is pandas' amazing documentation; pandas has been around a lot longer, it has really good documentation, and it has great plotting, which PySpark does not. You also lose support for indices. If you know a lot about pandas, you know indices are the basis for its data frames; you don't have that here.

Okay, so let's give a quick primer on how PySpark, and Spark in general, works. You've probably seen this blurb before: Spark is a fast, general engine for large-scale data processing. It runs on distributed compute, that is YARN, Mesos, or a standalone cluster, and it can also run locally on your laptop. It has two main abstractions: RDDs, a distributed collection of objects, and DataFrames, a distributed dataset of tabular data, much like a pandas DataFrame. It has integrated SQL, and there are ML algorithms that work on top of both of those. If you've read a bit more about Spark, there's also a Dataset API, but that's not in PySpark.

There are two important concepts to keep in mind when you're transitioning to PySpark, and those are immutability and lazy evaluation. First, immutability: what does that mean? It means that when you make a change, you get a new reference, and the old thing doesn't change. It's much easier to reason about. Unfortunately, in pandas it's sometimes hard to know whether you're getting a copy of a data frame or a view of a data frame, and that's sometimes an area of confusion; you don't have that here, so I think that's a good thing, but it does change, syntactically, a little bit of how you interact with data frames. The other thing is lazy evaluation, and that means that the things you ask it to do to manipulate your data frame don't actually happen until you request some sort of output. This is good, because it allows PySpark to fuse together the operations you're doing and do other fancy optimizations to make it all run faster.
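A minimal sketch of those two ideas, immutability and lazy evaluation, assuming a local Spark session; the DataFrame here is made up and is not from the talk's slides:

```python
# A tiny, made-up example (not from the slides): immutability and laziness.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(5)  # a small DataFrame with a single "id" column

# Immutability: withColumn returns a *new* DataFrame; df itself is unchanged.
df2 = df.withColumn("doubled", F.col("id") * 2)

# Lazy evaluation: the line above only built a plan; nothing runs until
# you ask for output, e.g. show(), collect(), or a write.
df2.show()
```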
And here's a very, very simplified picture of the architecture, so we're all clear on terminology: you, the user (that's the smiley face), interact with a driver, which in turn directs executors that run on your worker nodes and operate on your data.

All right, so how do we get up and running with PySpark? There are a few different ways. The first way is to go to the Spark website, download it, untar it, and add the bin directory to your path. The second, more recent option, if you're a conda user (how many people use conda? Excellent, excellent), is to use conda to install it; it's on conda-forge, and you can optionally make a new conda environment for it. It works great. Very soon, once some issues are resolved, you'll also be able to pip install it, which is great for people who either don't use conda or don't have the option of using conda because of other restrictions.

Once we've done that, it's just a matter of saying pyspark. If you want an IPython shell, with a little bit fancier stuff, you can set that environment variable first, or add it to your profile; and if you want a Jupyter notebook, you can set those environment variables and have it launch when you run pyspark. There are other ways of setting it up, like creating a proper kernel spec for JupyterHub or Jupyter; if you want to do that and you look on the internet, it's fairly easy. Also note that this is running in local mode only; if you want to actually connect to a cluster, you have to add some more options and point it at some configuration files. Look in the documentation.

All right, so let's compare it to pandas. Probably the first thing you want to do is load in some data, say read a CSV. Very similar, right? Okay, that first example is a lie. The defaults are unfortunately a little bit different. If you actually want to do this with a common CSV that has a header, and probably data types other than strings, and you want those data types in your data frame, you'll want to pass in some options to get slightly more sensible handling. With all the data formats that Spark reads, you can give these extra options that change how the reading happens.

You want to take a peek at it? With pandas that's really easy, because the default string representation is the data frame itself, or at least some section of it. Here you have to do a little bit more: you have to say .show(), unless you're running on one of the more magic platforms like Databricks. The default string representation of a PySpark data frame is just the schema, not the actual data, so you've got to have that .show(). I'm not going to put it on all the further examples, but keep it in mind if you're trying to run any of them. And of course, if you want to look at a specific number of rows, you can do that as well.

If you want to inspect the column names and data types, it's exactly the same. If you want to change the names of the columns, again, because these data frames are immutable, we can't just assign to the column names; we have to actually create a new data frame with those names. Don't worry, it's not going to copy anything or do anything intense here; it's just changing some metadata in a new object, but it is a little bit different. (There's a sketch of these first steps below.)
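A minimal sketch of the setup and inspection steps just described, assuming a hypothetical file "data.csv" with a header row:

```python
# Hypothetical file "data.csv" with a header row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The defaults differ from pandas: ask for the header row and schema inference.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

df.show()           # the default repr is just the schema, so call show()
df.show(5)          # or look at a specific number of rows
print(df.columns)   # column names
print(df.dtypes)    # (name, type) pairs, much like pandas

# Renaming columns: build a new DataFrame rather than assigning to df.columns.
# This assumes the file has exactly three columns.
df = df.toDF("col_a", "col_b", "col_c")
```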
You can also rename individual columns, although only one at a time. If you want to drop some columns, again it's very similar. Note that there's no axis concept here. In pandas there's the notion of which axis, rows or columns, you're dropping from, because rows are indexed and you can drop targeted ones; you can't do that in PySpark, it doesn't make any sense there, so the only thing you can do is drop columns.

If you want to do some filtering, again it's very, very similar. You can use the dataframe.column_name syntax to refer to your column and build a predicate on it, just like with pandas. Of course, the dot-column-name approach only works when the name doesn't collide with something built in, so in other cases you may have to use brackets. You can do compound conditions just like pandas as well, exactly the same, and I will say it will give you equally terrible error messages if you forget the parentheses on those compound conditions; if you get something like that, add the parentheses.

Okay, so you want to add a column. Again, because we're immutable here, we can't just make assignments; we have to do something else, like withColumn, but other than that it's pretty much the same. There are some differences, though, and one that might not be so obvious is division by zero: pandas gives infinity, and Spark gives null, which is the more SQL-like version. Just keep in mind that there are some differences.

You want to fill NAs? Also very similar, although pandas has a lot more options for this; Spark is a little lacking. You may be able to recreate some of those more advanced filling options with fancy SQL, but a lot of the things that deal with, say, time series are going to be really hard to do. If you want to do aggregation, you can do that exactly the same, at least in this syntax option; there are other syntax options that don't line up exactly, but you can do it with the same syntax.

Okay, we get the point, right? Let's see what else is a little more different in your transition. Let's say you want to do some standard transformation on a column, like taking the log of some value. In pandas your go-to for that was numpy. You could use numpy to do the transform in PySpark, but don't; you absolutely want to use one of the built-in functions in PySpark. Import pyspark.sql.functions, give it some short name, and you're going to reuse it a lot everywhere. It has all the common transformations you're going to want to do every day, and lots and lots of them; there are hundreds, and there are even some built-in SQL functions that aren't in there, but that's a different story. Why do you want to do this? Because it keeps the compute that's happening on your data in the JVM, which means you're not actually running any Python at all on your executors; running Python there slows things down a little bit.

Just to make the point, you can do some pretty complex stuff with those built-in functions. Say you wanted to apply some complex predicate to each row: you can do the equivalent of that in PySpark using those functions. There's a when() that can be chained, basically if / else-if / else conditions, and you can do the equivalent of that (there's a sketch of these column operations below). But sometimes you really just have to apply some Python.
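A minimal sketch of the column operations described above; the DataFrame df and columns like "price" and "category" are hypothetical, not from the talk:

```python
# Hypothetical DataFrame "df" with columns "price", "category", "unwanted_col".
import pyspark.sql.functions as F

df = df.drop("unwanted_col")                          # drop a column; no axis=

cheap = df[df.price < 100]                            # simple filter
both = df[(df.price < 100) & (df.category == "a")]    # parentheses required!

df = df.withColumn("log_price", F.log(df.price))      # built-in fn, stays in the JVM
df = df.fillna(0)                                     # fewer options than pandas

df.groupBy("category").agg(F.mean("price")).show()    # aggregation

# Chained when/otherwise: the per-row equivalent of if / elif / else.
df = df.withColumn(
    "bucket",
    F.when(df.price < 10, "cheap")
     .when(df.price < 100, "medium")
     .otherwise("expensive"),
)
```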
If you want to do that (obviously this isn't a very realistic example, we're just adding one, but pretend it's something more complex) and apply it to some column in your data frame, you can register it as a UDF, a user-defined function, and give the return type of that UDF. It's very important that this is actually the correct return type; otherwise your UDF will just give you null every time. Then you can apply it just like any other function. It is important that your UDF be deterministic, because Spark might evaluate it more than once, or less than once, for each of your rows; it may do optimizations where it applies the same function to the same value a bunch of times but only does the computation once, so it has to be deterministic.

Say you want to join up, or merge in the pandas vernacular, some data frames. The terminology is a little different: in PySpark it's only join. In pandas, if you recall, join is for when you want to do a merge on a common index; you don't have indices here, so there isn't that distinction, it's all just joins. If you want to join on, say, left and right keys with different names, you can do that just like before, though it's a little different syntactically. And of course it supports left, right, and full outer joins, with inner join as the default, but it's all the same. A quick plug: you can also do pivot tables just like in pandas, the syntax is a little different; I only plug that because it's one of my contributions.

Summary statistics: this is something you want to do all the time on your data when you get something new, you do a describe. You can do that in PySpark, and you're going to get a lot of the same things, but unfortunately only the main ones: count, mean, standard deviation, min, max. You don't get the quartiles. You can get the quartiles, it just takes a little bit more code. You can use the built-in function percentile_approx, which is built in but unfortunately not in that functions import we did earlier; like I said, there are some additional SQL functions that are built in but not in that list for some reason. You do a selectExpr and put in the SQL expression percentile_approx with the name of your column and an array of the percentiles you want, in this case the 25th, 50th, and 75th. (Inaudible audience question.) Yeah, you can do that fairly easily as well; let's talk after.

Okay, histograms. Pandas makes it really easy to graph things; PySpark unfortunately doesn't, but you have an escape hatch, and that is the wonderful toPandas method, which converts your PySpark data frame into a pandas data frame, and then you can do all the things you could do in pandas before. However, in general you're not going to want to convert the whole thing over, because your PySpark data frame is probably huge, and if you try to put it all in memory you'll run out of memory. So in general you want to do a sample or a limit before you convert, so you don't have that problem. Also, if you really, really want, you can do exact histograms; you probably don't need to. Ask a statistician; I'd guess the answer is no, you don't need to do that, a sample is good enough in general. (A sketch of the UDF, join, and quartile examples follows below.)

Okay, SQL. Pandas does not have SQL support. There are third-party libraries like yhat's pandasql, but that's built on SQLite, which, if you're not familiar with it, is a bit more limited in its SQL syntax support, whereas PySpark has very, very good SQL support, including analytical window functions and all that kind of fancy stuff.
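A minimal sketch of the UDF, join, describe, and percentile_approx examples; the DataFrames df and other and their columns are hypothetical:

```python
# Hypothetical DataFrames "df" (with integer "quantity", numeric "price",
# and key "item_id") and "other" (with key column "id").
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F

# A Python UDF; the declared return type must be correct, otherwise the UDF
# silently returns null. It should also be deterministic, since Spark may
# call it more or fewer times than once per row.
add_one = F.udf(lambda x: x + 1, IntegerType())
df = df.withColumn("quantity_plus_one", add_one(df.quantity))

# Joins: there is no index concept, so everything is a join on columns.
joined = df.join(other, df.item_id == other.id, how="left")

# Summary statistics: count, mean, stddev, min, and max only.
df.describe().show()

# Quartiles via the SQL function percentile_approx, through selectExpr.
df.selectExpr(
    "percentile_approx(price, array(0.25, 0.5, 0.75)) AS quartiles"
).show()

# Escape hatch for plotting: sample (or limit) first, then convert.
pdf = df.sample(False, 0.1).toPandas()
```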
So there's lots of cool stuff you can do, and the other part is that you can switch back and forth between SQL and data frames, like in this example. We register a table: basically, I have a data frame and I give it a name that can be accessed in the SQL world. Then I do some SQL, a select statement; I'm not really doing much here, but you could do any arbitrary statement, and you can assign the result of that SQL to a data frame. Remember, this is lazy, so it's not actually doing anything yet; it's just creating a reference, and you can then continue to operate on that data frame just like any other data frame. It's going to fuse all those operations together at the end, when you actually try to output something, because it's all really one unified API. (There's a sketch of this at the end of this section.)

All right, so let's talk about some best practices. Make sure to use those built-in functions; they will make your life a lot better and save a lot of time. Use the same version of Python, and of all your packages, on the cluster as on the driver. If you don't ensure that this is the case, you can run into all sorts of crazy problems when you actually go to run the Python code that has to execute on the cluster, and things just don't work for some reason; it's a pain. One way to ensure that is to use conda environments: you can create a conda environment, export it, and make sure you have that same environment across the cluster. Check out the built-in UI for Spark; it has a lot of information about what's running and can help you debug in some instances where things aren't working out right. If you're running this on a cluster and you want to use notebooks, you may want to learn about SSH port forwarding; it lets you run notebooks on a cluster when all you have is SSH access, by forwarding those ports over to your local computer so you can still reach the notebook. Another way to go about this is JupyterHub, if you have more people and someone to set it up. Check out Spark MLlib; it's sort of the equivalent of scikit-learn on top of PySpark and gives you all those machine learning libraries at scale. And finally, read the fantastic manual; it's really not that bad. It's not as great as pandas' documentation, but it's there.

All right, what not to do. Don't try to iterate through all the rows. That works perfectly fine in pandas; it's not the most efficient thing, but it gets the job done. You can't do that in Spark; it just doesn't work. When you're writing these things as a script, don't make the mistake of trying to hard-code a master; set that in the spark-submit command line, and then your scripts will be a lot more versatile, you can run them however you want. If you're going to do filtering on something you're converting to pandas, do it before the conversion, not after: if you filter first, it will probably run; if you filter after, it probably won't, because you're pulling the whole thing into memory.

Okay, what to do if things go wrong: don't panic. Read the error; sometimes they're good, sometimes they're not. Definitely Google it, because someone else has most likely run into it before. There's an active Stack Overflow tag, apache-spark, that you can search and ask questions on, and there's also an active user mailing list whose archives you can search, or you can ask new questions there. And if you find something that you really think is a bug, please go file a JIRA ticket; the worst that will happen is someone says no, that's not a bug, do this instead.
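A minimal sketch of mixing SQL with the DataFrame API, plus the "filter before you convert" rule of thumb; the table name, query, and columns are made up, and the temp-view call assumes Spark 2.x:

```python
# Hypothetical table name, query, and columns.
df.createOrReplaceTempView("my_table")       # expose df to the SQL world

result = spark.sql("SELECT * FROM my_table WHERE price < 100")

# result is just another lazy DataFrame; keep chaining the DataFrame API on it.
result.groupBy("category").count().show()

# Filter on the Spark side first, then convert to pandas.
pdf = result.filter(result.price < 10).toPandas()
```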
Okay, all right, that's it. Remember, I have another talk later this afternoon if you're interested in graph algorithms. Thank you. [Applause]

Q: Hi, thanks for the introductory talk, it was very useful for me, and I have a question. I'm a data scientist; is there any reason I should invest time in Scala Spark, or should I focus my efforts on PySpark?

A: There are a couple of aspects to that. One is performance: if you care about performance and you want to do a lot of custom stuff that's not built in, you can get a little better performance with Scala or Java. But if you're using all the built-in stuff, the performance is exactly the same, because it's actually executed exactly the same way. The second is API completeness. Unfortunately, there are some cases where the Python API lags a little behind the Scala API; new things get added, like a new algorithm in MLlib, and no one adds a Python wrapper for it right away. That's not very common; most of the time they're equal, but there are some instances where the Python API is a little behind, and they're usually resolved fairly quickly. So in general I don't think it's worth the extra effort of learning a whole new language just for those small things; I think it's a much better investment of your time to just learn PySpark and go with that.

Q: Thanks for the presentation. I have a question about PySpark visualization, and correct me if I'm wrong: are you suggesting that the only way to leverage pandas or other basic Python visualizations is to do sampling and then convert to a pandas data frame, or does PySpark have some visualization libraries?

A: There is nothing built into PySpark. If you're using extra tools, like say Databricks, they have visualizations that can run on your full data frame, but in the general open-source sense of what's included in PySpark, there are no visualizations at all. Generally speaking, though, when you're looking at a large enough dataset, there isn't any meaningful difference between the graphical representation of a sample and that of the full dataset. That isn't universally true, but for a lot of visualizations there's no real meaningful difference. The other thing you can do, if you want to work with the entire dataset, is do the aggregations on your Spark data frame, because in general your visualization is of some sort of aggregated data; you do those aggregations to essentially shrink it down to something that fits in memory, and then do the visualization off of that (see the sketch below).

Q: Hi, thanks for your talk. I'm new to Python and scikit-learn. Are there any good practices, given that scikit-learn is probably a lot richer in terms of what it can accomplish and the algorithms it offers? Is there a best practice you follow, where you do some things purely in scikit-learn and some things in PySpark?

A: I'd say the main difference is how big a dataset you can work on. If I had infinite memory, I would just do everything in scikit-learn, and with infinite compute, right? But there are these realities that we keep running up against, and I think the distinction is really just that when it becomes hard to do with traditional tools, that's when you start to look at these distributed solutions like PySpark.
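A minimal sketch of the two approaches from that answer, sampling versus aggregating before plotting; the columns and plots are hypothetical, and plotting assumes matplotlib is available:

```python
# Hypothetical columns "price" and "category".
import pyspark.sql.functions as F

# Option 1: plot from a sample of the raw data.
sample_pdf = df.sample(False, 0.01).toPandas()
sample_pdf["price"].hist()

# Option 2: aggregate on Spark so the result is small, then convert and plot.
agg_pdf = (
    df.groupBy("category")
      .agg(F.mean("price").alias("avg_price"))
      .toPandas()
)
agg_pdf.plot.bar(x="category", y="avg_price")
```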
Q: I had a question about writing UDFs in Python. When you write a UDF in Python, the insight I haven't fully gotten is: what are the best practices for writing Python UDFs? The only thing I can think of is to use list comprehensions when possible, that kind of thing. Do you have any thoughts on best practices for writing less painful, less slow UDFs?

A: One thing you can do, if you're doing some sort of numerical thing, is, just like before, use things that run native code on the back end, like numpy. But in general, for Python UDFs, it's not really any different from what you would do in pandas; unfortunately there's no magic bullet.

Q: So, just a clarification on that: there aren't any JVM optimizations that happen there, right?

A: Right. What happens when you have a Python UDF is that the executors spawn a Python process, pipe the relevant data to that Python process, run your Python code, and pipe the result back, so that's where the inefficiency is. If you're not running any Python, everything just stays within the JVM. It boils down to this: if you profile the Python code itself, it will perform roughly the same as it would in plain Python; the overhead is in the piping (there's a sketch below).

Okay, thank you. All right, let's thank Andrew once again. [Applause]
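A minimal sketch of using numpy inside a Python UDF, as suggested in that answer; the column name is hypothetical, and for a trivial per-row call like this the benefit is small, since it mainly helps when the UDF does heavier numerical work:

```python
# Hypothetical column "price"; numpy runs native code inside the UDF, though
# the rows still get piped to a Python worker process, so prefer a built-in
# function whenever one exists.
import numpy as np
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F

log1p_udf = F.udf(lambda x: float(np.log1p(x)), DoubleType())
df = df.withColumn("log1p_price", log1p_udf(df.price))
```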
Info
Channel: Databricks
Views: 103,033
Rating: 4.9448566 out of 5
Keywords: apache spark, spark summit
Id: XrpSRCwISdk
Length: 31min 20sec (1880 seconds)
Published: Wed Jun 07 2017