A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - Jules Damji

Video Statistics and Information

Captions
Good afternoon and welcome to the last session before we actually have a break. My name is Jules Damji; I am a developer advocate for Databricks, and since the moderator is not here I'm going to introduce myself in the third person, because I'm supposed to be emceeing this particular track anyway. So: Jules Damji is a developer evangelist with Databricks and also a developer advocate. He had the good fortune of working with the creators of Apache Spark, and prior to that he was at Hortonworks and worked as a developer with a number of different companies. Evangelist is not a title that really matters; what really matters is that it's a way of life, and I really enjoy going out and talking to developers. I'm the guy on the left-hand side of the photo; don't worry about the one on the right-hand side.

So why are you here today? Today's agenda is essentially the three different APIs that have evolved over the last couple of years. A lot of developers wonder: the RDD was the one fundamental abstraction for Apache Spark, then came DataFrames, then came Datasets, so which one should I use, and what are the merits and pitfalls of each? Hopefully by the end of this particular talk I'll have convinced you that they're not mutually exclusive: there are scenarios where you want to use RDDs, and there are scenarios where you will use DataFrames and Datasets. I'll also cover why structure is such an important element in Spark, because it allows us to look at things in a structured manner. We as developers, and we as people in general, have an innate intuition to look at things as structure: we look at data as structure, we look at tables and columns as structure, so it was a logical extension to have that view into the data in Spark. And then finally, if I have some time, we might have a demo, and then I'll ignore all the questions that you actually have for me.

So you're not going to hear a tale of three kings of three kingdoms; what you will hear about is the APIs which really make developers successful. How many of you have heard of the book The New Kingmakers, the one that says developers are the new kingmakers? The premise of that book is that today, because of the open-source movement, developers really are the new kingmakers, and one of the keys to the realm, to the kingdom of developers, is APIs. They have to be simple, they have to be declarative, they have to be intuitive, and that is what really makes any platform you use today: the developer APIs. It's an important book, and you'll find it quite interesting.

So let's jump into the first set of APIs, Resilient Distributed Datasets, the fundamental abstraction on which the whole of Apache Spark was developed. What is an RDD, really? It is the logical abstraction on which Spark was initially created, and RDDs have certain characteristics. The first is that they are a logical, distributed abstraction over an entire cluster: if you have a large data set, you can think of it as spread across your storage, divided into partitions, and depending on the size of those partitions some of them might reside on a couple of executors while the other partitions reside elsewhere.
So think of an RDD as a distributed, logical abstraction on which you can write your lambda functions, your compute functions, your queries, and they will execute in parallel on each partition. So, one: they are a distributed data abstraction, and a lot of APIs exist on them.

Second, they are resilient and they are immutable. Resilient means an RDD has the ability to recreate itself at any point during the execution cycle. When you create an RDD and perform operations on it, you go from one RDD to the next one to the third one, and each step gets recorded as a lineage of where it came from; so if something goes wrong, Spark can recreate the RDD. They are immutable in the sense that when you make a transformation the original RDD remains unaltered, it stays exactly the same, and hence you create what we call a directed acyclic graph, so that at any point in time Spark can recreate an RDD. That is the second characteristic: resilient and immutable.

The third is that they are compile-time type safe. You have RDDs of a particular type: you might have an RDD of integers, you might have an RDD of booleans, and as a result you have the confidence, as a developer, that if you write a complicated function and feed it the wrong type of data you will get a compile-time error. That saves a lot of time early in the code, because a Spark application can be quite complicated and debugging in a distributed environment can be quite cumbersome, so compile-time type safety is an enormous benefit.

Fourth, the data can be unstructured or structured. Examples of unstructured data are streams of text coming from social media, or a log file. Now, if you look at a log file you might think, well, there is a sort of semi-structure in there: a line of the log has a date, it has the type of operation that was performed, it has a URL; isn't that a pretty good type? But the RDD doesn't understand those different kinds of types; to it, the line is just a string, and it's up to you as a developer to go in and parse that out. So RDDs have both the unstructured and the semi-structured attribute. Here's an example of what structured data might look like after you parse that RDD and break it down into its respective columns, each with a particular type, and then you might create a table to look at it. So the fourth attribute is that the data can be structured or unstructured, and you have to parse it yourself.

The fifth is that they're lazy, and I don't mean lazy in the human, slothful manner; they're lazy in the sense that they don't get materialized until you perform an action on the RDD. Here you can see transformations happening across about four different RDDs, where T represents a transformation and A an action, and the action is where your entire chain, your directed acyclic graph, gets executed. Every time you perform an action, the whole graph is evaluated, and as a result Spark has the ability to look at all the different transformations you are creating and rearrange them: it can coalesce transformations where that makes sense, so it can take a map and a filter and put them together in one stage, and only start a new stage when there is something like a reduceByKey and a shuffle is going on. That's how Spark materializes things internally in a lazy manner, and that's what I mean by lazy. So those are the five characteristics.
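(A minimal Scala sketch of the lazy-evaluation idea, assuming a SparkContext named sc; the file path and numbers are made up for illustration.)

    // Nothing is read or computed here: these are just transformations
    // recorded in the RDD's lineage.
    val lines = sc.textFile("hdfs:///tmp/numbers.txt")   // hypothetical path
    val nums  = lines.map(_.trim.toInt)                  // transformation
    val evens = nums.filter(_ % 2 == 0)                  // transformation

    // Only the action triggers execution of the whole lineage; the map and
    // the filter are pipelined into a single stage.
    val firstTen = evens.take(10)                        // action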
So, as I alluded to, there are transformations and there are actions. Spark takes that acyclic graph: when you transform, the step goes into the lineage it is creating, and when there's an action, that's when things actually happen. Here are examples of some of the transformations you might perform at a very low level; I don't expect you to know all of them, but it gives you a flavor of what they are, and here are some of the actions you can perform, including I/O. Both of those together give Spark its low-level appeal and give the power to developers such as yourselves.

So you might ask, why would I use RDDs, why is it so important to use them? Well, developers are control freaks. How many of you are control freaks? I see only two hands; that's not very nice. Wow, we got a minority of you, so sorry if I offended you. But developers like control: they want to know exactly what is going on, and they love the idea of being flexible. RDDs give you that flexibility and the control of knowing exactly what's going on. They provide a very low-level API; it's like writing assembly language. Has anybody written assembly language? I might be dating myself. Two people, all right, brilliant, four, that's nice. RDDs are the lowest level at which you can do things in Spark, so they give you a low-level API. They are type safe, compile-time type safe as I said, which is important if you're a programmer and want to save time; people who write in compile-safe languages like the fact that type safety is an integral part of the platform. And, more importantly, they encourage you to look at things in terms of how to do something, because we like that control. In other words, you're telling Spark how to do something, not what to do, and that's the fundamental difference I want you to take away: RDDs give you that rigid kind of control, the ability to tell Spark exactly how to do a thing, and it will do it. It's garbage in, garbage out, right?

So here's an example of what I mean by how to do something. There's a bunch of transformations happening here. If you look at this particular code, you're saying: I'm reading a text file, and this could be gigabytes or petabytes, so it gets split up into partitions and I create an RDD, and this RDD is distributed. The first thing I'm doing is parsing the RDD with a flatMap, which splits each line on whitespace, and then I do a case match on the result: if I get four fields in a row, which are the project, the page, the number of requests, and a fourth field I don't care about (the underscore is Scala for "just match anything"), then give me a tuple of the three elements I care about, and if a line doesn't match, toss it out. Then I go to the second part of the code, which is a filter: once I have that parsed RDD, all I care about is the English Wikipedia, because Wikipedia has pages in different languages from around the world, but for the sake of the demonstration I'm just going to pick the English ones. Then I do a map to say: I don't care about the first field, give me the page and the number of requests as a tuple. Then I do a reduceByKey, very low level, telling Spark exactly what to do: reduce by key over the pages, adding the request counts together. And finally I perform an action, which is take. At that point it starts reading the file, goes through the first transformation, the second, the third, and all of that "how to do something" code gets executed in stages, with the reduceByKey creating a shuffle. Very compact code, and we like to think it's good code; it will do exactly what you're telling it to do.
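(A Scala reconstruction of the kind of pipeline being described; the file path, the exact field layout, and the variable names are assumptions, not taken from the slides.)

    // Each line of the pagecounts file is assumed to look like:
    // "<project> <page> <numRequests> <bytesServed>"
    val pagecounts = sc.textFile("hdfs:///data/pagecounts")   // hypothetical path

    val parsedRDD = pagecounts.flatMap { line =>
      line.split("""\s+""") match {
        case Array(project, page, numRequests, _) =>
          Some((project, page, numRequests.toLong))   // keep the three fields we care about
        case _ => None                                // toss out lines that don't match
      }
    }

    parsedRDD
      .filter { case (project, _, _) => project == "en" }           // English Wikipedia only
      .map    { case (_, page, numRequests) => (page, numRequests) }
      .reduceByKey(_ + _)                                            // sum requests per page (shuffle)
      .take(100)                                                     // action: triggers execution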
So then you ask, when should you use RDDs? As I said earlier: when you care about a low-level API, when you care about control of your data set and you know exactly what the data set looks like. A good amount of the time we're dealing with unstructured data, media streams and text, and that's where you want to use an RDD. You want to manipulate the data with little lambda functions, as we saw in that snippet of code, because you know exactly what the data looks like, and you don't care about high-level functions. You don't care about a schema or structure; some of us who come from an RDBMS background love structure, but maybe you don't care about it. And you don't care about the optimizations Spark can provide because of the structure we introduced, or about the performance inefficiencies you might introduce yourself because, out of developer hubris, you think you know exactly what you're doing and you're going to tell Spark how to do it. If you don't care about any of that, then RDDs are fine.

So then you ask me: Jules, hang on, what's the problem? Why not use RDDs? Well, the problem is that because you're expressing how to do something rather than what to do, Spark cannot really optimize. It can't peek into your lambda functions to see whether you're doing a join, or a select, or a filter of something, and it doesn't know what kind of data you're dealing with, so it can't do those optimizations for you. Also, RDDs in non-JVM languages like Python are very inefficient and take a long time, but maybe you don't care about that either. And you can introduce, unknowingly and unwillingly, inadvertent inefficiencies: you think you're telling Spark the right thing to do, but you might actually be introducing problems.

What do I mean by that? Let's look at an example. Here is a very innocuous-looking piece of code that you've written, and you're telling Spark exactly how to do something. What we've done here is parse the data, do a case match on it, then do the reduceByKey just like we did before, and then do another filter, a filter that removes pages that aren't of a particular kind, and then perform an action. Does anybody see a problem here? Yes: with the reduceByKey you're taking your entire data set across the cluster and doing a massive shuffle, and Spark isn't going to know any better; it will just go in and do it for you. What you really want is to reverse those two lines: you want to filter everything out first, so that the reduce only has to deal with a much smaller amount of data. But because you're not using high-level APIs, Spark won't know that that's what you're trying to do; your intention is hidden behind the fact that you're telling it exactly how to do it.
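(A hedged Scala sketch of the kind of mistake being described, reusing parsedRDD from the earlier sketch; the keys and the "en" filter are assumptions, since the slide's late filter was on the page rather than the project.)

    // Inefficient: the reduceByKey shuffles the entire data set
    // before the filter throws most of it away.
    parsedRDD
      .map    { case (project, page, numRequests) => ((project, page), numRequests) }
      .reduceByKey(_ + _)                                    // full shuffle over every project
      .filter { case ((project, _), _) => project == "en" }  // filter applied too late
      .take(100)

    // Better: filter first, so far less data reaches the shuffle.
    parsedRDD
      .filter { case (project, _, _) => project == "en" }
      .map    { case (_, page, numRequests) => (page, numRequests) }
      .reduceByKey(_ + _)
      .take(100)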
And that's where structure in Spark comes into play; that's where structure makes a huge difference. The high-level APIs, in which you express your computation as a query, tell Spark what you are doing, and Spark peeks into your query to figure out exactly what you're doing so it can do better optimization for you.

To understand what that means, let's take a quick look at the RDD, where things can be unstructured. We know that RDDs have dependencies; we saw that in their ability to record a lineage of transformations, so Spark knows exactly how to go from point A to point B because it recorded that lineage. So there is this notion of dependencies. There is also the idea of partitions, and the RDD knows something about locality, about where your partitions are stored, so Spark can send the right code to the right partition to execute your lambda functions there. And third, there is a compute function associated with an RDD which, given a particular partition, produces an iterator; Spark executes your code on that partition, distributed across the cluster, and then merges the results back. That's the fundamental programming model of an RDD.

So what's the problem? The problem is that this compute function that produces the iterator is an opaque function: Spark doesn't know what it is. Spark doesn't know when you're doing a filter, it has no reading of a select, it doesn't know when you're trying to do a join. And the data you're dealing with is just as opaque: Spark doesn't know whether you're accessing a certain column or which part of the data you're trying to reach. Since it doesn't know what kind of JVM object you're dealing with, it can't optimize it; all it can do is serialize that piece of code and data, send it over to the executor, and execute it there.

That's where the structured APIs come in, and one of the things the structured APIs give you is a spectrum of error detection, of how early in the code errors are caught. On the far left-hand side you have SQL: when you express a query in SQL as a string in Spark, you get a runtime error if you misspell something, for example if you misspell "select"; you won't get a compile-time error, you'll get a runtime error, which can be quite expensive if you're doing a large ETL job, or if you're executing that SQL further down the line on DataFrames you've already transformed and only then hit the runtime error. If you move a little to the right you have DataFrames, and DataFrames give you some compile-time safety, because there are methods associated with them, and if you call a method that doesn't exist, or misspell it, you get a compile-time error. However, if you access a column of your DataFrame that doesn't exist, you get a runtime error, because the column name is just a string as far as the compiler is concerned; only the runtime engine knows about it. And at the far right-hand side you have Datasets. Think of them as very similar to typed RDDs, except the JVM objects that make up your Dataset are defined by you, just like a JavaBean: you have the ability to define exactly what each and every object in your Dataset looks like, and that way you can write functions with full compile-time safety, so if you write a function that takes the wrong kind of argument type you get a compile error. That saves you a lot of time.
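(A small Scala sketch of the left and middle of that spectrum, assuming a SparkSession named spark and a DataFrame df with a column "name"; both names are hypothetical.)

    // SQL: a typo inside the string is only caught at runtime, when Spark parses it.
    spark.sql("seelct name FROM people")   // throws a ParseException at runtime

    // DataFrame: a misspelled method is caught by the compiler...
    // df.selct("name")                    // does not compile: no such method
    // ...but a misspelled column name is only caught at runtime analysis,
    // because the column is just a string to the compiler.
    df.select("nmae")                      // throws an AnalysisException at runtime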
So these three kinds of APIs were converged in Spark 2.0, and what we wanted to make sure of was that developers transitioning to Spark 2.0 and beyond were using only one set of APIs, which are called Datasets. A DataFrame in Scala, for example, is nothing but an alias for a Dataset: because Spark doesn't know exactly what your JVM object looks like when you create a DataFrame, it just gives you a Dataset of type Row, where Row is a generic object. In Java there is no DataFrame, because everything is typed, so you get the Dataset directly, and in Python, since it's an interpreted language, there's no notion of Datasets, there are only DataFrames.

So what does DataFrame API code look like, and what actually is a DataFrame? Think of a DataFrame as a table: a table has a schema associated with it, the schema has columns, and each column has a name and a type associated with it. Now look at this particular code. In what it computes it is no different from the RDD code; the difference is in the clarity and the declarative way in which you express the query. Here I'm taking my original parsed RDD and converting it into a DataFrame, and I'm saying the DataFrame will have three columns: project, page, and numRequests. Then I filter on the project name, I do a groupBy on page, which is what we were doing with the reduceByKey, I aggregate the sum of all the request counts and call that count, I limit it to 100 so I only have a small number to display, and then I show a hundred elements. If you compare that with the RDD code, there you had to know exactly which field, where in that particular row of your RDD, held the page or the count, and it gets a bit complicated. This is very declarative, and because you're telling Spark what you're doing, it can optimize: it has a better notion of how to build a Catalyst-optimized plan that generates very efficient, compact code, so this code will be a lot faster than the version you wrote with the RDD.
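(A Scala sketch of the DataFrame version being described, reusing parsedRDD from the earlier sketch; the column names and the toDF conversion are assumptions.)

    import spark.implicits._
    import org.apache.spark.sql.functions.sum

    // Give the (project, page, numRequests) tuples names and types.
    val df = parsedRDD.toDF("project", "page", "numRequests")

    df.filter($"project" === "en")            // say what you want, not how to get it
      .groupBy($"page")
      .agg(sum($"numRequests").as("count"))
      .limit(100)
      .show(100)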
Here's another example: I can take a DataFrame and create a SQL view on top of it, and then express the same query in SQL that I expressed with my DataFrame and with my RDD. They give you exactly the same results, and the DataFrame and the SQL versions will be equivalent in terms of speed. The SQL is no different from the previous code: all I'm doing is a select, an aggregate, a count, and then showing with a limit of 100. You still don't believe me? Here's another example, where I'm creating an RDD of row-like tuples and turning it into a DataFrame with named columns. Look at the code for the RDD: I have to know exactly which field is where, and it's fairly complicated to read, because you're telling Spark exactly how to do it. The same thing with a DataFrame makes a huge difference: I have a groupBy, and then I just aggregate and do a sum.

So you might ask, why should I use the structured APIs? As I said earlier, structure gives you the ability to say what you want to do in a very declarative manner, very similar to relational operators if you come from a database background: you're telling Spark what to do. Here's an example in SQL doing the same thing, and here's the equivalent you might write using RDDs. Underneath, a lot is happening, because when you tell Spark what to do you get very good optimization. What's happening underneath is that these APIs are supported by the Spark SQL engine, which is built on the Catalyst optimizer and the second-generation Tungsten code generation that gives you very compact code. Your query goes through a transformation, it has a journey, and the journey is very simple. You take any of the three forms we just showed, you express your query using a Dataset, a DataFrame, or SQL, and Spark creates an unresolved logical plan; it then checks the catalog to see which columns and which tables you're referring to, and once that's done it has a resolved logical plan. The logical plan is then optimized, Spark generates a number of physical plans, associates a cost with each one, and selects the best one, and once it has selected that plan it generates RDD code. Remember, the RDDs that get generated are different from the RDD code you would have written by hand; this is very optimized code. In the end everything in Spark gets decomposed into RDDs; RDDs are here to stay, they will never disappear as the low-level API, but the code that gets generated for you is very optimized.
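(Putting the SQL view and the plan journey together: a hedged Scala sketch reusing df and the hypothetical column names from above.)

    // Register the DataFrame as a temporary SQL view; the SQL and DataFrame
    // versions go through the same Catalyst optimizer and run equally fast.
    df.createOrReplaceTempView("pagecounts")

    val viaSQL = spark.sql("""
      SELECT page, SUM(numRequests) AS count
      FROM pagecounts
      WHERE project = 'en'
      GROUP BY page
      LIMIT 100
    """)

    // Inspect the query's journey: the parsed plan, the analyzed plan,
    // the optimized logical plan, and the selected physical plan.
    viaSQL.explain(true)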
Let's look at a quick example of what I mean by optimization. Here is a simple query doing a join across two tables plus a filter, and the logical plan creates a tree of operators: read the events, read the users table, do a join, do a filter. Then it moves to the physical planning, and in the physical plan Spark rearranges that: it scans users first, does the scan with the filter, and then does the join. And when the cost-based optimizer comes in, it can optimize further. It knows you're scanning users from a relational database and reading the events from, say, a Parquet file, so it pushes one filter down to Parquet, because Parquet is efficient at filtering, and pushes the other down to the RDBMS, because it is more optimized to do that filtering, and only after the filtering does it do the join. What happens, essentially, is that you read less data, you bring less data up into Spark, and with less data to work with your query is more efficient and runs a lot faster than bringing all the data up and doing the filtering in Spark. So when you express a query by telling Spark what you're trying to do, it will try to figure that out and rearrange your code.

How many of you are Java programmers? I can see JP right there; he's going to raise his hand, of course. A JIT compiler is no different: when you write a for loop in your Java code, the JIT will rearrange it, because it knows exactly how your memory layout is done. Just-in-time compilers have evolved so far that they can rearrange your code; you think you've written the best code, but the JIT knows where the registers are, it knows the data types you're using, so it creates more efficient code. The same idea, the same strategy, applies here when the optimized RDD code is generated for you.

So what's a Dataset? A Dataset is a more compact way of saying: I'm going to take my DataFrame and define a JVM object, so that each and every object in my data set looks like that particular row. Here's a very simple example. I'm reading a JSON file that might contain people, with two fields per record, and I create a case class of type Person which has two fields, a name and an integer, and I convert the DataFrame into a Dataset. Now I have a typed Dataset, and when I use my filters I can just use Java-like notation, p.name or p.age, to compare against a value of the right type. Now, if I compare against a value of the wrong type, what do I get? A compile error, because I'm doing something I shouldn't be doing, and that catches the mistake quite early.
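(A hedged Scala sketch of the Dataset example being described; the file path and the field names are assumptions.)

    import spark.implicits._

    // Each JSON record is assumed to look like: {"name": "Jules", "age": 30}
    case class Person(name: String, age: Long)

    val peopleDF = spark.read.json("/data/people.json")   // hypothetical path
    val peopleDS = peopleDF.as[Person]                     // typed Dataset[Person]

    // Java-like field access inside the lambda, checked at compile time:
    peopleDS.filter(p => p.age > 21).show()

    // peopleDS.filter(p => p.age > "21")  // would not compile: wrong type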
So what are some of the merits of using DataFrames and Datasets in Spark 2.0? This was a recent benchmark, and they run a lot faster. The first three green bars are the ones using the same DataFrame query, and you can see they're quite equivalent: three different languages, but because they all go through the Spark SQL engine you get the same performance. You can also see that Python on RDDs is a lot slower, and the reason is that when you use a Python RDD we pickle the data and send it to an external Python process; that process unpickles it, executes the code, re-pickles the result, and sends it back, and that round trip is the performance hit. The RDD in Scala is faster than that, but the RDD in Python is slow because of the external process we have to communicate with.

What are some of the other advantages of Datasets? Well, Datasets take less memory in the cache, because of the way we create them: the Tungsten code generation uses encoders to lay the data out in off-heap memory, so caching is more efficient. And for serialization, Spark does a lot of serialization and deserialization when a shuffle is happening, and when we use encoders, which are the way we format memory with Tungsten, the data sits in off-heap memory, so we can serialize by walking across that memory directly, which is a lot faster than using Kryo or Java serialization and so on.

So, again, back to the point: why should I use DataFrames and Datasets, and when? If you care about high-level APIs and you want to express your query in a domain-specific language, you want Datasets and DataFrames. If you care about strong type safety, if you want to catch errors at compile time, early in the code, you want that type safety. Ease of use and readability: if you want your code to read well, so that somebody reading it can tell exactly what you're doing rather than deciphering the cryptic way of doing it with an RDD, that's a reason to use them. And the most important thing, I think, is when you want to tell Spark, to express in your query, what it is you're trying to do: you're trying to aggregate something, compute an average, do a filter, do a select, and so on; that's when you want to use Datasets and DataFrames, and here it makes no real difference which of the two you pick. If you care about structure, if you care about a schema for your data and you don't want to worry about inferring it, you tell Spark exactly what it is: that's when you use structure and schema. If you care about code optimization, when you really want Spark to do all the optimization for you rather than you trying to tune it yourself, that's another reason. And if you care about space efficiency and the ability to do things in Tungsten, that's where you use your Datasets and DataFrames.

And going forward, this is what the future of Spark looks like. If you attended the keynote, everything there was built on top of DataFrames. If you look at Structured Streaming, the way it works today is that it just uses DataFrames as a continuous table and does everything on that. If you look at ML pipelines, the Spark ML pipeline is based on the notion that you create DataFrames and feed them through. The new APIs we're building are based on DataFrames as well: the TensorFrames work we presented at the meetups is built on top of them and uses DataFrames to communicate with TensorFlow for distributed computation, and the Deep Learning Pipelines demo you saw at the keynote is based on DataFrames too. GraphFrames, another example, which is really an equivalent of and replacement for GraphX, is based on DataFrames. So anything new that we develop will be built on top of the structured APIs.
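(As a small illustration of the "DataFrames as a continuous table" idea behind Structured Streaming, a hedged Scala sketch; the source directory and schema are made up.)

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("page", StringType)
      .add("numRequests", LongType)

    // A streaming DataFrame: the same DataFrame API, but over an unbounded,
    // continuously growing table of incoming JSON files.
    val streamDF = spark.readStream.schema(schema).json("/data/incoming/")

    val counts = streamDF.groupBy("page").sum("numRequests")

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()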
The structured APIs are undergirded by the Spark SQL engine, which has made tremendous strides by using the Catalyst optimizer and Tungsten, and which keeps going through incremental performance enhancements. In Apache Spark 2.2 we released what we call the cost-based optimizer, which attaches statistics to your tables, and if your query is really expensive it uses them to figure out the best way to execute that query. Visually, that's what it looks like when you put it all together.

In conclusion, this is how Datasets and RDDs relate. Datasets are the universe of everything: they have the best of both worlds. They have the best of RDDs in terms of compile-time type safety, giving you the ability to do functional programming in a type-safe manner, and they have all the great benefits of DataFrames in terms of being relational: they use the Catalyst optimizer underneath to make sure you get highly performant generated RDD code, they have the ability to rearrange your operators to get you the maximum benefit, and you can still do the sorting and filtering you would do with an RDD.

I don't have time for the demo. We have a book coming out shortly, Spark: The Definitive Guide, so stay tuned for that. This talk was really a vocalization of a blog post I wrote that went quite viral, called A Tale of Three Apache Spark APIs; there I go into more detail, and there is a notebook associated with it that demonstrates exactly when and how to use RDDs, so take a look at that. It's also a combination of one of the talks we gave at one of the conferences, and there are some resources available here; go for them. The slides will be available to you in a week or so. Unfortunately I'm out of time, but I'm hanging around here if you have questions. Thank you for taking the time; I'll answer questions, but I think we're out of time and we have to go to lunch, so thank you for coming to the talk, and I hope you enjoy the rest of the day. [Applause]
Info
Channel: Databricks
Views: 67,561
Rating: 4.9349065 out of 5
Keywords: #apachespark, #datascience, spark summit
Id: Ofk7G3GD9jk
Length: 31min 19sec (1879 seconds)
Published: Mon Oct 30 2017