Advancing Spark - Understanding the Spark UI

Video Statistics and Information

Captions
Hello Spark fans, and welcome back to Advancing Spark. We've had a few suggestions saying you'd really like it if we stepped into the Spark UI, so this time we're going to do a bit of "what's happening in there, what's going on in my cluster, and how do I use the Spark UI to do some troubleshooting". We've talked about one or two of these things before, so I'll do a quick recap of what's going on inside Spark, and then how you use the Spark UI to diagnose it, to figure out what's going on, and to use it in the middle of your queries. Don't forget to like and subscribe, let me know if you've got any questions down in the comments, and otherwise let's have a look.

Starting off with a bit of a refresher: how does Spark actually work under the hood? We've got a driver and we've got some workers, and depending on how you've set your cluster up you'll have a variable number of workers of different sizes. Now, the way Spark currently works is that these are all based on JVMs. We know Databricks is working on the Photon engine, which is all C++ based and a whole different thing, but currently, everywhere until Photon is rolled out, Spark runs in a Java virtual machine, which in Databricks is one-to-one with each of our workers. Inside each JVM we have an executor, which is just the Spark program running in there, and inside that we have a number of slots. A slot is an available bit of compute, essentially a CPU core. So if I've got four executors with four CPUs each, I've got 16 slots available for me to do work on. The size of your worker, and therefore how many CPUs it has, changes the number of slots, and the number of workers changes the number of slots too. Each slot can process one unit of work, one task, at a time. That's how our cluster works today, before we get to the Photon stuff.

When we're doing a job, dataframe.read and all that kind of stuff, we have transformations in there, which can be narrow or wide, and the type of transformation we're doing denotes whether we need a shuffle. A wide transformation is something that can't be done internally on a single worker: it has to take the data off the workers, shuffle it around, change how that data is chunked up, and put it back on the workers again. Then we've got actions that trigger all of that. So we tend to see this kind of thing: if I'm working with a particular set of transformations and I say go, run that action, write my data or display my data, it'll traverse back up the list of things I've asked that DataFrame to do, all the different transformations, the read activity and so on, and it'll build a query plan. In that plan, any time it's having to shuffle data, any time it's having to change the number of RDD blocks my data is held in across the workers, that's a shuffle, called an exchange: it takes data off workers, moves it around, and puts it back onto other workers. You'll see that represented as one set of tasks, then another set of tasks, then another, representing the different stages in my query. There's a bit of terminology in there: the whole thing is a job. If I hit run on something like a DataFrame display or dataframe.write, if I hit run on an action, that's going to trigger a Spark job (or multiple Spark jobs, depending on what I'm doing), and that's going to result in some kind of activity.
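To make the narrow-versus-wide idea concrete, here is a minimal PySpark sketch; the path and column names are invented for illustration, they're not from the video:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and columns, purely for illustration.
df = spark.read.parquet("/mnt/lake/sales")            # read: one task per input block

narrow = (df.filter(F.col("amount") > 0)              # narrow: each partition handled in place
            .withColumn("net", F.col("amount") * 0.8))

wide = narrow.groupBy("region").agg(F.sum("net").alias("total_net"))  # wide: needs an exchange

# Nothing has run yet; only when an action fires does Spark build the plan,
# split it into stages at each exchange, and submit a job.
wide.write.mode("overwrite").parquet("/mnt/lake/sales_by_region")
```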
So a job is broken up into these things called stages: each time there's a shuffle, it's a new stage with a new number of tasks, and those tasks represent the blocks of data that need processing inside that particular stage. If I say "go and add up all of my data", there'll be an initial stage which is reading the data in, another stage which is each worker doing its own partial counts, another stage which is collecting the results of those counts onto a worker to collate them, and possibly another stage sending the results somewhere, depending on where I'm putting them. Each element of the query that requires an exchange of data we'll see as a separate stage. Now, if we're using a load of libraries, especially machine learning and data science type libraries, and even with things like Delta, they might be doing some pre-jobs and post-jobs: reading transaction logs, updating state, all that kind of stuff. So sometimes it's a little messy, in that you'll run something and you'll see several Spark jobs, but usually one of those is your actual query running.

OK, so we've got that idea: a cluster with workers and slots, and a job with stages and tasks. As we run a particular job, the first thing that happens is allocating slots: a task fits neatly into a slot, and where I've got more slots available than I have tasks, I can be running multiple jobs at once. So parallelism is managed by asking: which slots are free, which tasks are waiting? I've got two queries waiting to run, that one needs 200 slots, this one needs eight, how do I fit them together so neither one is blocking too much? You get that kind of activity going on inside Databricks, and as soon as one stage finishes, the next stage's tasks take those slots up. That's the kind of activity we're going to see, and having that model in mind is really important for understanding the Spark UI, because that's how the Spark UI is structured: we've got tabs for what are my jobs, what are my stages, what's going on in my executors, and these all relate directly back to this stuff. So let's switch over.

OK, this is actually from when I did the AQE demo, and there are deliberately some fairly gnarly queries in here. There are a few different ways we can go and ask what's going on on this particular cluster. We're going to leave that running; it's going to do a lot of stuff. We can see there are some Spark jobs happening, and you can see they have stages and they have tasks. That says I have eight tasks I want to run as part of this particular stage. It's not having to do any shuffles, which is great, so I've got one job, that job consists of one stage, and that stage has eight tasks. I'm running on a fairly small cluster, I think I've got eight slots, and you can see it's currently running eight of eight: that's how many tasks are currently running versus how many tasks there are in total for this stage. So I'm running all of the tasks in parallel, which is nice and efficient. We'll leave that running for a second, and then, in terms of how we get to the Spark UI, there are a few different paths.
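If you want to sanity-check the slot maths from a notebook rather than the UI, something like this works on a standard Spark setup; these are standard SparkContext and config calls, but the exact values depend on how your cluster is configured:

```python
sc = spark.sparkContext

# Roughly the total cores Spark can schedule across the workers,
# i.e. the number of slots available for tasks.
print("default parallelism:", sc.defaultParallelism)

# How many partitions a shuffle produces by default - this is where
# the "200 tasks" in a shuffle stage usually comes from.
print("shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))
```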
First, we've got this View button, which takes us to a cut of the Spark UI for that particular job; it's like a little shortcut to go in and work out what's happening for that job in isolation. I can see this is the job, it's now succeeded, I can see what the actual thing I was trying to run was, how long it took, how many tasks there were, and how much data I read as part of it, so you get that high-level, holistic stuff, and we can dig in and have a look at what else is going on. My biggest tip: if you're looking at a query, trying to understand why it's taking so long and what it was doing, and the execution plan looks fairly complex, you've always got that associated SQL query link, which shows you the Catalyst engine's representation of what the query was doing, and that little hyperlink takes us to a really useful graphical representation of what's happening inside that particular query. So I can see it's taking a while because it's generating 30 million rows, then doing a project (basically a select, deciding which columns it wants), and then it's registering to Hive. That's what it's doing, and you can go and understand why a query is working a certain way by taking that path. So, under a particular job, if you expand it you can see how many stages it has directly; if you want to know what's happening inside those stages, hit View to get that pane up; and then to really understand it you've got the SQL query, or you can drill down into the later bits. That's one route into the Spark UI. The other route: I can use the dropdown on the cell, where I've got a link to the Spark UI, and that's not the Spark UI for a particular job, that's the general Spark UI for the whole cluster. And finally I can get to the same page if I go to Clusters, click on a certain cluster, and pick Spark UI as one of the options there. So there are a few different paths you can take, depending on whether you want to go straight to a particular job or see what's going on across the entire cluster.

Let's try that out. Let's go and grab something and do a quick Delta thing; I'm just going to run that, attach it to my cluster, and leave it running. I don't know if that demo is going to work, but we'll see. OK, that one didn't work; maybe I can run the other one. I'll just run a few things, and then we'll have a few things going on the cluster that we can try to debug. So, in my Spark UI, a few things. There's the general config, and the event log is super useful if you're trying to figure out what's happened on a cluster: who started it, who shut it down, and when. Also, really usefully, if you've got autoscale enabled on your cluster you can see it in this event log: you'll see "cluster resizing triggered" and "this cluster is idle, so I'm bringing it back down". The event log is super useful for keeping track of that stuff. It also catches a load of error-type things, so if you're seeing any weird responses because it can't spin up a node or can't get hold of the driver, all that kind of stuff, you'll see those appearing in the event log; hopefully that's really rare.
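As an aside, if you'd rather grab the Spark UI address programmatically than click through, the SparkContext exposes it; on Databricks the address is proxied behind the workspace, so the links in the UI are usually the easier route, but the property itself is standard PySpark:

```python
# URL of the Spark UI for the current application, or None if no UI is running.
print(spark.sparkContext.uiWebUrl)
```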
Most of the other stuff we're seeing is in the Spark UI itself. At a super high level you've got Jobs: the individual whole chunks of work handed to the Spark driver, not just by me but by anyone using that cluster; it's kind of like the execution log of things going in. Then Stages: those jobs break into multiple stages, depending on how many exchanges there are within each query, and that's the lower grain; stages are essentially child objects of jobs, where we get a better idea of the number of tasks and how the tasks were spread across workers. Storage is for caching: if I've got a particular DataFrame and I've said dataframe.cache or dataframe.persist with an in-memory option, it will store that data in memory across my workers. If I'm curious, wondering whether the blocks of data are massively skewed, so one worker's got loads of stuff in memory and its cache is eaten up while another's fairly empty, and I'm trying to troubleshoot that, or even if I'm just deciding whether it's worth caching data and I want to check what's already cached (if I cache my big data set, am I going to knock existing things out of the cache?), you can see all of that in the Storage tab. Storage is an annoying name, because it makes you think of physical storage, files on disk; it's not, it's all about memory caching, what's in memory across my various workers. Environment is environment config, the state of things: environment variables, what version of Spark you're on, all of that. Executors is all my different executors: in this case I've got two nodes in my cluster, so I'll see two records for my workers and one for my driver, and I can go and ask how busy they are. Oh, that worker is overloaded, it's got tons and tons of tasks waiting, the other one's idle; I've got skew, I've got something going on, and I can dig into it. SQL, as we saw, is where we get that graphical view: it's essentially the output of the Catalyst engine, how it interpreted the query and the query plans it's using. JDBC/ODBC is for people poking that endpoint, so if I've got things like Power BI or Tableau using Databricks as a data source and throwing queries in, I can get some information about what's going on there. And finally there's the new Structured Streaming UI, so if I'm doing streaming-type stuff I can go in there and see what's happening; that's really new and collects loads of really useful info.

But let's do a more detailed dive. I'm in Jobs, and I can see there's one job currently running, so something is running. How big is it? It's got four stages, quite a few tasks, and only one of my stages is running. OK, does that make sense? Why is it taking so long? I can dig into that, and we'll do that in a second. I also get completed jobs and failed jobs, so you get a really quick, high-level overview: what's currently running, what succeeded, what failed.
So if someone says "I had an error on Databricks, can you come and help me out?", you can hop onto that cluster, as long as it's still turned on, go to the failed jobs and go: OK, something failed, what happened? Oh, my tasks failed, and then you can drill down, see the errors, and get the failure reason; OK, there's a problem, I had a Java exception, that's a weird one, and you get lots of info from there. So debugging failures and drilling down into user activity is normally done through that route. If we go into the one that's currently running: why is that taking so long? It's been going for a little while. I can drill down on the description, and it gives me a job page with more detail for that job; I'm still in the Jobs tab, just inside a particular job, and again that's where I can go and look at the associated SQL query to find out what's going on. You can see how many stages there are: it's currently in the middle of running one stage, there's another yet to run, and it's completed two already. So if I've got that query planned out in my head (a small reading stage, a bigger transformation stage, then something fairly hairy, then another stage), I can ask which stage it's at. OK, it's trying to collect a result, doing one of the steps of my query. What's actually going on in there? I asked it to display my skewed DataFrame; I've got something deliberately skewed, and it's taking a while. There isn't that much useful stuff at this level unless you really like going down into the depths, but why is there one task still remaining? That feels a bit weird. Normally you see it chunking through the tasks at a regular rate, but at this point there's one running task, it's got through 199 of them, and it's been going for four minutes. That seems a little odd, so we can drill into the details for that particular stage. I've got a stage here with its stage ID, and drilling into it takes me into the Stages tab, looking at that one stage: total time across all tasks, how much data it's reading, so you get some high-level stats of what's happening in this stage. You can go up to the parent job view, and you've got some different views of what's going on; I can see it is actually doing shuffles. This is the RDD visualization, the DAG, which can be a little harder to read than the SQL view that's associated with the whole job: you've got the SQL view for the job as a whole, and the DAG for this particular stage, and it depends on the level at which you're trying to understand things. Then, for those 200 tasks (we knew it was on number 199 out of 200), I can look at all of my different tasks, and I get this list of succeeded tasks: how big were you? Most of them took three seconds, three seconds, two seconds, all nice and quick; most of them are reading around 12 MB of data, really small, and then there's a beast: one of my 200 chunks of data is about a gig.
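A hedged way to get the same skew signal from a notebook is to count rows per partition; skewed_df here just stands in for whatever DataFrame you're displaying, the name is not from the video:

```python
from pyspark.sql import functions as F

# One row per partition with its record count. A handful of huge partitions
# next to lots of tiny ones is the same skew the task list in the Stages tab shows.
partition_counts = (skewed_df
                    .groupBy(F.spark_partition_id().alias("partition_id"))
                    .count()
                    .orderBy(F.col("count").desc()))

partition_counts.show(10)
```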
So I've got massive skew: there's one particular task that's way, way bigger than the others, and I know I need to look into that and figure out how to resolve it. That's the whole idea in the Spark UI of going from Jobs at the top level (what are all the things that are happening?), to the job detail (what's going on in this particular job, what are the stages, how long has it taken, how much data am I reading?), down into a particular stage with some high-level stats, and right down to the task level: in this stage I had 200 tasks, or ten tasks, whatever it happens to be, how big was each one, where's the skew, where's the bottleneck? I can see there's a problem here, and I can then do a bit of profiling: show me how many records there are, am I doing a repartition on a certain column that's causing this skew, in which case I can change how I'm doing it. That habit of drilling down and down through the levels is what makes the Spark UI really useful. So that's query debugging: Jobs, job details, Stages.

Now, Storage. There are two things we use Storage for. As I said, there's caching; I've not cached anything currently, but if we had, we'd see an RDD here with basically a query plan attached. When you cache something, it's not like SQL Server, where you say "store this as a global temp table" and anything that refers to that name gets it back; that's not how caching works in Spark. Caching is keyed on the execution plan, so if someone else writes the exact same query, a DataFrame against the same location with the same transformations and so on, it will actually reuse the cached version. Caching persists across the cluster; when you restart the cluster it disappears, and if something tries to overfill the cache it will drop out whatever was accessed longest ago, so it's last-accessed, first-out style caching. It's not always going to be there, but if you do cache something and then someone else runs the same thing in a different notebook, in a different Spark session but on the same cluster, it will reuse that in-memory cache, so be a little careful to make sure you're not accidentally doing that if you don't want to.
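As a small sketch of those caching calls (the path is made up; the point is the persist and unpersist lifecycle and the properties you can check from a notebook):

```python
from pyspark import StorageLevel

# Cache is keyed on the plan, not a name: anyone on this cluster who builds
# the same plan can get the in-memory copy back.
big_df = spark.read.parquet("/mnt/lake/big_table")   # hypothetical path
big_df.persist(StorageLevel.MEMORY_AND_DISK)
big_df.count()                                       # an action materialises the cache

print(big_df.is_cached)       # True once persisted
print(big_df.storageLevel)    # the level you asked for

big_df.unpersist()            # free the executors' storage memory when done
```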
Then there's the other thing in Storage, the Delta table state. That's when I've accessed a Delta table and Spark has gone off, read the transaction log, and cached it, so rather than reading the transaction log, working out what to do and then running the query every time I touch that table, it just persists that state, because it's tiny, a handful of bytes or kilobytes, to save that little bit of lake access and speed things up a touch. And with checkpointing and the other new stuff coming in to speed up Delta state access, that's only going to improve. So Storage is all about caching: you can see how much data I've currently got cached, and it's not much. If I had a big one here I'd see how many partitions it's in, and you can drill down, although that won't tell us too much on a Delta state object; I can see there's just one RDD block, how big it is, and which executor it's on. For my horrendously skewed DataFrame, if I cached that I'd see 200 RDD blocks, and one of those blocks, the massive six-gig one, would be on a particular worker, so the cache on one of my workers would be way, way more utilized than on another. Again, you can see that kind of thing by coming into Storage, looking at the various things that are cached, and drilling into the one that's a problem.

The other way of seeing that is via Executors, so let's switch over to the Executors tab. You can see what's going on on my cluster: I've currently got three active nodes, one driver and two workers (your driver always appears as one of the active nodes), and how much storage memory I have across my various workers. I'm using about 20 MB for the Delta cache, and I've got around 10 GB of available cache that I'm not using. So if you're looking at this cluster and people have been caching stuff, you can go in there and ask: how used is it? Is it free, should I start caching things because I've got a load of memory sitting there unused, or is there actually a load of stuff reserved in the memory cache that I don't want to kick out? If something's going really, really slowly, it could be because it's trying to get data into the cache and having to evict something, put a bit in, evict something else, put another bit in, and that thrashing could be the reason it's slow: your storage memory is full on those workers, and maybe you need a VM size for your workers with more memory per CPU, or more workers to spread the work out. You can also see total utilization there: I've had some failures, and I've got some active tasks on my individual executors. OK, so my cluster is currently fully utilized: both workers have four active tasks, meaning all four slots on both workers are full. So if someone else wanted to run a query and it was sitting there waiting, going "can't run", it's because the cluster is currently in use. If you're using a cluster for concurrent executions and wondering why something is taking so long, you can come in here and see that something's running and the cluster is full; then, to find out what's running, go back to Jobs and ask what's using up those eight slots. OK, there's an active job, and that's still my skewed one; it's got past the other bits but it's still eating up the cluster. So you can use Executors to work out holistically how busy your cluster is, and then jump to Jobs to see what's actually running on it. All these different panes give you a different viewpoint on the whole cluster, and again, this is cluster specific.
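If you want a programmatic peek at the same "how busy is my cluster" question, PySpark's status tracker exposes the active jobs and stages; a rough sketch, only showing a couple of the calls:

```python
tracker = spark.sparkContext.statusTracker()

# Job and stage ids currently running in this application.
print("active jobs:  ", tracker.getActiveJobsIds())
print("active stages:", tracker.getActiveStageIds())

# Per-stage detail: how many tasks exist versus how many have finished.
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info:
        print(stage_id, info.name, f"{info.numCompletedTasks}/{info.numTasks} tasks done")
```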
Now, SQL. Again, this is showing us the user's executions, just a different way of seeing what's going on; anything that's gone through the SQL engine we'll see here, and we can drill down. It's related to a job, so we can see this is still the same job; I can go back to that job pane, or click into it and ask what it's actually doing, and I get that SQL-query-style plan. For the stage we looked at earlier we had the DAG specific to that stage, which explained one exchange; what we were saying earlier, that each exchange denotes a different stage, here, because this is the whole job, we can see what those exchanges are and when it's doing a shuffle. So I can say: OK, this is doing some work, it's scanning Parquet, bringing data in, doing some filtering and projecting, and it's doing that on two different sets of Parquet files. Essentially, reading that file is one stage, reading that other set of files is another stage, and then I'm heading through an exchange: it has to put the data down, chop it up, and put it back on the workers in a different arrangement, then it goes into a sort merge join, which can perform quite badly; but I can troubleshoot that, I can understand it, and it makes sense why there's a problem with skew in there. Then I'm doing an aggregate, so each worker computes its part of the aggregate and sends its results off to be collated, which is why there's another exchange. It's a really nice view: if you're trying to understand what's going on inside a given stage, use Jobs, drill down to that stage, and look at the DAG for the fine detail; if you're trying to work out why you've got four stages, why it's doing so many shuffles, why it's doing what it's doing, the SQL view gives you the overall query plan. You could get much the same thing by doing dataframe.explain and reading the text version, but this is nicer to look at, and you get the things it's already done: it gathers stats while it's running, so you can actually work through it and see results as it goes. So the SQL tab is super useful.
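For the text version mentioned above, here's a hedged sketch of a join-plus-aggregate and its explain output; the paths and column names are invented, and whether the join actually shuffles or gets broadcast depends on your data sizes and settings:

```python
from pyspark.sql import functions as F

orders = spark.read.parquet("/mnt/lake/orders")        # hypothetical paths
customers = spark.read.parquet("/mnt/lake/customers")

joined = (orders
          .join(customers, "customer_id")              # typically an exchange on both sides
          .groupBy("country")
          .agg(F.count("*").alias("orders")))          # another exchange to collate results

# Text version of what the SQL tab draws: look for Exchange and SortMergeJoin nodes.
joined.explain()
```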
I don't have JDBC running at the moment; that tab gives you session information, so if I don't know much about a user because their queries are coming in externally, I can see some detail there and whatever stats it's gathering. I'm expecting this to get a lot better with some of the Redash work as well: as we start to see Databricks with its own dashboarding engine and all that kind of stuff, we should see some cool things coming in there. And Streaming: I don't have anything streaming right now, and I'm going to do a separate session on the Structured Streaming UI because it's quite new and has a lot of cool stuff in it, so we'll look at that another time. So that's the basic Spark UI, and even "basic" has a huge amount of info in it for lots of different scenarios. There's also a historical cut: if I've shut down my cluster, I can go in here and get an idea of what happened historically. I don't know how many of these will still have the history, but we can try. There we go: it goes off, queries the old log files, and gives me a view of what happened, so I can see I don't have much actually running on here. If you've been running on a cluster fairly recently, you can dig into its history and get an idea of these things, so even if you happen to have shut your cluster down you can still get some of the logs. It's not as fine-grained, you don't get right down into the depths, and it's obviously no use for querying what's happening live, because your cluster's turned off, but there's definitely some useful stuff in there.

Now, finally, let's look at metrics. Metrics are super useful, and we keep getting the question: how do I know if my cluster is being utilized? How do I know if I should have a bigger cluster? I've turned the cluster on and I'm using it a lot, but are we actually using it, or should I turn it right down and have a sliver of that cluster? That's exactly what Ganglia is for. You get a load of historical metrics snapshots you can dig into, this kind of PDF of what's going on on my particular server: cluster load, memory usage, CPU, and while it doesn't look that pretty, you get a lot of information about how your cluster is being utilized. So if you're looking at a cluster that's been turned on and has had a whole day's worth of activity, you can go into Metrics on that cluster and you've got these hourly or fifteen-minutely snapshots giving you a view throughout the day of how busy it was. If you get a load of service requests and you're looking at when people complain, and everyone complains at lunchtime, you can go and look at the metrics and say: ah, everyone on their lunch break was having a play, testing out the new functionality, and they've battered the cluster; and you can see that in the historical snapshots. You also have the live Ganglia UI, which opens a different dashboard and is much more interactive: you can drill into what's going on, see different things, and change the time window. I'm looking at an hour currently, or two hours, depending on how long my cluster's been turned on. You can explore this to get an absolutely gargantuan amount of telemetry about your cluster. So if you want to know how your cluster is being used, whether your executors are completely skewed, whether people are hammering it during the day, you've got a lot in here. This is just the main view; there's a ton of stuff, you can pick out a given worker if you think one is having particular issues, and there's a whole load of extra views built in. Not all of them work, we get a few odd things in Databricks, but that initial snapshot of what's going on on my cluster is incredibly useful. So if you're not currently using Ganglia, if you didn't even know it was there (it is a little bit hidden away): on your cluster, go to Metrics, go to the Ganglia UI, and you can see that telemetry. Some of that telemetry can also get kicked out to Log Analytics, so you can get some of it after the fact, but certainly if you want to tweak or play with the sizing of a cluster and see what's going on, look at the Ganglia UI, have a play, get some of the stats, run a few jobs during the day, then go back afterwards and see if you can find where the performance bottlenecks were and why they were happening. It's just a super useful thing to be aware of. I'm kind of sad I left that to the end of the video, so if you made it this far, well done, you've got Ganglia and you can go and have a look at it; the folks who didn't make it are just going to have poorly performing clusters, sucks to be them. Alrighty, so hopefully that was useful.
I know there's a huge amount of info there. The whole thing is getting used to that little hierarchy: what are my jobs, what are my job details, how many stages did I have, what's going on in each stage, what's going on at the task level. If you understand that hierarchy of how Spark breaks up work, how Spark thinks about taking data and turning it into a number of RDD blocks which translate directly into tasks, then if I read a load of data and that's 20 tasks, that means I need 20 slots, 20 little bits of CPU, to process that data in one go. If you've got that right in your head, then when other people are running things and you're wondering why yours is only doing two tasks at once on an eight-slot cluster, you can go: well, someone else is running a query and they're using six of the slots. That idea, that you have a number of CPUs and each piece of work takes so many of them, is how concurrency works, how you manage different workloads, and how the Databricks job scheduler neatly maps different people's queries onto the cluster, jigsawing them together so it can run things at the same time just by using a variable number of slots. So, the Spark UI: super useful, loads of stuff in there, and by far the easiest way to start working with it is to have a go. Get in there, try to debug some stuff, leave a long-running query going, then go and have a look and see if you can find out what it's doing. Let us know how you get on; if you've got any questions about it, anything you wish was in there, anything you don't know where to find, drop us a note in the comments and we'll see what we can do to help. Otherwise, don't forget to like and subscribe, and we'll see you in a future video. Cheers.
Info
Channel: Advancing Analytics
Views: 10,674
Rating: 4.9583335 out of 5
Keywords: spark, databricks, data engineering
Id: rNpzrkB5KQQ
Length: 30min 19sec (1819 seconds)
Published: Thu Sep 17 2020