Data Caching in Apache Spark | Optimizing performance using Caching | When and when not to cache

Captions
So, hi everyone, and in today's session we will discuss data caching. Let's start without any delay; let me share my screen and we will get started.

Today's topic is data caching. Before we talk about how it works, the first thing is to understand what data caching is, why we use it, when to use it, and when not to use it.

Data caching is all about bringing data in and pinning it into memory. By that I mean you load data from a source (from a file or from a table) into a DataFrame, and then pin that data into memory so it remains there until you choose to release it. You keep it in the Spark executors' memory. That's what data caching means. You can cache a DataFrame, you can cache tables, and you can even create a view and cache that view in memory.
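As a minimal sketch of what this looks like in PySpark (the file path is a placeholder, and in a Databricks notebook the `spark` session already exists, so the builder lines are only needed in a standalone script):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-basics").getOrCreate()

# Load data from a source file into a DataFrame...
df = spark.read.parquet("/data/events")  # placeholder path

# ...and pin it in executor memory. The data stays cached (from the
# first action onwards) until you release it.
df.cache()

# Release the memory when you no longer need the data.
df.unpersist()
```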
The next question is when to use data caching and when not to. The most important and basic reason for caching is to gain performance benefits. If you load data from disk and bring it into memory, it remains there, and if you then reuse that same data (read it again or access it again) Spark can take it from memory instead of going back to disk. You read it once, bring it into memory, and every additional operation on that data happens from memory, so you get very fast performance: the disk I/O and network reads are eliminated.

So when should you use it? There are three reasons to cache, and three opposite reasons not to.

First, if you are going to reuse the same DataFrame again and again in your code, your application, or your data flow, you have a good use case for caching it. The same goes for a table: if you will query the same table multiple times, cache it so that all the subsequent operations on that data can happen from memory. Conversely, if you are going to use a DataFrame or table only once (just one query, or one set of transformations and one action) there is no benefit to be had, so do not cache it.

Second, you should have enough memory to cache the data. If you don't, Spark either won't be able to cache the data at all or will cache only a small fraction of it, which will not give you any meaningful performance benefit. For example, if you want to cache 5 GB or one terabyte of data, make sure your application has enough memory allocated for that amount. If you don't have enough memory, don't cache, because you are not going to get any benefit out of it.

Third, cache data only when you expect a significant, meaningful performance gain. When would you not get one? Suppose you have a small dataset, a 5 or 10 MB lookup table. There is no point in caching it: even if you read it ten times, reading such a small volume from disk versus from memory makes no noticeable difference. So we don't bother caching small datasets; we cache large volumes of data that we want to use repeatedly and that we have enough memory for.

That's the basics of data caching. The next part is to understand how it works, along with some associated questions that I have listed here.

The first question, which students often ask me, is this: Spark is advertised as an in-memory computation engine, meaning it performs all its operations in memory. If Spark is already doing everything in memory, why do we need to explicitly cache data into memory? That's an obvious question, and we'll work out the answer.

Then there are some other questions. How do I check whether data is cached? If I cache a DataFrame or a table, is there a way to validate that it actually got cached, or a place where I can see which data assets are cached in my application? We will learn that as well. Next: I cached some data and it's pinned in memory; now I am running a SQL query or a DataFrame transformation on that same data. How do I verify whether the data is coming from the cache or whether Spark is going back and reading from disk again?

Finally, there are two more questions. We can certainly cache a DataFrame, but can we cache a table? Can I cache an entire table, or create a view on the table (selecting a few columns, applying a filter, a smaller slice of the table) and cache that view? The answer is yes, it is possible, and we'll see it in a demo. But when you cache a table or a view, the next obvious question is: what if somebody modifies that table? A view is just a runtime query; I cached it, and the data I asked for came and was pinned into memory.
But behind the scenes, some other session or some other application may be modifying the table itself. What will happen then? Do I get stale output, or will Spark detect the change and refresh the cache automatically? That's the last question. We'll try to understand all of this with examples; I'll give you a small demo and we will see it all in action.

Let me switch to my Databricks Community Edition. I've prepared a notebook for the demo. To run it we need a cluster; I already have one running, so let me attach my notebook to it.

The first step I want to show you is creating a DataFrame by reading data from a source file. We have this source file, firecalls.csv, sitting at this location. I want to read the data and create a DataFrame; you have already learned how to do that, so let me run this cell, and as a result Spark will read the data and create a DataFrame.

What exactly is Spark going to do? I hope you already know, but let me repeat it for this example. Spark will go and read only the first block of the file to infer the header information, so that it can build the list of column names, and then it will go back and read the same file again to infer the schema for all the columns. When we read a CSV file with these two options, Spark triggers two jobs: one job reads the header from the first block (it reads only that one block), and the second job reads at least one data file. Since we are reading a single file, it reads the entire file, all of its partitions, and infers the schema for every column. That's what this cell is going to do.
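For reference, that first cell roughly corresponds to the following sketch; `spark` is the session the notebook provides, and the file path is a placeholder for the location shown in the video:

```python
# Reading the CSV with header and inferSchema. These two options make
# Spark trigger two jobs: one reads the first block to pick up the
# header row, the other reads the data to infer the column types.
fire_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/databricks-datasets/.../firecalls.csv")  # placeholder path
)
```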
If you want to see all of that, go to the Spark UI. I hope you are already familiar with it; the first tab is the Jobs tab, and we can see two jobs here. Let them finish... okay, already done, so let me refresh the Jobs page. Spark triggered two jobs. The first job read one partition to pull out the header information, which is why you see only one task for it. The second job read the entire file, and since that file has nine partitions, you can see nine tasks. That's how Spark works when you read data with options like inferSchema and header.

Now, the DataFrame is already created; try to hold on to that point. We saw two jobs execute, but nothing is kept in memory. If you want to see what is cached or pinned in memory, go to the Storage tab: everything there is zero, so nothing is cached. Spark reads the data, creates the DataFrame, and then leaves it. For the read it brings everything into memory, but once the DataFrame is created, Spark is done; it does not keep track of what is in memory, nothing is pinned, and the garbage collector will clean up the memory if it is unused. That was the first step.

Now that the DataFrame exists, let's see what happens if I run a query on it. This is what I'm doing: a simple query with a groupBy, then an agg computing the max and min of the delay, and then a select of three columns: zip code (my grouping key), max delay, and min delay (my two aggregates). Then I add an action: the first three are transformations, and the last one, write, is the action; I am writing the data out to this location. This will trigger some more jobs, so let me run it and we will see what happens.
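That query cell roughly corresponds to this sketch; the column names and the output path are assumptions based on what the video describes:

```python
from pyspark.sql import functions as F

# Three transformations (groupBy, agg, select) followed by one action
# (write). Only the action triggers execution, and Spark recomputes the
# full lineage for it, including the CSV read above.
(fire_df
    .groupBy("ZipCode")                                # assumed column name
    .agg(F.max("Delay").alias("MaxDelay"),             # assumed column name
         F.min("Delay").alias("MinDelay"))
    .select("ZipCode", "MaxDelay", "MinDelay")
    .write
    .mode("overwrite")
    .parquet("/tmp/fire_summary")                      # placeholder output
)
```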
So I executed this action, along with three transformations, on the same fire_df, and that DataFrame was already created earlier. At this point anyone might make an assumption which is wrong, given Spark's architecture and the way Spark works: you may think that since the DataFrame is already created, it should be available in Spark's memory, and these three transformations and one action should execute on it without any need to read the data again. But that's not how Spark works. Spark works action by action. For each action, Spark backtracks through everything it needs to do for that action and does all of it from scratch. For this write action, Spark backtracks: I need the select transformation, the aggregate, the groupBy, and those apply to fire_df; but what is fire_df? It backtracks further: for fire_df I need to read this data from disk. So Spark will go read the data again, then apply groupBy, then aggregate, then select, and then write. All of this happens again even though I had already created the DataFrame; Spark will not reuse it, it recreates it.

And you can see that. If I refresh the Storage tab, nothing is cached, which is expected; Spark never caches anything automatically. To see what just happened, go to the SQL / DataFrame tab and look for the last query, the one I just executed. Click it and you will see the execution plan, also known as the Spark DAG (directed acyclic graph), which represents step by step what Spark did. In the current versions of Databricks and Spark, the plan reads from bottom to top: the first operation is at the bottom and the last is at the top.

So the first operation is Scan CSV. The code I executed doesn't have a read operation; it starts from fire_df, which was created earlier. Yet the plan starts from Scan CSV, which means Spark is going back and reading the CSV file. Expand the node and you see the details: number of files read, one; rows output, so many; size of files read, 1085.2 MB, approximately 1 GB. To execute this query, Spark is reading the data all over again, and that's why the Scan appears. Then it applies an aggregate, then an Exchange (which represents a shuffle), and then a second aggregate. An aggregate always happens in two parts, before and after the shuffle; maybe we will cover that in more detail when we dig into DataFrame internals, but whenever you do a groupBy-aggregate, Spark triggers a shuffle, and the aggregation is performed in two parts, one before the shuffle and one after. After that the result is prepared, and the Overwrite By Expression node is the write operation. That's exactly what we asked for: groupBy, aggregate, and a write.

The point I wanted to show you is this: when you take a DataFrame, apply some transformations, and finish with an action, the whole chain executes for that action, including the part that creates the DataFrame. One action executes the entire chain from the beginning, reading the data once again. Once the read is done and fire_df exists in memory, the groupBy, aggregate, select, and write all happen from memory, and that's why we call Spark an in-memory computation engine: after reading the data once, everything else up to the action is performed in memory; Spark doesn't keep going back to disk within a single action. But once the action completes, Spark is done; it won't keep fire_df in memory for future use. It assumes you are not going to reuse it; if you want to reuse it, you must explicitly cache it, and only then will Spark keep fire_df in memory.

So what does that mean? If I execute another, different query on the same fire_df, I hope you can guess the answer: Spark will run everything in memory, but it will backtrack to where fire_df is created and read the entire dataset once again. Let me run it once more so you can see that.
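If you prefer code to the UI, the same physical plan can be printed with `explain()`; a minimal sketch using the aggregation above:

```python
# Print the physical plan without running the query. Uncached, the plan
# bottoms out in a Scan csv node, and you can see the two HashAggregate
# steps around the Exchange (shuffle) that the groupBy produces.
(fire_df
    .groupBy("ZipCode")                      # assumed column name
    .agg(F.max("Delay").alias("MaxDelay"))
    .explain())
```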
The first query took 52 seconds, and this one will again take a long time, because to apply these transformations Spark will go and read the data and create fire_df once again. While it finishes, let's go back to the Spark UI; we can see the execution plan even while the query is running (the grey nodes are operations that haven't completed yet). Look at the plan for this second query: again we see Scan CSV. Spark is going back and scanning the CSV file once again to create fire_df, and if you expand the node you again see the numbers: number of files read, one; rows output, so many; and the same file size. This query finished in 39.44 seconds.

So, in my scenario, I want to create one DataFrame and then run two queries against it, on the same dataset. But looking at the execution plans, each query read the data for itself: we read the data twice from disk, and that's why they take so long, 52 seconds for one and 39 for the other. This is a perfect use case for caching. What I want to do is read fire_df and cache it; once it is cached, I run my two queries on it and expect that they will not go back and read from disk; they will take the data from memory and run fast.

So let's try that. I'll uncomment the cache line and start from the beginning. First, go to the Storage tab and verify nothing is cached. Then I read the data, create the DataFrame, and cache it. Let me run it: Spark will again read the first block for the header, then read the file to infer the schema, and the DataFrame is created. Then I tell Spark that I want to cache this DataFrame; the next line is fire_df.cache(). But cache is a lazy operation, so Spark will not cache the DataFrame immediately. Unfortunately the entire dataset has already been read once for schema inference, but Spark still won't cache it; it simply notes down that this DataFrame should be cached, and it will do so later, when we run our first action on it. I'll show you: if you go to the Storage tab and refresh, you still see nothing; everything is zero. I executed the cache operation but nothing is cached, because DataFrame caching is lazy. Still, I have at least told Spark that I want this cached, because I plan to reuse the same DataFrame again and again.

Now let's see what happens when we run those two queries, each a set of transformations plus one action. Since the DataFrame is not yet cached (we only told Spark to cache it, and cache is lazy) the first query will go and read the data once again.
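A sketch of that cache step; `count()` here is just one convenient action you could use if you wanted to force the cache to materialize up front:

```python
# cache() is lazy: it only marks the DataFrame for caching.
fire_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/databricks-datasets/.../firecalls.csv")  # placeholder path
)
fire_df.cache()

# Nothing shows in the Storage tab yet. The first action both reads
# the data and populates the cache, paying the full I/O cost once.
fire_df.count()
```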
Here is how that first operation proceeds. Spark looks at the write action: to perform the write, it first needs to apply select, aggregate, and groupBy on fire_df. How is fire_df created? It is defined here, so Spark has to read it, but it also has to cache it. So the first thing Spark does is read the data once again, then cache it, and after caching it applies groupBy, aggregate, select, and write, all from memory. Once this is done, Spark does not remove fire_df from the cache; it keeps it there. When I run the second query, Spark again starts from the write action: before it come orderBy, count, groupBy, where, and select on fire_df, and how does it get fire_df? It looks back: this is how fire_df is created, but fire_df is already in the cache, so Spark won't read from disk; it takes it from the cache. That's how these two operations will play out: since cache is lazy, the data is cached on the first action on the DataFrame, and after that the second, third, tenth action on the same DataFrame all read it from the cache.

Let's watch it happen in the Spark UI. Before running, check the cache: I have executed the spark.read and the cache(), and since cache is lazy, nothing is cached as of now. When I run my first operation, Spark will cache the data and then perform the operation, so this one will still take a long time: it reads the data from disk once again, caches it, and then applies all the transformations and the action. While it runs we can already see the execution plan; starting from the bottom you can still see Scan CSV, and if you expand it you see only partial numbers because it hasn't finished. Let's let it finish and look at the plan afresh; I don't want to show intermediate results and confuse you. It's going to take at least a minute, since the uncached run took nearly that long.

Okay, it's done, and it actually took more time: 2.16 minutes. It was doing extra work, caching the data, which is why it took a little longer. Now look at the execution plan: Spark UI, SQL / DataFrame tab, most recent execution. In the Scan CSV node you see the details: number of files, one; rows output, so many; size of files, the same 1085.2 MB. So for this query I got no benefit from caching, because this was the first action on fire_df and cache is lazy: the first query still pays the full cost. But if I reuse the same DataFrame for another set of transformations and an action, that should be fast, because this time Spark will not go back and read the data from disk; it will take it from the cache, because the data is already cached. How do I know it's already cached?
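From code, one quick check is the DataFrame's storage level, and for tables there is a catalog API; a sketch (the table name is hypothetical):

```python
# A DataFrame's storage level reflects what cache()/persist() requested:
# before cache() nothing is enabled; after cache() it reports a
# memory-and-disk level. The Storage tab shows what is materialized.
print(fire_df.storageLevel)

# For tables and views, the catalog can tell you the same thing.
print(spark.catalog.isCached("fire_calls"))  # hypothetical table name
```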
In the Spark UI, I go to the Storage tab, and there it is: an RDD cached in Spark memory, nine partitions, 100% cached, and the in-memory size is only 451 MB. The size on disk is higher because the source is a CSV file; the cached in-memory format is more compact. So the data is in memory now.

Let's run a new query on that same DataFrame; earlier it took 39.44 seconds. Done: 3.92 seconds. That's a manifold improvement. It ran faster because fire_df was already cached in memory. Think of a scenario where you create a DataFrame once and then apply 10, 15, 20 different transformations and actions on it: if the DataFrame is large enough, it is recommended that you cache it. The first action will still take some time, but all the rest will be fast because they take the data from the cache.

Look at the execution plan for that last query and you can see what is happening. Starting from the bottom, the Scan CSV step is still there, even though I was expecting the data to come from memory because it is already cached. But expand the node and everything is zero: number of files, rows, sizes, all zero. What does that mean? Spark knows it would need to read the CSV, but it is not actually reading it; it is taking the data from memory. And this is what you will see in the execution plan when your data comes from the cache: an InMemoryTableScan step. Look at its details and it shows the number of rows that came from the cache. Compare with the previous plans: there we had a Scan that went straight into the first aggregation, the shuffle, and the second part of the aggregation, with no InMemoryTableScan step. In the most recent plan the data comes from cache, so the plan shows InMemoryTableScan, and the Scan CSV step has all zeros: Spark knows the table lives on disk, but since it is already in cache it takes it from there, then applies the first aggregate, the shuffle, the second aggregate, and then one more Exchange (another shuffle, for sorting, because this query has an orderBy), then prepares the result, and finally performs the write. That's what this execution plan is showing.

So I hope you now have the answer to the first question: if Spark is an in-memory computation system, why do we need to cache data when we want to reuse the same DataFrame again and again? Because for each action, Spark is designed to execute everything required to complete that action, including reading the data from disk and creating the DataFrame. For every action it goes back and reads the data again (once per action, with the rest of the transformations performed in memory). So if we want to reuse a DataFrame and keep it in memory, we have to tell Spark explicitly, and the way to tell it is the cache operation. Cache is a lazy operation, so Spark will not cache the data immediately; it caches it when the first query executes.
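If you want to confirm from code, rather than from the UI, that a query is served from the cache, one option is to capture the output of `explain()` and search it for `InMemoryTableScan`; a sketch:

```python
import io
from contextlib import redirect_stdout

# explain() prints to stdout, so capture it and search for the
# InMemoryTableScan node that indicates data is read from the cache.
buf = io.StringIO()
with redirect_stdout(buf):
    fire_df.groupBy("ZipCode").count().explain()  # assumed column name
print("served from cache:", "InMemoryTableScan" in buf.getvalue())
```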
Second, how do you check whether data is cached? Look at the Storage tab in the Spark UI: it lists all the datasets that are cached. We cached only one, so you see a single row; if you had cached four or five datasets, you would see them all there. Third, how do you check whether your cache is actually being used? Look at the query's execution plan, either the DAG representation or the textual representation; in both you will see the InMemoryTableScan step if the data is coming from the cache. Those are the two ways of representing the execution plan (one text-based, one a DAG) and both tell us what Spark is doing behind the scenes.

That's all for those questions. The next thing we wanted to understand is: can we cache a table, or can we cache a view? The answer is yes, you can cache a table and you can cache a view. But then the next question is: what happens if a cached table or view is modified behind the scenes?
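As a preview of that table and view demo, here is a sketch of what table and view caching can look like; the table and view names are hypothetical, and note that SQL's CACHE TABLE, unlike DataFrame.cache(), is eager by default unless you add the LAZY keyword:

```python
# Cache an entire table (eager by default in Spark SQL).
spark.sql("CACHE TABLE fire_calls")  # hypothetical table name

# Or cache a smaller slice of the table through a view.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW delayed_calls AS
    SELECT ZipCode, Delay FROM fire_calls WHERE Delay > 5
""")
spark.sql("CACHE LAZY TABLE delayed_calls")  # lazy variant

print(spark.catalog.isCached("delayed_calls"))

# Release the cached entries when done.
spark.sql("UNCACHE TABLE fire_calls")
spark.sql("UNCACHE TABLE delayed_calls")
```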
Info
Channel: Learning Journal
Views: 8,752
Keywords: apache spark, apache spark interview, apache spark interview questions, apache spark interview questions and answers, spark interview questions, spark interview questions 2023, spark interview questions advanced, spark interview questions and answers for experienced, spark interview questions scenario based, spark tutorial, Spark Data Caching, Data caching in Apache Spark, Cache Spark table, Cache Spark view, Spark performance tuning, Spark performance optimization
Id: KRAS7R2GWgc
Length: 29min 45sec (1785 seconds)
Published: Mon Sep 04 2023