PyTorch Profiler and Designing Evolutionary Data Systems

Captions
Hi everyone, today I will be talking about the PyTorch Profiler and also dive a little bit into the efficiency and sustainability aspects. We'll start by looking at GPU performance tuning, as it is very relevant for this talk, then dive into the PyTorch Profiler, look at how timeline tracing is done, walk through some optimization examples, and then a really exciting topic about the future: how we can move towards sustainable AI.

For many people, shifting from running or training their models on CPUs to GPUs requires a very different kind of mindset. CPUs typically have processors optimized for single-thread performance, while GPUs achieve high throughput via massive parallelism. It is very easy to underestimate the parallelism required to keep your GPUs busy; this can be a surprisingly challenging task, and it can take quite a bit of time to develop a good intuition about it.

If you look at a GPU behind the scenes, GPUs are composed of what are called streaming multiprocessors (SMs). These are the functional units operating in parallel on the elements of some segment of the input data. Compared to CPUs, the streaming multiprocessor is the closest thing to a CPU core, and if you look at a DGX system you will notice it has almost 8x more SMs than the number of CPU cores on the system. If you zoom into a streaming multiprocessor, you will see that it is mostly filled with functional units, and it has many more of them than a CPU core: thousands of FP32 units for a single GPU, and in the newer GPUs you will also start to see more FP16 and INT units as well.

Some of the common pitfalls when you migrate from CPUs to GPUs include excessive CPU-to-GPU interactions, for example looping over launches of operations on the GPU, which causes a lot of back and forth between the CPU and the GPU. Similarly, if your GPU kernels are very short, for example because you are processing many small inputs, you can end up going back and forth, and you need a lot of data to keep all the GPU threads busy. Other bottlenecks include CPU overheads or I/O bottlenecks, where the GPU is just waiting for the data to be loaded; that starves the GPUs and is another source of inefficiency. Finally, there can be inefficiencies in your framework or your model code itself: you could be unknowingly copying data from CPU to GPU every time, for example if you are just saving your tensors, and that can incur additional costs.
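To make that last set of pitfalls a bit more concrete, here is a minimal, hypothetical sketch (not from the talk) contrasting a loop that ships data to the GPU and launches one tiny kernel per element with a single batched transfer and a single batched kernel; the tensor sizes and names are made up purely for illustration.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
rows = [torch.randn(1024) for _ in range(10_000)]  # CPU-resident inputs

# Anti-pattern: one host-to-device copy and one tiny kernel launch per row,
# so the GPU spends most of its time idle, waiting on the CPU.
slow_total = torch.zeros(1, device=device)
for row in rows:
    slow_total += row.to(device).sum()

# Better: one large copy and one large kernel keep the GPU threads busy.
batch = torch.stack(rows).to(device)  # single host-to-device transfer
fast_total = batch.sum()
```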
So visibility is really important, and that is one of the reasons why we built the PyTorch Profiler. We had this as an internal tool inside Meta and we ended up open sourcing it jointly with Microsoft. It was launched towards the start of the year with the PyTorch 1.8 release. It gives you a common way to get to your torch-level and GPU-level information all in one place, it does automatic bottleneck detection, gives you actionable recommendations, and comes with very user-friendly tools, so data scientists can embed it inside VS Code or just look at the TensorBoard profiler UI to see all the results and get the recommendations. It is also very easy to instrument in your model code and use.

Behind the scenes this is powered by the PyTorch Profiler API, which sits in the new Kineto library in the PyTorch source tree (pytorch/kineto is where you'll find the code), and it is all powered through the CUPTI interface from NVIDIA. At a basic level, with the PyTorch profiling API you start by importing torch.profiler, then you set up the profiling context, run your code, and finally print the results; what you see on the right-hand side is an example of what that printed output looks like. We have many good recipes and examples on the pytorch.org site that you can use for reference. That is if you are just using the API and printing the output on the command line.

When you integrate it with TensorBoard, you install the TensorBoard plugin, which is a separate pip install, and then again you do the same torch.profiler import. In your profiling context you just need the extra step of specifying the tensorboard_trace_handler and where to save the results. That is the only extra step needed, and it generates a trace file which you can visualize in the TensorBoard plugin. These are some of the views you will see in the TensorBoard profiler: there is a nice kernel-level overview, you can see all the timeline traces in detail, and there are many more views to navigate.

For the advanced features, you can control many things from this profiling API. For example, you can decide when to trigger the profiling: you don't want to be doing it on every step, maybe you want to do it after every few iterations, after the model has gone through some warm-up time, or after some initial waiting. You can control all of that, as well as how many steps you want to profile, which activities you want to profile (just CPU or also GPU activities), and you can supply a callable handler for saving the results. You can also decide whether to record extra data, for example the input shapes, the stack, or the memory, and for the output you can choose the output option: just a Chrome trace, TensorBoard, or other output formats. A minimal sketch of this instrumentation is included below.

Here is an example of invoking the TensorBoard profiler from the VS Code integration. When you launch it from there, you get a jump-to-source feature: when you see source code references in the profiler traces, clicking on them takes you directly to that line of code in the source, which is very handy. This is an example of what the distributed training view looks like: when you are training your model on multi-GPU nodes, you can get insights about whether there is an imbalance across the different GPUs and what the overheads are on the NCCL communication side. In the new PyTorch 1.10 release that just came out we added support for Gloo as well, so you can get all the details for distributed training and optimize your training runs. Another interesting feature on the VS Code side is the data wrangler that has been added: if you are using the Python plugin with VS Code, it comes with context-sensitive help for PyTorch and has nice data-wrangling features, for example a histogram analysis of the dataset that you are trying to use in your model, and there are many interesting features like this added recently to the VS Code Python plugin.
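As a quick, hedged illustration of the usage described above, here is a minimal sketch of the profiling context with a schedule, the TensorBoard trace handler, a record_function annotation, and the printed summary. The model, batch shapes, and log directory are placeholders and not from the talk.

```python
import torch
from torch.profiler import (profile, record_function, ProfilerActivity,
                            schedule, tensorboard_trace_handler)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)                    # placeholder model
data = [torch.randn(64, 512, device=device) for _ in range(10)]

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,                                      # which activities to capture
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),     # skip, warm up, then record
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),  # writes traces for the TB plugin
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(data):
        with record_function("forward"):   # custom annotation that shows up in the trace
            out = model(batch)
        out.sum().backward()
        prof.step()                        # advances the profiler schedule

# Command-line style summary, as in the basic usage
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```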
Now let's dive into timeline tracing. This is what a timeline trace typically looks like when tracing CPU and GPU activities: you see the different threads on the CPU side, all the streams on the GPU side, and at the bottom you see all the details, along with the arrows showing the relationships, how the events are linked together. This is what an actual output looks like in the Chrome trace view, which is built into the TensorBoard plugin. You can see the record functions that you used to annotate your code as well as the actual operations going from the CPU to the GPU. The with record_function(...) context is what you use for annotating; here you see the forward pass annotated with things like tw embed lookup, which is what you put in the code, and then you get all the details below that. It is very easy to navigate, and you can leave these record functions on permanently because they have a very low performance overhead.

One feature that is easy to miss is how to inspect the CUDA activities: you can click on a CUDA activity and then, at the bottom, follow the arrow from the CPU launch to the GPU execution and see the full details, the start time, the category views, and the detailed list of GPU kernels used in the stack view at the bottom. Another interesting feature is that when you click on a GPU activity you can see additional details like the actual SM utilization. Normally people use nvidia-smi to show utilization, but it doesn't give you utilization at the level of individual operations. It may indicate, say, 86 percent utilization for your entire model training, but when you dive into the actual kernels you will see that the SM utilization is often much lower, so there is a lot more room for optimization, and you can get those details from this Chrome trace view. This next example shows the previous trace after optimization: by increasing the request batch size, the SM utilization has gone up much higher.

Now let's dive into some examples of these optimizations using trace analysis. These are from real workloads running model training inside Meta. One common anti-pattern when you move to GPUs is that you will often encounter scenarios where the GPU is idle for a long time, and it is very difficult to figure out why. This inactivity can be easily discovered by adding the record functions you saw earlier: once you add those annotations, they surface where the bottlenecks are on the CPU side, and that helps you parallelize the CPU operations and get more overlap between the CPU and GPU work.

Another anti-pattern is excessive interaction between the CPU and the GPU. In this particular example, the exponential moving average (EMA) hook function was originally written in Python using a for loop, so the CPU was the bottleneck. After rewriting the same function using the PyTorch foreach ops, the loop was executed on the GPU side, and the EMA hook got 100x faster by doing all these operations on the GPU instead of going back and forth with the CPU.
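The talk does not show the actual hook, but as a rough sketch of the pattern being described, here is a hypothetical EMA update written first as a per-parameter Python loop and then with the underscore-prefixed multi-tensor torch._foreach_* ops available in recent PyTorch versions, which issue one fused op across the whole parameter list instead of one launch per tensor.

```python
import torch

decay = 0.999
device = "cuda" if torch.cuda.is_available() else "cpu"
model_params = [torch.randn(1024, 1024, device=device) for _ in range(8)]
ema_params = [p.clone() for p in model_params]

# Per-parameter loop: many small kernel launches, driven one by one from the CPU.
@torch.no_grad()
def ema_update_loop(ema, params):
    for e, p in zip(ema, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

# foreach version: the whole list is updated with a couple of multi-tensor ops.
@torch.no_grad()
def ema_update_foreach(ema, params):
    torch._foreach_mul_(ema, decay)
    torch._foreach_add_(ema, params, alpha=1.0 - decay)

ema_update_foreach(ema_params, model_params)
```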
Another anti-pattern related to this was the optimizer step for RMSprop. On the PyTorch side we have the torch foreach functions, and by using them the optimizer step becomes multi-tensor and everything happens on the GPU side. In the traces you can see a huge difference between the previous version and the new one: in the previous version you see a lot of CPU-GPU operations going back and forth, whereas in the new version it all happens in chunks, and just by making this change we got another 12x speedup.

The third issue I want to highlight is from another one of our training runs, where the forward and backward passes were dominated by the SyncBatchNorm function. The model had about 84 SyncBatchNorms in the forward pass, there was a 3x overhead in the NCCL all-gather for each SyncBatchNorm, and another 2x on each NCCL all-reduce. After optimizing this, we got a speedup of 1.5x in the forward pass and 1.3x in the backward pass.

Here is a case study where we helped one of our customers optimize their NLP models. When they came to us, the initial version of the model was serving 2.4 requests per second, a very low throughput. By going through different optimization steps, looking at the profiler output and running a whole bunch of experiments, we got it to 1.4k requests per second, and with that they were able to meet their SLA for the real-time use case, where each inference had to happen in under 20 milliseconds. In this example we originally started with optimizations on the CPU side, but due to the tight SLA we switched over to GPU inference and employed a number of techniques: model.half(), switching to a DistilBERT-base version of the model, increasing the batch size, making sure we were not over-padding and that all inputs were of the same length, and also experimenting with NVIDIA's FasterTransformer. Similar to this, we did another case study with a customer for an offline batch inference scenario: their original inference pipeline was processing 46 million documents in 21 days when they came to us, and by going through the different optimization techniques we were able to bring that down to two days. As you can see, these optimizations can speed things up drastically and save a lot of cost.
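None of the customer code is shown in the talk, but as a hedged sketch of a couple of the techniques listed above (half precision and batching with consistent padding), here is what that can look like with a hypothetical DistilBERT classifier from the Hugging Face transformers library. The checkpoint name, batch contents, and batch size are illustrative only, and the snippet assumes a CUDA GPU, matching the case study.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda"  # the case study moved inference to GPU
name = "distilbert-base-uncased"  # placeholder distilled checkpoint instead of a full BERT-base
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval().half().to(device)

texts = ["sample request one", "sample request two", "sample request three"]

# Batch the requests together and pad only to the longest sequence in the batch
# (padding=True) instead of padding everything to the model's maximum length.
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.inference_mode():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
```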
This is especially important as the growth of AI models keeps increasing. Deep learning is witnessing exponential growth in the data, the model parameters, and the system resources being used. Model sizes have increased, which has led to higher model accuracy on the different ML tasks. Taking the GPT-3 model as an example, increasing the model quality, the BLEU score, from 5 to 40 required the model to be 1000x larger. We are seeing similar growth throughout Meta on our recommender models and on our language models. Overall the trend is larger models, more data, and more models, and we are not able to keep up with all the requirements on our infrastructure, not just in terms of being able to run these models, but also from a power consumption perspective.

If you look at the power consumption across the different phases of the end-to-end life cycle, from data to experimentation, training, and inference, across our fleet the power capacity breakdown for the AI infrastructure is roughly 10:20:70 devoted to experimentation, training, and inference. We expect the demand to continue to grow exponentially year over year, so we have to spin up megawatts of new power capacity for the new data centers we are building. Similar challenges exist for all the cloud providers, and at a smaller scale every company running these models is facing the high cost of training them; it is putting a lot of burden on the overall infrastructure.

Looking at some of the optimizations we are now putting in place for our language models with the carbon footprint in mind, this is an example of the types of optimizations we have done for large-scale language model tasks. First we started by adding more caching, especially for inference: if an inference result can be reused, then just by doing this platform-level caching, even with CPUs, we were able to get a 6.7x improvement. Earlier, most of the inference at Meta used to run on CPUs; now we have started adding GPU support for the inference side as well, and with GPU acceleration we unlocked another 10.1x in energy efficiency. Finally, the last set of optimizations was algorithmic, and there we got another 10x improvement. The figure on the right-hand side shows the overall operational carbon footprint of these large language models: LM is the language model, RM are the different recommender models inside Meta (our new company name), and on the right you see some examples of open-source large-scale language models. For those, this is only the training footprint; for the Meta/Facebook models we have data for both inference and training.

As you start thinking about what you need to do to build models that have a lower impact on the environment, it requires a very different mindset. You need to look at the end-to-end life cycle, starting from data utilization efficiency: think about data scaling and sampling (can you sample, do you need to use all the data), and whether the value of the data diminishes over time, so things like data perishability become important. As an example, in the case of natural language models we see that datasets can lose half of their predictive value in a period of less than seven years, so there is no point in keeping the data for a long time if that is the case for your scenario; you should assess things like perishability. Other techniques on the experimentation and training efficiency side include experimenting with NAS and hyperparameter optimization, doing multi-objective rather than single-objective optimization to get better results, making use of better model architectures which are resource efficient, and then taking into account the full infrastructure side of the house: how to think about efficient, scalable, environmentally friendly infrastructure.
New types of techniques can be employed here. We are working with some of the cloud providers on carbon-efficient scheduling: a green, resource-aware scheduler which will run your training in a cloud region that is powered by local green energy rather than traditional energy. You can also employ techniques like federated learning, where part of the training happens on device; by splitting the training between the cloud and the device you can get some energy efficiency and carbon savings as well, and you can get better results.

So the call to action for the industry is to move towards adopting better telemetry: start by measuring and publishing the results. We are working with the different research conferences and platforms on how to start publishing carbon impact statements, especially in the model cards, as research papers are published. Here I wanted to highlight one startup that is doing this really well: Cohere (cohere.ai). If you go to their website, they already have an environmental impact statement, and for all their models they publish the carbon impact for the different model sizes and compare it to a transatlantic flight to make it easily understandable. Here are some resources to get you started on the PyTorch Profiler side. We are now working with Microsoft to add a better way of reporting energy usage, so hopefully in the next version we will have better tools for you to quickly measure and also publish the environmental impact of your model training. I will open it up for questions now.

Thanks very much. Sorry, Karen, did you want me to jump into the questions? Yeah, sure, go ahead, Denny, I was just looking to see what questions we have. No problem, perfect. Well then, hey Geeta, for starters, thank you very much for this great presentation. Let's jump right into it, because we have a few questions, more just to get people up and running. One of the first things you called out is the idea of looking at GPU and CPU profile recordings. For folks who are just starting out, what do you think are the first steps to take advantage of reading and learning from the CPU and GPU profiles? You've called out the potential carbon impact, which we will definitely get into with a few questions as well, but without going that far ahead, how do people learn, utilize, and become effective by understanding this particular set of telemetry or metrics?

A great starting point would be to look at the PyTorch profiler recipes. Follow the steps that I showed at the start of the presentation: start with the basic usage, just set up the profiler context and print the results, so that you start to get an awareness of where exactly the overheads are, what the GPU overheads are, and how many resources are being used on the memory front, and you start to build an intuition about it. So start with the API, and for those of you who are more visually oriented, look at the TensorBoard plugin; it gives you very good views to navigate all of this data with charts, which makes it easy to digest.
We have a ton of great blogs and resources, and we will be sharing the slides, so you can take a look at the resources from this presentation; that should help you get started.

Perfect. We've got some questions coming in from LinkedIn right now, so let me dive into them right away. Faehu, hopefully I'm saying your name correctly, Faye, by the way, from Microsoft asked: is there a way to run the profiler without changing any of the code, for example by just adding some parameters to a command? We don't have that at present, but it is one of the features that has been in high demand, so we are working on an on-demand profiler which doesn't require any kind of instrumentation. Hopefully we'll have some of these features available in the next version, PyTorch 1.11, coming out in the first week of March. That's excellent to hear; the easier we can make it, the more likely people are to actually start making use of it.

The second question from Faye is: is the PyTorch Profiler compatible with CUDA Graphs? Do they have some more details, are they looking for the full CUDA graph output in the profiler? I believe that's the implication of the question. That's the tricky thing when we do these LinkedIn-type conversations, where there is a minute-long delay, so Faye, if you can provide additional context that would be great, but let's go with that as the question. We are adding support for more features: at present you get all the GPU kernel-level views, you can see all the input and output shapes, and you can get the full trace. With each version of the profiler we are adding new things, so I would encourage them to open a GitHub issue, and we will review it and see how to prioritize that feature. Makes a ton of sense.

I'm going to ask you to clarify some of the interesting concepts you brought up near the tail end of your discussion, like this idea of perishability. When you're talking about perishability, are you really focusing on model perishability or data perishability? This was about data perishability, and I would encourage you to read our paper, which I linked in the talk. At present, at most companies, when we are building AI and ML models we are collecting data and that data lives for a long time, but there are many scenarios where the patterns change. If you take the retail example, with COVID-19 the buying patterns, how people buy things online or in stores, completely changed; if you were just hanging on to years of data and trying to do predictions based on historical trends, that would not have worked in this whole period. In that scenario your data is already stale; it helps give you some context for your model predictions, but if you are storing stale data for a long time, you also have to think about the future cost of the storage. That's where the data perishability aspects come into play. Oh perfect, that is very helpful.

And then there's another great question: for companies, whether they're starting out or getting closer to machine learning maturity, do you have any suggestions for what you should do from a caching or sharing perspective? There's a lot of buzz about feature stores; is there anything you can share about how you and your company, Meta, approach that inference cache? Inference caching can actually start very simple. A feature store is different from inference caching: a feature store is more often utilized on the model training side, whereas caching of the inference is, let's say, when you are predicting the price of a particular good and the price changes on a daily basis. You don't need to run that inference every time a request is sent to your model; you can compute these values at night, or the first time such a request comes in, and cache the result for the entire duration of the day. People coming from the traditional web development side are already familiar with these caching techniques, so have your data scientists work with your software engineers and combine the techniques from both sides: caching is used heavily in traditional web architectures, so use the same techniques on the model prediction side as well.
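Here is a minimal sketch of the kind of daily inference cache described here, assuming a hypothetical run_price_model stand-in for the real model call; keying the cache on the current date makes entries expire naturally each day.

```python
import datetime
import functools

def run_price_model(item_id: str) -> float:
    """Placeholder for the expensive model inference call."""
    return 42.0

@functools.lru_cache(maxsize=100_000)
def _cached_prediction(item_id: str, day: datetime.date) -> float:
    # Only reached on a cache miss for this (item_id, day) pair.
    return run_price_model(item_id)

def predict_price(item_id: str) -> float:
    # The date in the key means yesterday's entries are simply never hit again.
    return _cached_prediction(item_id, datetime.date.today())

print(predict_price("sku-123"))  # computed once...
print(predict_price("sku-123"))  # ...served from cache for the rest of the day
```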
Perfect, I think that covers all the questions from LinkedIn and YouTube today, so Karen, I think we're good to go. Anything else we should do before we segue? No, that's perfect timing, Denny. Excellent. Well, Geeta, thank you very much. We're going to switch over now to the second portion of this show, which is a fireside chat between myself and Scott, titled Designing Evolutionary Data Systems. There are a lot of implications with that, so before I really go into it, Scott, why don't you start off by introducing yourself and providing some context on who you are and what your background is; that will probably clue folks into what we mean by designing evolutionary data systems.

My name is Scott Haines, and I'm a senior principal software engineer at Twilio, which is a communications and data company now. In general, with evolutionary data systems, I like to talk about where people start out and what steps they take to complete the journey that gets to having an architecture like what's inside of Meta. As an outsider I don't know what runs in there, but you can think about what steps are needed, what systems are needed, to be able to run at scale and actually handle uncertainty and the unknown unknowns without having to derail feature roadmaps and things like that. A lot of times, if you go all in on a specific technology or a specific way of doing things, by the time you get out the door with it, it's already time for change again.
Evolutions happen more often than revolutions, as a guy I work with, Christopher, likes to say, and a lot of times evolutions are the natural path to a new data revolution, let's call it, for this concept. So essentially it's: what components, what building blocks, are necessary to build a system that can scale. And going back to what Geeta was talking about with data perishability, certain things eventually fall off the wagon, and it's okay to let those things go. Early on, career-wise, I was not good at letting go; it's the whole Frozen analogy, right, just let it go. I have a four-year-old daughter who sings that song all the time. I think that's a good thing in general: don't hold on to and hoard the things that worked in the past. Very much like what Geeta was talking about with the roughly 50 percent decay over a seven-year period for data, the same thing happens within a two-to-three-year cycle for components within a data platform. So evolutionary systems are about replacing things while still maintaining a product that's in production, running at enterprise scale, without disrupting everything, because it's a lot easier to disrupt than it is to create.

And one thing I think we definitely want to call out is that the basis of the conversation we're about to have is very much the idea of working with data at scale. That concept is hard, and because it's hard we have to simplify things so that we can break them down into smaller components that are less likely to break under heavy strain or heavy load. By the way, I forgot to properly introduce myself: my name is Denny, I'm a developer advocate at Databricks. I'm a long-time data guy myself; I worked with the SQL Server team before, I was on the infrastructure team for the prototype of what is now known as HDInsight, and I've been a long-time Brickster as well. The conversation Scott and I are going to have for the next 30 minutes or so, and please ask questions on LinkedIn, Zoom, or YouTube in the interim, I'll do my best to look at those questions and ask them, is to discuss these concepts. We're going to break this down into a four-part section: we'll start with the idea of first steps, what to do with the data; the second step, connecting the dots, now what do you actually do with the data; the third step, methods for looking at your data problems; and finally the decision making, the road to actionable data. That's more or less the framework we're going to follow.

So Scott, diving into this, let's talk about those first steps: what are you supposed to do with this data? Take the point in time when you first joined Twilio, or any other organization you've worked at for that matter. You're typically given this problem: okay, you've got a bunch of data, yay, what are you going to do with it? What are the first things that come to mind that you actually have to help address?
I think in general, if you think about data having an atomic unit, it's something like: data encapsulates an event, a metric, an observation in a system. Data can be anything. Early on, when you're learning computer science, you learn about an abstract bag; data is an abstract bag, it literally can be anything. But it can also be something that is future-proof in terms of what the data is, or something that very quickly corrupts over time. If you think about schemas and having binary-serializable data, you can create a lot of systems based on semi-structured data like CSV or JSON, but because that data is so subject to change, the question becomes whether you will be able to continue to use the data you originally captured. Does it make sense for that data to go away? Is it encapsulating a thought, and is it something that can be reused for anything else as well? I know that's kind of abstract, but generally data should be reusable if you're going to capture it, and there's a whole side tangent about many different directions we could go from here.

In general, everything starts with the data, and if you don't understand what it is that you want to capture and you have a fuzzy idea of what you need to do, it's much better to go through an experiment with a cheaper data structure, something like JSON or CSV, to understand what you're trying to accomplish, and then go out and take a look at what is already available. A lot of it depends on the stage the company is in. Having been at startups where there's absolutely no data, where the only data we really might have is what's coming in for free from, say, Google Analytics, having to make that early decision of what we should capture is something that sometimes gets glossed over, depending on the company and whether or not they have good data maturity yet. But I think everything just starts with the data, if that's a good answer.

No, that's a great answer, and I'm going to come at the question a slightly different way based on your answer. What about the companies, and I've unfortunately seen my fair share of them, where the solution is basically: I'll just collect everything and hope that we'll eventually be able to make sense of it? Can you tell me some of the pros or pitfalls of that particular approach?
A lot of people talk about it as data hoarding, right? We'll collect it because, I don't know if anybody ever watched old South Park, but there's an episode with the underpants gnomes, I don't know if that's safe to talk about, but it's step one: collect underpants; stay there, don't go too much deeper, and I think we're fine. We know that data has value, so if we collect everything, eventually we'll be able to go back and think of something to do with it. But a lot of times it's an expensive prospect, like prospecting for gold, except the prospect itself is usually a false assumption. The idea is: if I have data and I don't necessarily know what it is, eventually in the future we can throw machine learning at the problem to figure it out. But really the easier thing to do, and the cheaper thing to do, is to figure out what you want before you collect it, versus collecting everything. Otherwise it's a monumental task, and usually a very thankless task, to go back through old data attempting to find clues. It's much better to opt into what data you want to collect versus just taking everything. There's also a whole sidebar discussion about GDPR and data privacy: don't collect everything without knowing what it is, because eventually it's going to come back to haunt you. Think that whole process through so that ten years from now your company doesn't get shut down for privacy issues and things like that.

Oh, that's a great call-out, and in fact I will probably wait until a little bit later, in the decision-making phase, to talk about GDPR and privacy in more detail, but I do think it's important what you just called out: even if you could theoretically afford, from a fiscal perspective, to just record everything, the fact that we have legal issues around CCPA, GDPR, privacy, and data breaches means you can no longer really afford to do that anyway. I take it you'd probably agree with that statement? Yes, I agree with that. Perfect.

Okay, well then let's shift over to connecting the dots, because you've talked about asking the right questions about what the data is, or looking for some initial patterns to figure out whether we're collecting the right thing. How do we connect the dots? What are some key facets or points that you really need to call out for people who want to actually connect the dots of the data they have?

So I think there's a segue here to having structure in your data. Everyone talks about data catalogs, right, and having a data catalog is kind of like, back in the day, and I'm just showing my age here, getting the Sears catalog. It would come for Christmas, there's a bunch of items in there, and you can find things by category, by type, by price, and so on. It's easy to locate what you're looking for.
Nowadays, everybody would love to do the right thing, but a lot of times data is under the radar: this team owns that data, that team owns this data, talk to them, Slack them, have a conversation about how to get it. That leads to very long lead times for how people can actually collect and use data, or even collaborate with data, if there's not a single location and process for finding and locating it. I see this a lot with the different machine learning teams I work with: we need data for x, y, and z, and without even saying what x, y, and z are, data powers models, data powers everything, but if you don't know where to look for it, it's a long road to getting something out the door. We have this idea, we have these expectations, but we can't power it with data, or we haven't collected the right data, or the data is not trustworthy because the feed stopped, the data pipeline stopped, the data engineering team no longer exists. There are certain issues that can really stop something from running. It's kind of like the California water crisis: we'd love to have water, but we don't have any, so let's figure out something else, maybe we'll ship it in. It's not a good analogy.

Actually, I've got a pretty good analogy that I like to use, if you don't mind. You know how we talk about this idea of data catalogs or data dictionaries? What I usually like to say is: no, what you actually have is a data Dewey Decimal System, or index cards. For all you young folks who may no longer remember, this is the idea that you go to a massive set of little drawers and pull out little cards to find the book that you need in the library. The data we're talking about, exactly to Scott's point, isn't just one gigantic file system, blah. What we actually need is to organize it and know what it is, so in essence your data is more like a data library, more than anything else. And how do we index, organize, and figure out what exists? In the library it's the Dewey Decimal System, the index cards, and in essence that's what we need, because otherwise, exactly to your point, Scott, we don't even know what data exists, so how are we supposed to analyze any of it?

Trial by fire, right? Exactly. I think along this journey, and we haven't even talked about building an evolutionary system or anything else yet, if you think about the steps necessary to get started using data in general, a lot of it depends on the size of the company and what you really need. Do you need to run everything in a stream?
Probably not. Do you need an effective way to ensure that data arrives? Probably. And there are different ways of looking at the problems: do we want to collect all the sensor data, or do we really only care about whether somebody has purchased an item in an e-commerce situation? What is it that we actually want to track, and is it non-creepy to track the things we want to track? What is best for the customer? I guess "uncreepy" is top of mind for me in general just because of privacy. I love to know how my personal data is being used. I wear the Apple Watch and all the stuff that works with Health; I like looking at health data, I like being able to track whether I was anxious or not, whether I slept well or not, and how that changes and affects me. So I'll give away my data to different applications that run a watch app, because even if I don't necessarily trust the companies collecting the data, I like the results. It's kind of like that movie, The Circle, right, the one with Hermione Granger in it and Tom Hanks. Basically, in the future we're tracking everything and everything's on camera, but because things are more transparent, we're better people for it. It was an interesting study on life and supervision, the Eagle Eye, Big Brother thing: at what point are you willing to opt into less privacy for the greater good of things? And that's a tangent.

No, that's okay; in fact, The Circle, I had completely forgotten about the movie, so my apologies for that. Okay, so up to this point, exactly like you called out, we haven't even talked about evolutionary; we're still talking about what data we have and how we're going to index it or make sense of it. What was implied but not actually called out, and which actually is part of that evolutionary process now that we're going to jump into it, is data lineage. What I mean by that is: don't forget, when you're taking all the pieces of data that you're trying to process, the reality is that data is going to get joined with other data, and then new data is going to be created, so there's a lineage of understanding that whatever happens to something upstream is going to invariably impact a system downstream. Because that's the real problem, how do you handle things like lineage? Forget about the technology for a second; what's the framework or mindset that you have for things like lineage? I'm curious.

So if you think about lineage, think about it like Google Maps, or some kind of generic maps application; I can't even remember the old apps I used to use.
Anyway, if you think about waypoints in a journey: you start somewhere with data, you collect it, and it moves along a route that potentially changes significantly over time, and at the end of the day there's something that you need to continuously rely on in terms of the output at an egress point, and at which egress point. A lot of times, if you think about a traditional data pipeline, it was: put stuff in a data warehouse and that's the end of the journey. But nowadays there's this whole notion of data meshes and fabrics and all this other stuff, or if you think about how streaming data has evolved over time with all the streaming platforms that are out there and the different ways of collecting it, there's this notion of being able to adaptively change a DAG, a directed graph of data, that's moving over time. So you might be experimenting: the same way that you canary applications, how do you canary or experiment with data? There are different ways to approach that; there are a hundred thousand ways to do everything.

If you take a step back from the data systems being used to collect and, for lack of a more elegant term, mash data together, joining data over many different tables, the problem with a lot of the lineage is that some data doesn't ever arrive; some data is gone. Who is basically responsible for data? I think data lineage starts with the notion that there's somebody producing data, a person or entity, whoever is doing it. A lot of times there's a team providing and producing data, and that data has a frequency: how often are we collecting from IoT sensors, what's the interval of time in which we're creating new data? But data also has different time intervals; there are different times when things need to line up, which usually never works in practice. In theory it's: here's a complex DAG and everything's always going to land on time, we'll give everything about a two-minute wait and we expect everything within ten minutes or so. In reality that's wishful thinking for a lot of stuff. But the lineage itself, if nothing else, is knowing what my upstream was: what's a partial graph of my data, do I trust my upstream to produce the data I'm expecting to have? And I think this is a longer conversation as well. It can be done through tables, it can be done through structured data, through Avro or Protobuf and different libraries, and it can be done with Parquet schemas, right?
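Nothing this concrete comes up in the conversation, but as a toy sketch of "knowing what my upstream was," here is a hypothetical lineage map expressed as a plain dict of dataset to upstreams, with a small walk that answers which sources a downstream table ultimately depends on; the dataset names are made up.

```python
# Hypothetical lineage: each dataset maps to the upstreams it is derived from.
lineage = {
    "daily_revenue_report": ["orders_enriched", "fx_rates"],
    "orders_enriched": ["orders_raw", "customers"],
    "orders_raw": [],   # ingested at the edge
    "customers": [],
    "fx_rates": [],
}

def upstream_closure(dataset: str) -> set[str]:
    """Every dataset that must arrive before `dataset` can be trusted."""
    seen: set[str] = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    return seen

print(upstream_closure("daily_revenue_report"))
# {'orders_enriched', 'fx_rates', 'orders_raw', 'customers'}
```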
But the key context, though, is that even though the techniques may be very different depending on the environment you're in, what has to be solved is the same regardless of which tech you use, right? So there are issues with the DAGs: what is the data change frequency, what is the change in the order in which these tasks need to run, what are the governance aspects that go with it, and, most importantly, how do you deal with failures? Why don't we talk a little bit about these aspects, whether it's the change frequency, the governance, or the failures?

So if you think about change frequency, and I like to say data comes in all shapes and sizes, here's a generalism: certain data changes a lot more often than other data. If you think about a more mature company that has very specific table structures for its data, chances are it probably hasn't changed in a while; you might be adding a new foreign key representing a new table, maybe, but a lot of times that data changes infrequently, especially if it represents a stable product. But then there's also data that is just message passing between many systems. For a lot of stuff now, say logistics for example, I need to understand where all my trucks are in the United States at all given points in time and who the drivers are, and I also want the consumer to know that they're number eight on the delivery schedule and, by the way, they can literally drive out and follow the truck. It's like OnTrac, I think, I can't remember the name of the company now, but it's interesting, the kind of eyes and ears we have into these different systems. Being able to get to the point where you can monitor everything at the same time is just a different kind of problem: one is time-based, where I care about how this geo position is changing over time; it's a different thing that you're solving for. So a lot of this ends up going back to what it is that you're capturing data for, and all of that is going to propagate through your system in general.

Right, and the implication is that because we're talking about evolutionary systems, what you're collecting and the reasons you're collecting it are going to change over time as well, which is why that's tied to your lineage and data change frequency: if your system is remotely usable and actually being used by people, it will invariably need to change to accommodate new metrics, new data, or anything else. But then, I guess, the question is: knowing these things, and knowing that you do actually have to care about them and build the engineering processes and rigor to handle them, how do you deal with the fact that, in reality, the vast majority of the time you have to prepare for data failures, whether it's a failure in processing
or a failure where the data is not in the expected format, or whatever else for that matter? Let's harp on that for a couple of minutes.

So any system will fail given enough time; given enough uptime, it will fail. That's a general rule of thumb. I think with data, failures can be a lot more catastrophic depending on the fragility, I guess, is the right word. If you have a fragile system that's always expecting a specific event, and that specific event never arrives, then you have memory leaks because you're waiting for one event to push data out of the system, for example. Doing game days in general is an older DevOps-type strategy: like Netflix's Chaos Monkey, what happens when we turn off Cassandra nodes at random, what happens next, what's the mean time to recovery (MTTR), and how do you track that? On the data side, data is very complex in general: when you start thinking about these DAG chains that might have a hundred levels of indirection, really just a graph of data moving through a system over time, pinpointing a failure is hard. There's a lot of movement now towards data observability and being able to track metrics like the completeness of my data: if I expect to have a specific data format in general, because that's the expectation, that's my data contract with my upstream. And let's just say "upstream," because an upstream can be an API, it can be a Kafka topic, it can be Pulsar, it doesn't really matter; it's some data source that I trusted not to let me down, and now it has, for some reason whatsoever. Being able to test that, and actually know what it is that I really do expect, matters, because a lot of systems are fire-and-forget: okay, we did our job, we sent some abstract map of Any to Any, for people from the Java world, here's something that's unrecognizable, you figure it out later. There's not much you can do with that; it could be anything, you just hope for the best. You can't really test for failure because there's not much to test, or it's really expensive to test.

But then there are also things like the Confluent Schema Registry: you have one or more specific structured schemas, Avro, Protobuf, and so on, I don't know if they support more than that now, tied to your topic, so you have a channel for emitting data of a specific type, and you can validate literally at the ingress point via the schema registry. There are many different ways of doing it; you can also do the same sort of thing through API gateways, depending on what it is that you need to do.
Validating the data that is, I guess, the lifeblood of your company, et cetera: it sounds kind of cheesy to say, but for data you don't want to fail, a lot of the time that can be solved more easily by having a binding contract. This is our version of this schema, for this release of our API server; that's what generates our data, and that's what's going to flow into the system. If I'm sending invalid data, it's easier to just send back a 400 to the calling service (this is for the REST case or whatever else): here's an error, please go fix yourself, this canary is bad. You're going to lose some data, but it's a lot easier to lose data in a one percent canary release than to turn it on at a hundred percent and find all of your data is gone. That's a lot harder to fix, and it's even harder to fix once it has entered a streaming network.

So fix it at the edge. The edge is important if you think of it as the transition point, the hand-off point; say it's a carrier network, or whatever it is, there's an agreement, and we don't want to let the other side down. Fix it at the source. If you have guarantees, if you have confidence in the data that's going to be received through your network, it helps with failure a lot more, because you know what to test for. Here are the stop-gaps we have in our system: say it's a gRPC server with a specific protobuf, and there's basically no other way of sending data in. Then the only thing that can be incorrect is whether people break backwards compatibility with a release. That could still change and kill everything, but it's a lot harder to get there if people follow best practices than if you're sending a random JSON blob that might change depending on the release or on which server is running. I can be blamed for that back in the day as well: "this is much easier, it'll be a lot less effort, we're literally just going to send anything." As you mature with your data operations, you move more toward: you can move fast and break things, but it's really bad if you break the eyes and ears of the systems the company depends on. So take more time and make sure the upfront process is there. Denny, I think you were talking before about how you make this a widely known best practice for data teams and for the people producing data within the company; if there are best practices to follow, or even low-code systems and other things that make it easier, then a lot of the time people will just adopt what's available.
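As a small illustration of the "reject it at the edge" idea above, here is a hedged sketch of an ingress endpoint that refuses payloads that don't match the contract and returns a 400 so the producer can go fix itself. The endpoint path, fields, and port are invented for the example; it assumes Flask, but the same pattern applies behind an API gateway or a gRPC service with a fixed protobuf.

```python
# A sketch of rejecting bad data at the edge: if the payload doesn't match the
# agreed contract, send back a 400 instead of letting it into the streaming
# network. Endpoint and field names are hypothetical; assumes Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)
REQUIRED_FIELDS = {"event_id": str, "event_time": str, "user_id": int}

@app.route("/ingest", methods=["POST"])
def ingest():
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify(error="body must be JSON"), 400
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return jsonify(error=f"missing field: {field}"), 400
        if not isinstance(payload[field], expected_type):
            return jsonify(error=f"wrong type for field: {field}"), 400
    # At this point the record satisfies the contract; hand it off downstream
    # (e.g. publish to a topic). Omitted here to keep the sketch small.
    return jsonify(status="accepted"), 202

if __name__ == "__main__":
    app.run(port=8080)
```

Rejecting at this point keeps malformed records out of the streaming network entirely, which is exactly why losing one percent of a canary's traffic here is cheaper than repairing data downstream.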
Right, exactly. And I think the key callout when it comes to those best practices, as you're saying (and you actually called it out), is this concept of a contract. If we go back to the whole premise of REST API services: using REST APIs to send data in bulk has its own set of problems. The systems are extremely chatty, we end up saturating the networks, it's a lot of fun, and I'm being very facetious for obvious reasons. (Oh sure, let's go with that, that's fine.) But by the same token, the whole premise of REST APIs was the idea of a contract: if you send this correctly formed message, we will return this correctly formed response in this particular format, whether you're talking about protobuf or Parquet or JSON, in the correct form for you to then digest. And the likelihood of it failing, assuming you follow best practices, is low: when you add new columns you create new versions, so consumers stay tied to a particular version and are less likely to fail. That's the whole point of building up that contract. Even though we sound like a bunch of lawyers because we keep using the word "contract," that's actually the most important aspect of scaling. The idea is that as you evolve there are contracts made between teams, right down at the API level, so the different teams interacting with each other can interact at that contract level: they know exactly what is expected of their input and what the expected output is going to look like.

Yeah, I think that's a really succinct way of encapsulating the last ten minutes of my blabbering, and I think it's of the utmost importance. We talked about data lineage before; before you even begin to think about how data mutates over time, the real question is whether data should mutate outside the scope of what we need at all. Think about SQL: you can project, you can select, you can create what you want out of the data, and it's the same kind of thing with GraphQL, but behind everything there's always this very solid structure. Here's the table format, and I'm not going to randomly change it. With SQL tables we can't just say, "you know what, forget it, this column's out of here, I hate you, you're out of here." You can deprecate it, you can do things over time, you can even start writing to two different tables and slowly deprecate an entire table, but a lot of the time you're stuck with that base contract, depending on who's using it; you can't break that trust. A lot of that was baked in early with SQL systems. And then I think back to the lineage of the other data evolutions: moving from fully structured SQL into NoSQL, a document store for example, with its own pros and cons.
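To ground the versioning point ("when you add new columns you will have new versions"), here is a hedged sketch of a backwards-compatible schema change: version 2 of a record adds new optional fields with defaults, so readers built against version 1 data keep working. The record type, fields, and version-numbering convention are invented for illustration.

```python
# A sketch of backwards-compatible schema evolution: v2 of the record adds new,
# optional fields with defaults, so consumers can read v1 and v2 payloads
# without breaking. Record type and field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderV2:
    order_id: str
    amount: float
    # New in v2: optional, with defaults, so v1 records (which lack these
    # fields) still deserialize cleanly -- the contract is extended, not broken.
    currency: str = "USD"
    coupon_code: Optional[str] = None

def read_order(raw: dict) -> OrderV2:
    """Accept both v1 and v2 payloads; unknown versions are rejected loudly."""
    version = raw.get("schema_version", 1)
    if version not in (1, 2):
        raise ValueError(f"unsupported schema_version: {version}")
    known = {k: v for k, v in raw.items() if k in OrderV2.__dataclass_fields__}
    return OrderV2(**known)

v1_record = {"schema_version": 1, "order_id": "o-1", "amount": 19.99}
v2_record = {"schema_version": 2, "order_id": "o-2", "amount": 5.00,
             "currency": "EUR", "coupon_code": "SPRING"}
print(read_order(v1_record))
print(read_order(v2_record))
```

The design choice is that new fields are additive and defaulted; removing or renaming an existing field would break the contract and would need a deprecation period instead.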
You can literally put anything in a document store, you can massively scale it horizontally, and you're really not having to worry about the same separation of concerns you have with a traditional OLTP system, a regular, old-fashioned database. But there are pros and cons to that, because a document can be anything, and it can always be anything. I remember back in the day, Silicon Valley had this meetup group, the cloud computing meetup group (I don't know if it's still in existence), and I remember sitting at Carnegie Mellon maybe ten years ago going through Riak, Tokyo Cabinet, Redis, and Cassandra: different flavors of NoSQL, pros and cons for everything. And I remember thinking, oh man, this is awesome, I hate schemas, schemas are terrible. (Oh yeah, a hundred percent, I hated those things.) But as things change over time you kind of fall back; it feels like me getting old, but it's "no, actually, I need something that's structured."

You know, I want to interrupt you slightly, but only because I know exactly what you're talking about. I came from the SQL Server team at Microsoft myself, so very much structure, structure, structure, with all the advantages of structure, but we couldn't scale it out for the life of us. We could certainly scale on a single box, even with NUMA nodes (yes, I'm bringing that back), but distributed horizontal scalability just wasn't going to happen. So then I was on the team that brought in what is now known as HDInsight, that is, Hadoop on Azure and Windows, because we were saying, nah, forget it, schema-on-read is the solution to everything. The NoSQL databases were coming up at the same time: we just need document stores, we never need the schema. That was the v2 version of the data world. And right now we're in v3, and at the risk of sounding like a marketing guy using terms like "lakehouse," the term is actually accurate. The problem is that we want the flexibility of the NoSQL world and of data lakes, but we still need reliability and enforcement of some type, so that the data that comes in can actually be tracked, has lineage, and has some form of governance. That's the v3 world, the lakehouse world, where we have to combine those two concepts: structure with contracts, yet also scalability and flexibility, at the exact same time.
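As a hedged illustration of that "structure plus flexibility" idea, here is a sketch of lakehouse-style schema enforcement using Delta Lake on PySpark: an append whose schema doesn't match the table is rejected, and evolving the schema has to be an explicit choice. The table path and columns are made up, and it assumes a local Spark session with the delta-spark package installed; this is a sketch of the pattern, not a recipe for any particular deployment.

```python
# A sketch of schema enforcement in a lakehouse table: appends that violate the
# table's schema fail, and schema evolution must be opted into explicitly.
# Table path and columns are hypothetical; assumes pyspark + delta-spark.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("schema-enforcement-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events_delta"

# Establish the table; this schema is now the contract.
v1 = spark.createDataFrame([("a1", 7)], ["event_id", "user_id"])
v1.write.format("delta").mode("overwrite").save(path)

# An append with an extra, unexpected column is rejected by default.
v2 = spark.createDataFrame([("a2", 8, "EUR")],
                           ["event_id", "user_id", "currency"])
try:
    v2.write.format("delta").mode("append").save(path)
except Exception as err:  # Delta raises an AnalysisException on schema mismatch
    print(f"rejected by schema enforcement: {err}")

# Evolving the contract is a deliberate, explicit step.
v2.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```

That rejected-by-default behavior is essentially the table acting as the contract described above, while the explicit mergeSchema option is the negotiated version bump.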
Yep, I think that's totally accurate. The other thing, too, when I think about lakehouse and the other emerging architectures: we had NoSQL, we had SQL, we had NewSQL, and NewSQL sounded great. What is NewSQL? Well, it's horizontally scalable, we throw cluster scale at it, and replication basically takes care of the fact that we're not going to have a vertically scaling node. But it comes with issues around the CAP theorem: issues with latency, issues with partition unawareness, the "well, I don't know where grandpa went, but it's okay" kind of problem. How do you architect for failure, to go back to that whole failure issue, and do things still work? That was kind of the early MapReduce programming model: we expect failure, we expect nodes to go down, so let's just replicate a bunch and throw the whole entire cluster at the problem. Then other revolutions come from there: this works really well, but how do we add caching and iteration to all of this, and how do things grow up from there? Now, going back to it, it's choose your own adventure. Do you want structure? Do you want speed? What's the trade-off you're willing to make with your actual data sources? I think most people eventually love consistency (there's a long joke about eventual consistency in there), but I would much rather know exactly what data I have, when it lands, and where. And that's a hard problem to solve. Data lineage is difficult, especially with the whole microservices type of architecture: we went from a monolith to a bazillion different microservices, now everyone has their own opinion of what data looks like, and how do we converge on something that allows us to connect and read all this data while still making sure it works? That gets harder, and you don't want to micromanage.

Absolutely. So okay, let me wrap this up so that for the remaining five to seven minutes we can talk about the road to actionable data. One of the things we sort of skipped, but actually called out all throughout the conversation, is the methods for looking at the data: we talked about ad hoc versus batch versus streaming, and we didn't talk about "beyond," but only because we probably spent too much time talking about quantum computing, so let's not do that discussion now. But let's definitely talk about that road to actionable data, because we've been harping on the word "contract," and on observability for that matter, which in this case means SLAs, DataOps, and MLOps. So let's talk a little bit about those. How do we put this all together? We've talked about the structure, we've talked about contracts; how do we now actually act on this data, so that we can evolve our data systems while we're at it?
Yeah, I think if you think about everything, we've talked through roughly a four-step process, and we've hand-waved through it; there's a lot more involved in the nitty-gritty to make everything actually work. But at the end of the road, we've come to the close of this journey and we have data that's available for processing. The question then becomes: how often do you want to process it, and what strategy do you take? That's a whole tangent which would take us till 4 p.m., or whenever people end up viewing this later.

There's this whole notion of time: how often do I care about this data? I think Gita brought that up really well with the whole idea of inference caching. How often does my data change, and do I care if it changes? There's a lot of data you can't cache at all, because it will never be cacheable, and there's other data that can be. So if you think about what it is you're doing with your data, what the end result is that you're trying to accomplish, that's what you're architecting for anyway. Is this a point-in-time snapshot? Am I doing reporting on this data? Do I care about minute-level granularity, or about how things change hourly or daily? Sometimes I might only care whether customers are churning every seven days, or maybe monthly. The way you look at the data changes the way you're going to build the system.

Regardless of all of that (it's a bit tangential), if you want to take your data and do something with it, you have different ways of querying it. Are you using SQL? Are you connecting through JDBC to Tableau, or to some internal metric store, or to something else? What is it that you're doing? It comes back to where we started, which is: what is it that your data is capturing? We kind of end there as well, because once you know what you want to capture, you've collected it and you've gone through all the heavy work. And I know we glossed over monitoring, data monitoring: does it work, does it not work, am I missing data, what's the latency? I think it was Airbnb that created a streaming data visualization tool (there's a Medium post about it; apologies if I've misattributed that) built around the operational latency of specific data tables and knowing whether the SLAs have changed. You brought up SLAs before: what are my service level agreements, what are the indicators, how am I actually capturing the data and seeing how things change? Regardless of all that, it's basically DevOps 101, SRE-type information.
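For a concrete flavor of that kind of monitoring, here is a hedged sketch of a data-freshness check against a service level agreement: compare the newest event timestamp a table has received with an agreed freshness threshold and flag a breach. The table name, SLA value, and timestamps are invented for the example.

```python
# A sketch of a data-freshness SLA check: if the newest record we have is older
# than the agreed threshold, the SLA is breached and someone should be alerted.
# Table name, SLA value, and timestamps are hypothetical.
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_SLA = timedelta(minutes=15)  # agreed with the downstream consumers

def check_freshness(table_name: str, latest_event_time: datetime,
                    now: Optional[datetime] = None) -> bool:
    """Return True if the table meets its freshness SLA, else False."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_event_time
    ok = lag <= FRESHNESS_SLA
    status = "OK" if ok else "SLA BREACH"
    print(f"{table_name}: lag={lag}, sla={FRESHNESS_SLA} -> {status}")
    return ok

# Example: the last event landed 42 minutes ago, so the check fails.
latest = datetime.now(timezone.utc) - timedelta(minutes=42)
if not check_freshness("orders_silver", latest):
    pass  # hook this up to alerting (PagerDuty, Slack, etc.) in a real system
```

A real pipeline would pull the latest event timestamp from the table itself and route breaches to alerting, but the shape of the check stays this simple.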
But if you think about all of that in general, roll it all up into one huge bin of stuff, set it aside, and move on to the next step once everything else is actually working and you have data available. Again, we glossed over data governance, PII, who's using what and where, and all that other stuff. If you want to make things actionable, a lot of the time it's literally just connecting to something. Are you connecting back to an API? Are you throwing the results back onto Kafka? What is it that you want to do? It goes back to that whole open-ended question, but there are many ways of connecting back to your data nowadays. Basically, choose your poison: how do you want to do this, and are you connecting in a way that is future-proof as well?

Perfect. So I wanted to point something out; Scott did not ask me to do this, by the way, I did it of my own volition. Scott happens to be the author of an upcoming book called Modern Data Engineering with Apache Spark: a hands-on guide for building mission-critical streaming applications, so I went ahead and popped the link into LinkedIn, into YouTube, and into the Zoom here. Scott and I always have a great time chatting with each other; honestly, even if there weren't a major event we'd probably do our own version of this anyway, just the two of us (for that matter, it probably is just the two of us). But irrelevant of that, the real key callout I want to make is that Scott has a ton of experience building these systems, and when we want to go deeper into the various concepts we've talked about here, whether it's lineage or contracts or anything else, a lot of these concepts are in fact covered much more in depth in his upcoming book. So I do think it would be a really great idea for you all to go check it out; I just wanted to give a quick shout-out, and I hope that's okay. (Thank you for the free marketing. Hey, I've got to be helpful sometimes.) Perfect. Well, this has been a fun conversation, like always, Scott. I'm going to switch it back to Karen to close us out, but as always, if you have any questions, the other place you can check Scott and me out is the Delta Users Slack channel; we're both basically hanging out there, so you can always just ping us. (Thanks, Denny. And I'm sure you'll have some upcoming sessions continuing this conversation in the future. No, we wouldn't; we would never do that.)

All right, well, thank you so much, Scott, for your time; it was great having you for this meetup. Thank you to Gita too, who had to run, and thank you everyone for joining. Again, this recording is available on YouTube, so check out our YouTube channel, and I know Denny and Gita posted a bunch of links to their materials, so I'll go ahead and update the YouTube description with those links. I hope everyone has a great rest of your day. Take care, thanks, bye.
Info
Channel: Databricks
Views: 484
Id: 5gQeXKroKng
Length: 79min 48sec (4788 seconds)
Published: Thu Nov 18 2021