DC_THURS on Spark w/ Tathagata “TD” Das

Captions
Host: Welcome, everyone, to our first DC_THURS of 2021. It's great to have everyone here. I know some of us geeks might be glued to the price of Bitcoin as it approaches $40,000 USD, but thanks for tearing yourself away from your wallets long enough to join us. We have a great lineup of shows and interviews for you this year, and I'll be continuing to guide you through that process. Anna and I have been researching and pulling together lists of folks we want to join the program. Last year was such a success, even with all of us stuck close to our screens because of the pandemic, that we've learned a lot and have a great line of programs for this year as well.

A couple of announcements before we start. We support live questions in the chat, so feel free to drop your questions for our guest and we'll try to get to them, time permitting. And if you want to be notified about the other DC_THURS shows we have lined up this year, just hit the subscribe button on YouTube and click the bell button to get notifications on your mobile device or desktop.

Today I'm really excited to introduce TD Das. TD is here from Databricks, and he has a really interesting history: a ten-year veteran of the data world, he comes out of grad school at UC Berkeley, where he started working on the Spark project very early with Matei Zaharia, one of the co-founders of Databricks. TD has been working in the Spark ecosystem for ten years, has contributed heavily to Spark as a PMC member and core contributor, and took the lead on Spark Streaming as it came down through the ages. TD, welcome.

TD: Thank you very much for having me here. It's a pleasure to be on this exciting channel.

Host: Awesome to have you. I wanted to dive right in and chat about some of your early career experience, because as I mentioned, you had an amazing vantage point on the birth and evolution of Spark. Could you share a little about how you got into this world and what the start of your career looked like?

TD: Sure. As you said in the introduction, I've been working in the Spark world for ten years. That journey started when I joined the AMPLab at UC Berkeley, where Spark and all of these projects started. I joined as a grad student, and one of my first projects was to help out Matei, who was writing Spark at that point: building the project, testing it, running benchmarks, typical grad school stuff. That started in 2011.

After that, Matei and I started brainstorming how to extend Spark beyond batch data to streaming data, and that's where the idea of Spark Streaming came into the picture. It happened at roughly the time when, for those who remember, Apache Storm was getting popular as the first open-source distributed stream processing system. We saw that Storm was limited in some obvious ways, in terms of fault tolerance and so on, so we thought: why not take the Spark engine and push it in a new direction, make it as fast as possible, so we could run stream processing jobs on it? That led to Spark Streaming: a single engine that can do both batch and stream processing, with a consistent set of APIs between the two, RDDs and DStreams, for those who have played with the old-school Spark Streaming. Consistent APIs with similar semantics make the developer's life much easier.
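To make the DStream point concrete, here is a minimal word-count sketch using PySpark's legacy DStream API. The socket source, port, and one-second batch interval are illustrative assumptions, not details from the talk; the point is that the operators (flatMap, map, reduceByKey) mirror the RDD API, which is the consistent-semantics idea TD describes.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)           # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))     # same operators as RDDs
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```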
TD (continuing): That went on for the next three or four years, until around 2014 or 2015, and in that time frame we saw Spark Streaming gain a lot of popularity, becoming almost the de facto standard for stream processing. But working with hundreds and thousands of developers, we also realized the shortcomings of Spark Streaming; we had to solve the next level of problems. Around the same time, Michael Armbrust, who started the Spark SQL project, was building out the Spark SQL engine, so the natural next step was to take the best of both worlds: take that much more heavily optimized SQL engine and build a streaming engine on top of it, to take advantage of the whole SQL optimizer. That led to the next iteration of the stream processing engine, the Structured Streaming engine, starting in 2015, and Structured Streaming has since become pretty much a standard in stream processing.

That took us to around 2018 or 2019. Somewhere around 2018, the same team at Databricks that built Structured Streaming realized, again after working with developers and customers, that giving people nice APIs to make it easier to write stream processing workloads solves only half of a data engineer's problem. Ultimately, a data engineer wants to solve the end-to-end problem: building end-to-end pipelines that take raw data and produce insights. Processing is only one half of that; the second half is storing the data in structured ways such that it's efficient to retrieve when needed, with data quality guarantees — transactional guarantees, ACID properties, and so on. For many decades, databases have been the gold standard there, because they give you all of those data quality guarantees out of the box, but data lakes on the Hadoop ecosystem didn't have many of those properties. That's where we realized we needed to solve the storage problem as well in order to make the end-to-end problem of building data pipelines holistically easier. So in 2018 we started building our newest project, Delta Lake, which is essentially a more structured but still scalable storage format — similar to Parquet, but imagine ACID transaction guarantees on top of it, while maintaining a full history of all the transactions that happened, with the ability to travel back in time and query previous versions of the data. All of those nice things you expect from a database, brought into the data lake world. Delta Lake is our most active project right now, built by the same team that has been working on Spark for many, many years.

Host: Got it. That's quite a rundown of projects across the decade. Let's dive into the streaming side a little first. What did you discover was the main use case for the early streaming systems you were working on?

TD: The earliest use case, and still the major one, is essentially ETL.
TD (continuing): You have lots of data coming in, in the form of logs or sensor data, and you want to collect it and store it for post-processing, and you need a system that scales to large volumes — gigabytes of data coming in per day or per hour. If you dial back to the late 2000s, when Hadoop MapReduce had just arrived, the Hadoop ecosystem — HDFS with MapReduce — solved the problem of storing and processing very large volumes of data in a scalable manner. What it did not yet solve was doing that as close to real time, in a streaming fashion, as possible. That's what led to the advent of Apache Storm back then: it was the first system that could consume large volumes of data from streaming sources like Kafka or Flume — there were a few others, RabbitMQ and so on — in a scalable manner and store it on, say, HDFS in near real time, so that the data in HDFS could be queried later for analysis. That was one of the most important use cases right from the beginning, and it still is.

Host: And Storm came out of Twitter, right? So it was indicative of the Twitter-style data flow, which is obviously why they had to come up with Storm in the first place.

TD: Exactly. Twitter being Twitter, they were one of the most real-time products out there, so they needed the most real-time, scalable engine at the back; that's why they came up with Storm, which was the start of the whole big data stream processing movement, and we saw the rise of that. But where Storm did not initially do very well was providing end-to-end guarantees: if you have one record coming in from, say, Kafka or whatever source you're reading from, you want that record to be present in the final output on HDFS exactly once — not duplicated, not missing. Storm didn't do very well at providing those guarantees, because of the way it was architected in its first iteration — and that's true of any tool that is the first attempt at something really challenging. That gave us the motivation that we could do better, and do it in an engine that solves a whole lot more problems, because at the time people had to maintain two different stacks: the MapReduce stack for batch processing and the Storm stack for stream processing, with different code, causing inconsistencies between the results they produced. We wanted to solve that problem as well, in one go. That's what led to Spark Streaming.

Host: So a large design requirement of Spark Streaming from the start was that it would support exactly-once semantics?

TD: Exactly.
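As an illustration of the kind of exactly-once streaming ETL being discussed, here is a minimal sketch using today's Structured Streaming API (used for brevity; the DStream-era code looked different). The broker address, topic name, and paths are placeholders, and the Kafka source requires the spark-sql-kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

# Consume raw events from a Kafka topic (names are placeholders).
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

parsed = raw.selectExpr("CAST(key AS STRING) AS key",
                        "CAST(value AS STRING) AS value",
                        "timestamp")

# The checkpoint tracks source offsets and sink commits together, which is what
# gives end-to-end exactly-once delivery into a file-based sink.
query = (parsed.writeStream
               .format("parquet")
               .option("path", "/data/landing/events")
               .option("checkpointLocation", "/chk/events")
               .trigger(processingTime="30 seconds")
               .start())
query.awaitTermination()
```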
TD (continuing): And going back to your original question, ETL was one of the large use cases; the other large use case was real-time alerting, which Storm, with its very low, millisecond-scale latency, was very good at. When designing Spark Streaming, we realized that a batch processing engine, even if we pushed it further, would find it really hard to provide millisecond-level latency. So we made another conscious choice: we conjectured that for the vast majority of stream processing workloads — be it ETL, alerting, and everything in between — second-scale latency was good enough for 90 to 95 percent of the use cases. That was our hypothesis, and we went ahead and built Spark Streaming with that limitation: maybe we'd get to second-scale latencies, half a second or so, but not lower, and that was okay, because the other benefits in terms of usability and guarantees would be much more useful for the broadest class of streaming problems.

Host: And I guess that's the design decision that led to the micro-batch concept, is that right?

TD: Exactly. We thought it would be good enough for most things, and looking back, the hypothesis still holds eight or nine years later. Some decisions were right back then, because we're still on a micro-batch architecture, and I can still claim that the high 90s — I can't put an exact number on it, 90 to 95 percent or even higher — of the workloads we see in stream processing are fine with second-scale latency.

Host: Do any of the other common frameworks use micro-batch technology, or is Spark Streaming the main one, or the only one?

TD: At this point, Spark Streaming and the subsequent Structured Streaming are pretty much the only ones built in a micro-batch format, because Spark is one of the only engines that really has the exact same engine — the same underlying nuts and bolts, the same job and task structure — running both batch jobs and streaming jobs. That makes it easier to deploy and manage, too: if you've tuned your setup for batch workloads, you don't need any additional tuning for streaming workloads. So yes, we're the only ones with micro-batch — with the known limitation — but does it really limit anyone, if 99 percent of your use cases don't care about the limitation?

Host: Got it, I definitely understand the argument. Then, fast-forwarding to Structured Streaming: what was the main design goal, the main improvement, behind Structured Streaming?

TD: After developing Spark Streaming for three or four years and working with developers, we realized the two main flaws of Spark Streaming's API design. The first was that Spark Streaming defines micro-batches based on when the data arrives — it cuts off batches by arrival time. But say you're doing windowed aggregations, where "give me the average" is defined by when the data was generated, which is called the event time. Take an example: you have data coming in from a sensor and you want the per-hour or per-minute average sensor value. When you say "per hour", the hour should be defined not by the timestamp when the data arrived at the processing engine, but by the timestamp when the data was generated at the sensor. That kind of event-time windowed aggregation was very hard to do in Spark Streaming, because it did not have a first-class concept of event time, and incorporating event time into aggregations was very hard with the DStream API. That was problem one.
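For contrast, this is roughly how the hourly sensor average looks in Structured Streaming, which treats event time as an ordinary column — the capability TD says DStreams lacked. The schema, topic, broker, and column names are assumptions for illustration, and the Kafka source again requires the spark-sql-kafka connector.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("event-time-windows").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("event_time", TimestampType()),   # when the reading was generated
    StructField("value", DoubleType()),
])

events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "sensor-events")
               .load()
               .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
               .select("e.*"))

hourly_avg = (events
    .withWatermark("event_time", "2 hours")                    # tolerate late data
    .groupBy(F.window("event_time", "1 hour"), "sensor_id")    # window by event time
    .agg(F.avg("value").alias("avg_value")))
```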
TD (continuing): Problem two was that, while the DStream operations — map, reduce, count, filter — were exactly the same as the operations on Spark's RDDs on the batch side, the classes were still different, so users had to do some rewriting of their code to run the same business logic in batch and in streaming. That was still a game changer coming from MapReduce and Storm — a very minor rewrite compared to rewriting MapReduce code for Storm — but it wasn't the best we could do. We wanted to build something that lets the developer write one piece of code, and then, based on the user's intent, lets the engine figure out whether to run it in one shot over a finite amount of batch data or continuously over continuously arriving streaming data — with exactly the same code, no rewriting. Those were the two main motivations, the drawbacks of DStreams, that led to Structured Streaming.

The other motivation for building Structured Streaming was to take advantage of the SQL engine underneath. The newer batch DataFrame API is essentially a programmatic mirror of SQL; both boil down to the same logical plan, Spark's Catalyst optimizer optimizes it, and the Spark SQL engine runs it on RDDs, so you get all the performance benefits of those optimization steps. We wanted to take advantage of all that for stream processing as well, so that was the main reason to build Structured Streaming on top of the SQL engine. What we got essentially for free is this: you can write SQL, or reuse the same DataFrame API, for your business logic, and then, based on a few additional flags outside the business logic, ask Spark to either run it once on a batch of data or run it continuously on arriving data. That achieves the best of both worlds.
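Here is a minimal sketch of the "write the logic once" idea: the same DataFrame transformation runs in batch or streaming mode depending only on how the source and sink are declared. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unified-api").getOrCreate()

def business_logic(df):
    # identical logic for both modes
    return df.groupBy("device_id").agg(F.avg("reading").alias("avg_reading"))

# Batch: run once over a finite dataset.
batch_df = spark.read.json("/data/sensors/")
business_logic(batch_df).write.mode("overwrite").parquet("/out/device_averages/")

# Streaming: run the same logic continuously over newly arriving files.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/sensors/")
(business_logic(stream_df)
     .writeStream
     .outputMode("complete")          # aggregations need complete/update output mode
     .format("console")
     .start())
```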
Host: Got it. So these were all steps in the evolution of the data processing strategy of the Databricks products, and being able to seamlessly embrace both batch and streaming was a key design decision across all of these enhancements. Then, I guess, the next step, as you mentioned in your earlier comments: it's not only about processing data efficiently in a way that reuses business logic across streaming and batch, which you achieved — the next stage is the storage of the data. Was the concept of a data warehouse a clear and present thing at Databricks at that time, and how did you sort through it?

TD: Absolutely, very good question. Think about how data storage has evolved. For more than half a century, databases — and data warehouses built on databases — have been the gold standard. Databases give you all the data quality and transactional guarantees out of the box. What they were not particularly great at, or not cost-effective at, was scaling out to very large volumes of data — gigabytes, terabytes, petabytes; they weren't built for that. That's what led to the original Google MapReduce, and from that Hadoop MapReduce, the Hadoop file system, and all the rest: a complete rethinking of the architecture in which scalability is prioritized over data quality guarantees. And that's what led to the data lake architecture.

Host: Which is also what led to object storage and S3-like systems in the cloud, right? They're quite simplistic, but massively horizontally scalable.

TD: Exactly. And before we get to the lakehouse, it's important to understand the pros and cons of these two alternate realities. On one side were data warehouses on databases, which were good at transactional guarantees but not scalability; the data lake was exactly the opposite, built for scalability but without transactional guarantees or any kind of data quality checks out of the box. So for more than a decade, companies dealing with large volumes of data have had to maintain these two stacks: for high-value data that needs those guarantees, they maintain a data warehouse on databases; for the larger volumes — mostly append-only data that doesn't need as many guarantees — they maintain a data lake.

Over the last two or three years we've been seeing a convergence, exactly mimicking the way Spark and Spark Streaming converged the processing stack, with batch and streaming merged into one engine. We're seeing the same pattern across the two storage stacks, data lakes and data warehouses, and Delta Lake is one of the tools leading that charge. Delta Lake is built on the same principles as a data lake — fully scalable storage, scalable writes, and so on — but it takes some of the fundamental principles of databases, like maintaining a transaction log that is atomically updated on every operation, and applies them there. It combines the best of both worlds in an attempt to give you the scalability of data lakes and the transactional guarantees of databases and data warehouses.

The lakehouse is the new architecture we've been proposing and driving, with Delta Lake as the vehicle: you don't need two different stacks. You build one stack on top of the data lake infrastructure you already maintain — S3 and the like — and directly leverage that scalable infrastructure to do the computations you've traditionally done on data warehouses, which simplifies your data storage architecture end to end. And as I said, that's the second half of the data engineer's or data scientist's ultimate goal of deriving insights from raw data. We spent the better part of the last decade simplifying the first half — making processing easier, scalable, efficient. Now we're working to make the second half easier by completely rethinking how storage should be done, and building on top of that.
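A minimal sketch of what that looks like from Spark: atomic writes, consistent reads, and time travel on a Delta table. It assumes a Spark session configured with the Delta Lake package (for example the delta-spark / io.delta artifacts), and the path is a placeholder that could just as well be an s3:// URI.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-basics").getOrCreate()
path = "/data/lakehouse/events"

# Each write is an atomic commit recorded in the transaction log.
df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("append").save(path)

# Readers always see a consistent snapshot of committed data...
current = spark.read.format("delta").load(path)

# ...and the retained history allows time travel to earlier versions.
version0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```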
Host: That makes perfect sense. What about the actual storage — the file format? Does a lakehouse require a universal file format, or does the data still live in its own disparate file formats depending on the system?

TD: Before Delta Lake, the common file formats used in data lakes were things like Parquet, ORC, and JSON, and the more structured the format — Parquet, ORC — the better the performance, so those were the recommended ones. The common thing across all of them is that they are open file formats: any processing engine, any tool, can read them. That is in direct contrast with databases, which keep their data in a highly optimized but internal format that only the database engine can read. That openness was another principal tenet of the data lake architecture: file formats are open, so any processing engine can access the data in parallel without relying on a single gatekeeper to serve it.

In Delta Lake, to continue that trend, we store the data underneath in the Parquet file format — the individual files are Parquet — but to read the data correctly you need to read Delta's transaction log first, because the transaction log records which files need to be read to give you a consistent snapshot view of the Delta Lake table. The log is stored in the same file system, so it is just as scalable, and the log itself is maintained as Parquet and JSON, so the same engine — Spark — can scalably read the transaction log if it grows large, which lets the system scale to millions or tens of millions of files. You read the log first, and only then do you know which Parquet data files to read. The summary is: if your processing engine understands the Delta Lake format — that is, it knows how to read Delta's transaction log — you get the best of both worlds: transactional guarantees, snapshot isolation, consistent snapshots, ACID guarantees, while still accessing the data through open formats like Parquet, at scale and fully distributed.
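To make the "read the log first" point concrete, here is a small sketch (paths are placeholders, and `spark` is a Delta-enabled session as in the previous sketch). On disk a Delta table is just Parquet data files plus a `_delta_log/` directory of commit files.

```python
# Illustrative on-disk layout of a Delta table:
#
#   /data/lakehouse/events/part-00000-....snappy.parquet
#   /data/lakehouse/events/_delta_log/00000000000000000000.json
#   /data/lakehouse/events/_delta_log/00000000000000000001.json

# Reading as Delta consults the transaction log first, so only files belonging
# to the current committed snapshot are scanned.
snapshot = spark.read.format("delta").load("/data/lakehouse/events")

# Reading the same directory as plain Parquet bypasses the log and can pick up
# uncommitted or logically deleted files -- the consistency problem the log solves.
raw_parquet = spark.read.parquet("/data/lakehouse/events")
```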
Host: This is a fascinating architecture. To support all the ACID-like guarantees and some notion of schema, at the same time as this horizontally distributed scale, there must be a lot of tricks involved. One of the complaints I've heard about Hudi or Iceberg and systems of that type is that they're quite complicated to run and maintain. What's your answer to that?

TD: Ultimately, our goal with this project is that developers should not have to think about the complicated problems; we want to make these problems easier and more natural to deal with. One major thing we do around schema is schema enforcement. Think about using a plain Parquet table: there is nothing blocking you from accidentally writing a bunch of Parquet files into a table directory with a different schema and format than the rest of the data, and then your table is essentially corrupted, because you have data in inconsistent formats. The major reason that happens is that the tools writing into the table have no knowledge of the data already in it or its schema, and therefore cannot check whether what they're writing is valid. That's where Delta comes into the picture: because it maintains all of that metadata in the transaction log, before doing any write it can enforce whether the data being written matches the correct schema.

Then it goes a step further — and this is where making things simpler for the developer comes in — with automatic schema evolution. If new data comes in with an additional column, and that new schema is perfectly consistent with the old one because you're just adding a column of data, Delta will identify that and automatically evolve the table's schema in the metadata: the table now has a new column, and that's okay. The user doesn't have to do what you would in the pure SQL world with an old-school database — add the column first and only then write. Here it can be much more seamless: flip a flag to turn on schema evolution, and if you have a streaming pipeline continuously writing into a Delta Lake table and your data happens to gain new columns, they get added without manual intervention. That makes the developer's — the data engineer's — life much easier, but it is still safe: if data arrives that incorrectly changes the data type of a column, Delta will block it; it fails when it is supposed to fail, because that schema change is not correct, while allowing things like new columns when it is safe to do so. We've made very opinionated policies about which schema changes are allowed and which are not, in order to minimize downstream confusion, and that is something we believe Delta has done better than other systems like Hudi and Iceberg.
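A sketch of the enforcement-versus-evolution behaviour described here, reusing the hypothetical `spark` session, `path`, and table from the earlier Delta sketch; the column names are made up for illustration.

```python
# New data with columns the existing table doesn't have yet.
new_data = spark.createDataFrame(
    [(1001, "click", "mobile")],
    ["event_id", "action", "channel"])

# Without evolution enabled, Delta's schema enforcement rejects this append
# because the columns don't match the table schema:
# new_data.write.format("delta").mode("append").save(path)   # -> AnalysisException

# Flipping the mergeSchema flag lets a compatible change (new columns) evolve the
# table schema automatically; incompatible changes, such as changing a column's
# data type, still fail.
(new_data.write
         .format("delta")
         .mode("append")
         .option("mergeSchema", "true")
         .save(path))
```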
Host: Got it. That's a super compelling vision. I want to take a question from our audience: Mayank is asking whether you can help us understand the benefits a lakehouse brings if we want to power ML/AI solutions. How does the lakehouse apply to the machine learning world?

TD: Great question, Mayank. ML and AI are among the biggest motivations for unifying the data lake and data warehouse worlds into the lakehouse. Think about what it used to be like: data locked in a database or data warehouse could not be read by ML systems, because only the database could read that data. The way people used to integrate data warehouses with ML was to explicitly export the data from the database's format into an external format like Parquet or JSON — give me a minute, I'm getting a call, let me handle this; sorry about that. So, where was I — yes, people would have to export the data from the database into an external format like Parquet, JSON, CSV, whatever, and only then could the ML systems read it, which is an extremely time-consuming, resource-consuming process to repeat every time, all because ML systems could not directly read the underlying data in the database. With data lakes and the lakehouse, you get the best of both worlds: you can have a very fast, efficient SQL query engine, just like a database, accessing the data, but ML systems can also access the data directly, with no export step, because of the open file formats. That's the point of the lakehouse: one single infrastructure that allows any processing, any analytics, to be built on top of it — SQL for BI analysts or ML/AI for data scientists — all in the same efficient way.

Host: In practice, are you seeing this vision actually being embraced by the market? Before, people probably had a warehouse, and maybe a data lake almost by default because data was being stored in S3 and all kinds of places; they probably had a time-series database, or some other OLAP database — all these different data systems. Are you seeing these converge, with people spinning down systems that used to exist and migrating them to Delta Lake?

TD: The change is happening, but it's still a very nascent concept. As a technology — not just Databricks — we're not at the point where everything has completely moved; it obviously takes years for any new technology, any new paradigm, to mature. But in 2020 we saw, in our customer base, a lot of use cases that were traditionally handled with dedicated databases or dedicated key-value stores being transitioned over to something like Delta Lake to make them more cost-effective, scalable, and efficient. So the change is happening, but these are still the early days. It's good to know that this is the trend we're going to see for the next two, three, four years: things are going to converge more and more.

Host: Got it. What about the immediate product roadmap for Delta Lake? What major enhancements are you working on now that you can share with our listeners, that they should be prepared for in the near future?

TD: Good question. Some of the major work we're doing in Delta Lake is around pushing the boundaries of schema evolution. Schema evolution is something very few systems can do effectively, but as I explained, it can make a data engineer's life much, much easier, with far less manual intervention needed whenever data types or schemas change. It's a very hard problem, nobody has done it effectively in the past, and we're basically at the cutting edge of it. For example, one of the things I'm personally working on right now: Delta, unlike other systems, directly supports the MERGE SQL operation. For those not familiar with it, MERGE lets you upsert data. Imagine you're computing aggregates based on a key — essentially key-value data. For keys whose values don't already exist, you want to insert those key-value pairs; for keys whose values do exist, you want to update them. That simultaneous upsert of a set of key-value pairs can be done with the MERGE operation, and Delta Lake is one of the first scalable systems to support the MERGE SQL operation directly.
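A minimal upsert sketch with Delta Lake's MERGE, via the Python DeltaTable API (requires the delta-spark package). The table path, key column, and `updates` DataFrame are hypothetical; `spark` is a Delta-enabled session as before.

```python
from delta.tables import DeltaTable

updates = spark.createDataFrame([("k1", 42.0), ("k2", 7.5)], ["key", "value"])
target = DeltaTable.forPath(spark, "/data/lakehouse/metrics")

(target.alias("t")
       .merge(updates.alias("u"), "t.key = u.key")
       .whenMatchedUpdateAll()       # keys that already exist: update
       .whenNotMatchedInsertAll()    # keys that don't: insert
       .execute())

# Equivalent SQL form:
#   MERGE INTO metrics AS t USING updates AS u ON t.key = u.key
#   WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
#
# Schema evolution during MERGE is governed by a separate, documented setting
# (spark.databricks.delta.schema.autoMerge.enabled) in recent Delta releases.
```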
TD (continuing): And schema evolution on top of MERGE is one of the most cutting-edge things we've done; we're still working on it, and it's something no one else has come close to. Some traditional databases do support the MERGE operation — quite advanced syntax of it, in fact — but nobody supports schema evolution on top of MERGE, evolving the schema while you're upserting. These are the kinds of really cutting-edge things we're working on right now, and we'll continue to push the boundaries for the next few years.

Host: Great. We have another question, from Rick: how is Spark trying to smooth the reuse of algorithms — e.g., JAX or PyTorch — in a lakehouse environment?

TD: That's a good question, and a complicated one, so let me elevate it to the general problem data scientists face: they want to get the best insights, or build the best machine learning models, from their data. The ML world has realized that no single tool provides the best machine learning model in all situations; for different problems you need different machine learning tools to get your best models. That means there's a need for a platform that can accommodate whatever tool the user wants to use, as easily as possible, and that need translates into how the platform is architected, at several levels.

At the Spark level, what we've been doing over the last few years is making some of our ML APIs much friendlier to the non-Spark world. For example, we've modeled our APIs to look closer to scikit-learn's, so that users can transition their code between scikit-learn and Spark MLlib more seamlessly and easily try both tools. In a different direction, for people familiar with pandas — pandas DataFrames in Python have a different API than Spark's DataFrames — we built a new project for pandas users called Koalas, which unifies the APIs and gives a pandas-equivalent API on top of Spark DataFrames, so that just changing an import lets you take your pandas code and run it on Spark. So at the Spark layer, we're making it easier to move back and forth seamlessly between Spark and non-Spark tools, so that developers can use whichever tools best suit their needs; there isn't one tool that satisfies everything.
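A small sketch of the Koalas idea: pandas-style syntax executed by Spark. (In Spark 3.2 and later the same API ships as pyspark.pandas.) The file path and column names are placeholders.

```python
import databricks.koalas as ks

kdf = ks.read_csv("/data/sensors.csv")                 # pandas-style call, runs on Spark
avg_by_sensor = kdf.groupby("sensor_id")["value"].mean()
print(avg_by_sensor.head())

# Moving between the two worlds:
sdf = kdf.to_spark()          # Koalas -> Spark DataFrame
kdf_again = sdf.to_koalas()   # Spark DataFrame -> Koalas (enabled by the import)
```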
TD (continuing): That's one level of what we're doing to make things more developer-friendly. At a higher level, in terms of the end-to-end analytics platform Databricks provides, we're making Databricks itself much more non-Spark friendly by incorporating these other tools into the platform, so users don't need to manage their own images or developer environments to bring in Python, bring in TensorFlow, and so on — the platform provides that out of the box, already installed. The moment you start, you can use our notebooks and our platform to run Spark MLlib jobs as easily as TensorFlow jobs, because the platform provides both. So we're tackling this at different levels.

Host: So these enhancements aren't specifically due to the lakehouse architecture — it's more a general Databricks product strategy decision that you want to enable developers to run other machine learning algorithms easily and have them plug into Spark execution?

TD: Well, I wouldn't say it's entirely untied to the lakehouse. These are parallel efforts. While we're making it easier to have one platform that runs both kinds of jobs — TensorFlow as well as Spark MLlib — we also need to enable these other tools to read the Delta Lake format directly, so that they can read the same data that's in the lakehouse independently of Spark. So we're definitely working on making the Delta Lake format more openly accessible and integrable with other tools, by providing APIs so that other tools can write connectors to read the Delta format directly, independent of Spark. That's something we're working on in parallel. The two efforts aren't intimately tied, but they're not independent either. Ultimately we want the data to be democratized, so that anybody can read it from any tool, with a single platform where you can use anything to read that data.

Host: Got it. I wanted to switch gears a little, TD, because you have such great experience with so many types of architectures, trade-offs, tools, and systems. For folks out there looking for career advice: you and I have spoken previously about the difference between the data engineer, who consumes the tools you build, and your kind of work, which is tied more to distributed systems. For people who want to deepen their appreciation of distributed systems and the other fundamentals they need in order to think architecturally the way you do — what would you encourage them with, and how would you encourage them to study up and get started deepening their knowledge in these areas?

TD: That's a very good question. Yes, I'm more of a systems engineer, building the tools that data engineers actually use in practice; they are the real heroes of their organizations, using those tools to solve the actual problems. The current state of affairs is that, until the world converges, there are so many different tools out there that it is often very hard to understand which is the right tool for solving your problem. That is one of the real problems data engineers face on a day-to-day basis: how to design and architect their data processing pipelines, using the right set of tools for their problems. Given this variety of tools, I think it has become more crucial than before to understand each of these tools in a bit more depth — the architecture behind the scenes, behind the APIs — to really reason about what the right tool for your purpose is. So I think it is absolutely important, more so now than five or ten years ago, for the data engineers and data practitioners who design these pipelines to invest in that understanding.
TD (continuing): Spending a bit of extra time and effort to understand these tools more deeply — their system architecture, what each of them is good for and what it is not good for — lets you make much more educated decisions when you're designing your stack, and I think that is something that can differentiate a good data engineer from a great data engineer. Not to trivialize the problem, but it is comparatively easy to build a solution with whatever tools, in a quick-and-dirty manner, that works once for your current needs. It is a whole different challenge to design it to be future-proof for the downstream use cases that will have to work on the same pipeline, and making those kinds of architectural decisions requires that in-depth understanding. So I highly recommend that data engineers and data practitioners building pipelines spend the time and effort to understand these tools a bit better.

Host: I dug out a talk that you gave, TD, that I think might help folks deepen their understanding of some of these concepts, on designing ETL pipelines with Structured Streaming and Delta Lake. We'll drop that in the chat. I think it's a good example of a talk that helps people go a little deeper and start to ask sharper questions about what they're trying to accomplish, and to consider the architectural trade-offs of different choices. Do you have any other advice for folks who are early in their career? Specifically, Santosh in the audience asks: how do I become a data analyst if I'm a mechanical engineer? What do you think about people transferring into data from other engineering disciplines — have you seen any tried-and-true path, or type of education or mentorship, that works?

TD: Absolutely. I personally have a number of friends who have made that transition from a non-CS background into data analytics and data science. I think the right way to do it is to start by finding the right materials to dip your toes into data analytics. There is a lot of material out there — MOOCs on Coursera, edX, and so on. Databricks also provides training and tutorials; those are specific to Spark, Delta Lake, and that kind of platform, but there are also tutorials geared toward beginner data analytics, including MOOCs built on top of the Databricks platform. So there is plenty of material that starts absolutely from scratch on how to become a data analytics person. The challenge is to put in the time, the effort, and the patience to start down that new path. I've seen people who were impatient and did not succeed, and people who were patient enough to power through the initial hurdle and became very successful data scientists and data analysts. So power through — that's the thing I'd really like to encourage. It is incredibly rewarding in the end, and people do make that jump successfully, but you have to have the patience, because the start will be slow, and that's okay. It is worth putting in the time and effort.
Host: That's right — well said. We're almost out of time, TD, but I wanted to ask you one parting question about the data ecosystem at large. We saw such amazing things happen in 2020: many startups popping up and getting increasing amounts of funding, and so many companies in the space doing well — not just Databricks, but Confluent and others. Besides the whole Delta Lake and lakehouse effect, what other trends are you excited about for 2021 and for the ecosystem at large?

TD: That's a good question — let me think. I think convergence is a common theme: convergence in terms of tools and architectures. Kubernetes, for example, is one big trend. Earlier there were a hundred different ways of managing your microservices on raw bare metal or VMs; now Kubernetes is becoming the singular way of managing them, and all the clouds support some form of managed Kubernetes service — proof of how the right architectural decisions, the right tools designed with the right goals in mind, can lead to the convergence of fragmented tooling. Over the last decade we've seen an explosion of tools, not just in data but in infrastructure as well; I think the next decade is going to be more about the convergence of those tools and infrastructures.

Host: I couldn't agree more. There's so much fragmentation in the data world — I spent some time last year writing a few articles about that. There are so many different tools, and you're absolutely right that there has to be some convergence. It's also important, as you mentioned, to really understand and weigh the underlying architectures and trade-offs, so that we merge these things in a meaningful way that is actually future-proof — one that doesn't just handle the volume of data we have now, but the volume we anticipate in five or ten years, which will only explode with IoT, healthcare data, and all the other types of data coming online with the systems we're building now. This is just the first frontier; I believe this whole big data movement is spreading to the rest of the world, and we really haven't seen anything yet in terms of scale.

Well, thanks, TD — it was great to have you here.

TD: Thank you very much for having me. This was a wonderful conversation; I appreciate it.

Host: Just a couple of comments as we wrap up. Please leave us feedback in the form we've posted in the chat; it helps us know what you like, what you don't like, and what we can do better. And finally, if you're an engineer who's interested in starting a company, I personally hold office hours every month with engineers in our community who want advice on getting their startups off the ground, so please reach out to me directly if you're interested in participating in one of those sessions — they're designed for the community to support other engineers who want to start companies in 2021. Remember to hit subscribe, and we'll look forward to seeing you for our next DC_THURS. Thanks, everyone — bye, and stay safe out there.
Info
Channel: Data Council
Views: 607
Rating: 5 out of 5
Keywords: data engineering, data pipelines, data catalogs
Id: 4tAGmKqKGh4
Length: 58min 38sec (3518 seconds)
Published: Thu Jan 07 2021