Intro To Databricks - What Is Databricks

Video Statistics and Information

Captions
What is going on, guys, and welcome back to another video with me, Ben Rogojan, aka the Seattle Data Guy. Since I just got back from the Snowflake Summit, I feel like it's only appropriate to do a video about Databricks. The focus of this video is to answer the question: what is Databricks, and why do people use it?

When you look at the fact that Databricks reported around 800 million dollars of revenue in 2021, it's got to make you stop and wonder where in the world they are going to grow next. And since there have been a few times when Databricks has essentially passed the value of Snowflake based on its VC funding and valuation, it makes you wonder which tool is going to win out in this battle of what people are calling data lakehouses. Arguably that whole concept was pushed more by Databricks, but both solutions are trying to sell themselves as data platforms, not just a data lake or a data warehouse on the cloud; they want you to know that they are so much more. So let's dive into Databricks.

Databricks itself wasn't started until about 2013, but much of the development of Spark happened well before that. There are actually a few research papers you can pick up, including one on resilient distributed datasets, or RDDs, which is the abstraction Spark's processing model is built around. I'll put the paper up here, along with a link, for anyone interested in learning more. Spark was developed by researchers at UC Berkeley, and eventually, like anything else that is difficult to manage, people wanted an option for a managed Spark service. If you're familiar with AWS EMR or GCP's Dataproc, that's essentially what those give you: a way to set up Spark jobs on managed infrastructure. But what if you went a few steps further? That's where Databricks comes in.

Databricks is not built on just one open-source project; at its core it's several: Spark, Delta Lake, and MLflow. Spark is pretty much unavoidable, since you'll use it whenever you're processing data. Delta Lake is how you set up Delta tables, which is something we can dive into in a second video. MLflow is more of an option. For those of you who haven't worked with it, MLflow answers a lot of the questions you have as a data scientist about how to deploy a model: it takes care of the model registry, model deployment, and some model monitoring, the things we don't always know what to do with. You're like, "I've developed a model, now what do I do with it?" MLflow is one answer; another option you might have heard of is Kubeflow. So MLflow is what Databricks uses, along with Delta Lake and Spark, but most people will interact primarily with the Spark layer, in a way that is very friendly to any data scientist or data engineer, because it's set up so that if you're familiar with Jupyter notebooks, you're going to do great.

So let's go over Spark lightly, so you can understand what it is, what it's doing, and what the whole focus is, including what an RDD actually is. Apache Spark was started in 2009 at UC Berkeley's AMPLab, with the goal of keeping the fault tolerance and scalability you get with Hadoop while also providing the ability to reuse sets of data across multiple computations.
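To make that RDD idea a bit more concrete, here's a minimal PySpark sketch (not from the video; the log path and variable names are made up) showing the kind of in-memory reuse across multiple computations that the RDD paper describes:

```python
from pyspark.sql import SparkSession

# Start a Spark session; on Databricks a session named `spark` already exists in notebooks.
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()

# Build an RDD from a text file (hypothetical path) and cache it in memory
# so multiple downstream computations can reuse it without re-reading the file.
lines = spark.sparkContext.textFile("/tmp/server.log").cache()

errors = lines.filter(lambda line: "ERROR" in line)
warnings = lines.filter(lambda line: "WARN" in line)

# Both counts reuse the cached `lines` RDD instead of rescanning the source file.
print(errors.count(), warnings.count())
```

Without the `cache()` call, each count would rescan the file from scratch, which is exactly the kind of repeated work RDDs were designed to avoid.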
Now, I think it would be remiss if I didn't go over data lakehouses, because Databricks has clearly decided to bet on this horse, and basically every ad I've ever seen for Databricks pokes fun at the concept of a data warehouse. In their view, the future of development and data management isn't a data warehouse but a data lakehouse. Both Snowflake and Databricks have their own definitions of what a data lakehouse is. If you ask Snowflake, they'll define it as a combination of a data warehouse and a data lake that tries to capture the benefits of both: the cost effectiveness of a data lake with the data management benefits you get in a data warehouse, things like security and clear table structures that make it easy for analysts and future developers to actually approach the data, rather than just a bunch of files someone has to sort out. One thing I do find interesting is that Snowflake seems to push the lakehouse more toward the data science use case, whereas Databricks is clearly saying this is everything: SQL, business intelligence, real-time analytics. That's the difference they're trying to sell.

My personal impression of Databricks, which again is selling this lakehouse architecture, is that it is geared heavily toward data scientists. That's not to say it isn't built for data engineers, but with everything centered on notebooks (and we'll talk about some of the other features they offer), notebooks are just so quintessential to most data scientists that that's the feeling I get. There are other things you'll learn about, such as jobs, how you can structure tables, and how you can stream data directly from something like your Kafka instances, all of which plays into the micro-batch, streaming, and batch ETL processes that are very familiar to any data engineer.

So let's dive into Databricks. First I'll lay out the key components you're going to work with, and then we'll go right into the product. What you'll get used to is the concept of workspaces, notebooks, tables, jobs, clusters, and libraries. There are a few other components, but those are the main ones a user has to become familiar with. Tables are interesting because they're often an abstraction over files, but in many ways that's all tables are in a lot of modern architectures anyway: the difference between a table and a file is diminishing, and schema is becoming more and more "on read," even in the Snowflake world.
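To illustrate that "a table is mostly a file, with schema on read" point, here's a small, hypothetical PySpark sketch (the path and view name are made up) that reads a raw CSV and immediately queries it like a table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is worked out when the file is read, not declared up front.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/mnt/raw/orders.csv")  # hypothetical path in cloud storage
)

# Registering the DataFrame as a temporary view makes the "file" queryable like a table.
df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS order_count FROM orders").show()
```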
So let's dive into the Databricks UI and go through some of the components I referenced earlier. Workspaces are fairly self-explanatory: this is the space where you work. You can work out of your own user workspace, or you can create a shared workspace, so if you have a team you can collaborate in the shared workspace, and if you're just by yourself you can stick to your user workspace.

Next, you can see a lot of what you can do in Databricks through the Create button. It shows that you can create notebooks, tables, clusters, jobs, and repos.

Clusters are generally self-explanatory if you understand Spark: think of them as the amount of compute you're selecting, in terms of how many workers you'll have and how large the machines are. For example, if I create a test cluster, I can choose a standard mode, or single node, or high concurrency. You then select your Spark version; more than likely you want whichever Spark version is fully supported at the time. From there you pick how big your machines are going to be. Obviously, the larger the machine and the more data you're processing, the more expensive it gets, so for now keep it as small as possible. You can also set the cluster to terminate after a period of inactivity, which is great for reducing cost when you're not using it. Then you hit Create Cluster, and as you develop different notebooks or jobs later on, you can point them at different clusters as you go.

Next, let's go over tables. This is roughly what it sounds like. The interesting thing is that you're dealing with tables at the abstraction of, essentially, a file, which in a data lake or a lakehouse is the context in which you deal with tables anyway, but Databricks makes that very explicit. You can either upload a raw file like a CSV, or you can pick a data source: something like Azure Blob Storage (you could use S3 if you're on AWS, but I'm using Azure), or Kafka. From there you can create a notebook that shows how to import said table. With tables it's important to understand that there are different kinds: some are external, some are managed (internal), and some are backed by Delta. Delta tables give you benefits like ACID transactions, whereas plain tables do not, and there's more nuance around internally versus externally managed tables; I'll put up a quick chart here to show the differences between the table types in Databricks.
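As a rough sketch of the managed-versus-external distinction and the Delta-backed tables mentioned above (assuming a Databricks runtime where Delta Lake is available; the table names and storage path are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-tables").getOrCreate()

df = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 19.99)],
    ["id", "name", "price"],
)

# Managed Delta table: Databricks owns and manages the storage location.
df.write.format("delta").mode("overwrite").saveAsTable("products_managed")

# External Delta table: the data lives at a path you control (hypothetical location),
# so dropping the table removes only the metadata, not the underlying files.
(
    df.write.format("delta")
    .mode("overwrite")
    .option("path", "abfss://lake@myaccount.dfs.core.windows.net/products")
    .saveAsTable("products_external")
)

# Delta's transaction log is what provides the ACID guarantees mentioned above.
spark.table("products_managed").show()
```

The practical difference: dropping `products_managed` deletes its data along with the table, while dropping `products_external` leaves the files sitting at the storage path.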
Once you've created a cluster and a table, you can move on to notebooks. The great thing about Databricks is that it gives you a few options for which language you use in your notebook: Python, Scala, SQL, or R. So let's just do an R test. We'll create it on the cluster I set up, and that gives us a notebook to work in.

Here's where things get even better. A lot of data scientists wonder how to actually put a notebook into production once it's ready. You can go to Create, hit Job, and turn that notebook into a job. We'll call this the R test task, select the notebook by browsing to my user folder and picking the R test notebook, confirm, and hit Create. What you'll see is that it not only lets you create this task, but lets you create more by hitting the plus button, which creates a dependency. So for anyone trying to figure out how to set up dependencies, or just their jobs in general, this is something Databricks takes care of for you. Not only that, it lets you hit Schedule, so you can decide when the job actually runs: hit Schedule, set some sort of timing, and that's very helpful; if you want a more complex schedule, you can also use cron syntax. Hit Save to schedule the job. You can also connect with Git, so if you want some form of version control, that's integrated into Databricks as well. This is something I really enjoy about the Databricks developer experience: so much of it is integrated, and I do wish Snowflake were a little more like that, with more of your workflow in one place. You can also swap out whichever Spark cluster configuration a job uses, so if you decide the cluster is no longer sufficient, or maybe it's overkill, you can resize it and fix that. That's what's great about jobs: they let you take a notebook and truly productionize it (for anyone who prefers code to clicking, there's a rough sketch of this kind of job setup below).

All of these components are great because they make data science workflows much easier to productionize. If you're really comparing Databricks to Snowflake, you're not going to get that in the current setup of Snowflake, although I do think Snowflake is trying to add more with the ability to run Python against your data. But in terms of native functionality, the fact that Databricks uses Spark as its underlying processing engine is what lets it run multiple languages in jobs and notebooks, and lets it build jobs in this way. You do have tasks in Snowflake, but I've always wished those tasks were more like this, something I could actually see and create in the UI rather than through a purely SQL-based approach.

Anyways guys, this is your intro to Databricks. If you enjoyed it, let me know in the comments below, and maybe I'll create a few more videos like this if you're liking it. I will see you next time. Thank you and goodbye.
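For anyone who would rather define that notebook job, its dependency, and its cron schedule in code instead of clicking through the UI, here's a rough sketch against the Databricks Jobs REST API (this isn't shown in the video; the workspace URL, token, notebook paths, and cluster ID are all placeholders you'd replace with your own):

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                      # placeholder token

job_spec = {
    "name": "r-test-job",
    "tasks": [
        {
            "task_key": "r_test_task",
            "notebook_task": {"notebook_path": "/Users/me/r-test"},  # placeholder notebook
            "existing_cluster_id": "<cluster-id>",                   # placeholder cluster
        },
        {
            "task_key": "downstream_task",
            # depends_on makes this task wait for r_test_task, i.e. the "plus button" dependency.
            "depends_on": [{"task_key": "r_test_task"}],
            "notebook_task": {"notebook_path": "/Users/me/downstream"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    # Quartz cron syntax: run every day at 06:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```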
Info
Channel: Seattle Data Guy
Views: 83,514
Keywords: apache spark, what is databricks, why does everyone like databricks, what is a delta table, what is delta lake, intro to data bricks, intro to databricks, intro to apache spark, intro to spark, how to learn databricks, databricks tutorial, databricks tutorial for beginners, databricks vs snowflake, databricks delta lake, databricks demo, databricks azure, seattle data guy, data engineering, data science, scheduling jupyter notebook, python, scala
Id: QNdiGZFaUFs
Length: 12min 28sec (748 seconds)
Published: Fri Jul 01 2022