Master Databricks and Apache Spark Step by Step: Lesson 1 - Introduction

Captions
Welcome back to my channel. My name is Bryan Cafferky, and this is Lesson 1, an introduction to an exciting new series I'll be doing on Databricks and Apache Spark. I'll cover them in parallel because, as you'll see, they're really overlapping topics: Databricks uses Apache Spark under the covers. There's also a Lesson 0, which you may want to go back to, that gives an overview of the series, my goals, and where we're going.

In this episode we'll cover: what is Apache Spark, what is Databricks, scaling up versus scaling out with Barry the weightlifter, my kitchen drawer and the Apache Hadoop project, and then a deeper look at understanding Apache Spark and Databricks. I'll be drawing on a lot of material from my new book, Master Azure Databricks Step by Step, which is available on Amazon, sometimes going beyond it and in other ways not in quite as much depth.

Let's start with a seemingly simple question: what is Apache Spark? The simplest answer is that it's the most popular open-source big data platform for data science. When I say big data, this includes things like streaming data, video, images, structured and unstructured data, and of course volumes of data you typically couldn't handle well with legacy technologies. The thing to remember about Apache Spark is that while it provides an awesomely powerful platform for your data processing, it doesn't give you any extra tools: you get no IDE (integrated development environment), nothing beyond the bare bones, and it has pretty weak support for collaboration. Finally, it's not optimized for the cloud, and if you're paying attention to what's happening these days, everything's going cloud. Spark can run in the cloud, and it does, for instance as HDInsight on Azure, but it wasn't designed for that environment.

Now let's talk about Databricks. Possibly the most important thing to understand about Databricks is that it's a commercial product, not open source, created by the developers of Apache Spark. There's a company called Databricks, and the product they make is also called Databricks. It's really meant as a complementary service around Apache Spark: a complete development environment designed to get you up and running quickly with Spark, with numerous proprietary Spark enhancements, and it's ideal for data science team collaboration. When I worked at Microsoft and pitched Databricks to people, that was exactly the nutshell explanation I would give. It's designed for teams of any size, but large-team collaboration is where it's especially good. And when you think about data science teams, you really have to picture a diverse set of people. It's not just data scientists: you'll have data engineers; domain experts in the given business (especially in a field like healthcare, there will be people who really understand the topic); business analysts; maybe extra programmers; people focused on deployment and DevOps; and of course, key to the whole thing, the statisticians and data scientists. Databricks is designed to bring all of that together.
It can even bring together multiple data science teams. Databricks brings a lot of powerful tools to the table; as I mentioned, Spark itself has essentially nothing there, so this is really handy. Databricks is also optimized for the cloud; in fact, it only runs in the cloud. If you say, "I want to run Databricks on premises," you can't, as far as I know; I've looked around. That's one of the things you should realize.

Now that we understand the difference between Spark and Databricks, which is essentially a big wrapper around Spark giving you jump-start access and tools, let's talk about what we mean by a big data platform. What is it doing that's so special? To explain that, I'm going to use an analogy I call scale up, scale out, and Barry the weightlifter. Barry represents the legacy approach to scaling, which is called scale-up. Back in the day (trying to talk like an old guy here, I don't know if I'm pulling it off), technology kept advancing far ahead of the need. You'd have a computer and feel pretty good; it did everything you needed. New chips with faster processors kept coming out all the time, and the memory capacity you could put in a given machine kept going up exponentially. It seemed like that would never end, but it did, because we reached a point where machines couldn't expand fast enough for the need, especially in the area of big data. You started to see it with internet services: imagine trying to process the web logs, with all that tracking going on, for a site like Amazon. It must be billions of rows per day, maybe more, and there's no way traditional tools or environments could process that. Companies like Google hit this problem before almost anyone else, but now it's a problem across the globe: supporting all the different types of data, video, imaging, sound, log files with billions or trillions of rows. How do we handle that? What do we do?

Well, with scale-up technology, Barry the weightlifter would normally just add more beef, more muscle, more girth, so he can handle a bigger load. But we reached the cap where Barry can't grow anymore. So a while back someone had an idea: instead of having one person try to do all the work, what if we took a given job, partitioned it, broke it up, and had a group of people all work together on it? And this group of people didn't have to be big, powerful weightlifters; they could be ordinary people.

Let's see how that works. Imagine we have a phone book, and we want to find everybody living on Main Street. You don't even care which Main Street or where it is; you just want anyone living on a street named Main Street. Of course, if one person did this, it would take a long time; they'd have to go through the whole phone book. But what if, instead, we broke up the phone book? It's already handily sorted by last name, so we could split it by the first letter of the last name: one person gets the A's, another the B's, the C's, and so on, and then we ask them all to do the search.
Each person then searches their own section and tallies up how many occurrences they find. The person who made the request collects back all the sheets with the numbers and, of course, has to tally up the tallies to get the grand total. That's called scale-out. Now replace the people with machines, and that's your scale-out technology. These machines are called nodes, and they don't really have to be physical machines; thinking of them as separate running processes is good enough. Somebody made the request of them: "This is what I need you to do, guys, get out there and do it." That's the cluster manager. The cluster manager makes the request, and the results go back to the cluster manager to be consolidated and handed back to what's called the driver context. The context is basically you saying, "I need to know this," and the request flows down through these layers. I'll show you more of that later, but that's the concept. The most important takeaway is that instead of one machine doing the work, the work is distributed among any number of machines. In newer parlance you might even say containers, but we don't have to worry about that; just think of separate processes running on separate machines, as opposed to one. That's really the main thing, and the sketch below walks through the phone book version of it.
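To make the scale-out idea concrete, here is a minimal sketch in plain Python (not Spark): the phone book is partitioned by the first letter of the last name, each "worker" tallies its own slice in a separate process, and the "driver" adds up the tallies. The data and names are made up for illustration.

```python
# Scale-out in miniature: partition the job, let workers tally in
# parallel, then combine the tallies -- the same pattern Spark and
# MapReduce apply across machines.
from concurrent.futures import ProcessPoolExecutor

phone_book = [
    ("Adams", "12 Main Street"),
    ("Baker", "3 Elm Street"),
    ("Cooper", "77 Main Street"),
    ("Davis", "9 Oak Avenue"),
]

def partition_by_first_letter(entries, num_partitions=4):
    """Split the book so each worker owns a slice of the alphabet."""
    buckets = [[] for _ in range(num_partitions)]
    for last_name, address in entries:
        idx = (ord(last_name[0].upper()) - ord("A")) % num_partitions
        buckets[idx].append((last_name, address))
    return buckets

def tally(partition):
    """One worker's job: count Main Street addresses in its slice."""
    return sum(1 for _, address in partition if "Main Street" in address)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # the "cluster" of workers
        per_worker = pool.map(tally, partition_by_first_letter(phone_book))
    print(sum(per_worker))  # the driver tallies the tallies -> 2
```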
The first major open-source product that, at least as far as I'm aware, did this was something called Hadoop, with the cute little elephant logo. As the story goes, it's named after the creator's child's stuffed elephant; the child couldn't pronounce "elephant" and said "Hadoop" instead. However it got the name, it goes down in history: Hadoop was the first open-source big data platform. What happened over time is that other services and tools were added around Hadoop, contributions that augmented it, or big data services that had nothing to do with Hadoop per se. So Hadoop was broadened into the Apache Hadoop project, which caused some confusion: we had a single product called Hadoop, and now an entire project called Hadoop. There's a separation there, which we're going to see, but the Apache Hadoop project got kind of confusing, with a lot of things in it. That reminds me of my kitchen drawer. I have a kitchen drawer, a junk drawer, where you throw spare keys, batteries, old phones, calculators, and so on: all the miscellaneous things you have nowhere else for. The Apache Hadoop project is a bit like that now, and I want to prepare you: what I'm showing is only a subset of what's in the Hadoop project today.

The original product sits at the core, and it now goes under the name MapReduce; you typically don't say Hadoop anymore when you mean that core set, you say MapReduce. At the top is that architecture we just talked about: splitting out, partitioning, and then pulling the results back together, in other words parallel processing. Underneath that, you need something to coordinate and acquire all the resources, jobs, and memory needed to support MapReduce; that's YARN, which stands for Yet Another Resource Negotiator (weird name). Underneath that is a really intriguing service that acts like a regular file system, with the folders and files you're used to, but with a twist: it understands partitioning. While it doesn't necessarily show it to you, it partitions the data on the drive, so you may look at it and see one big phone book file, but in reality it's partitioned, say, by last name, like we talked about. That's the Hadoop Distributed File System, also known as HDFS. HDFS is ubiquitous now; it's used throughout the big data world, and when you talk about big data storage, even on the cloud, it's typically using this HDFS type of approach underneath. So those are the core services.

Now, one of the problems with MapReduce is that, in my opinion, they didn't think it out too well; I think some pieces were added later as an afterthought, as in "oh yeah, we want to have machine learning," so they added Mahout. And to be honest, I kind of gave up on Hadoop MapReduce, because the way you had to use it was through Java. Some people said you could use other languages, but all the documentation was centered on Java. It was very complicated: how you worked with it was not separated from the implementation, so you had to understand that you were splitting out your data and submitting it into many layers of batch jobs. It was not intended for real-time processing; it was meant to be batch-oriented, and that's what it is. So Hadoop had that core, and then other things were added. Hive was added to hide that complexity and let you use good old structured query language, and Pig was added to support ETL. Those are the original services. Later on, ZooKeeper was added for distributed system coordination. There's streaming support with Storm (it's kind of a separate project); Lucene, which is something like an open-source Google search; the NoSQL databases HBase and Cassandra; streaming data ingestion with Flume; a queue service with Kafka; and Sqoop for moving data in and out of SQL databases. So you can see there's a lot going on here; as I said, this project just keeps growing, adding yet more things to our kitchen drawer.

But what we really want to focus on, and what my series will focus on, is the pink box I put up here to stand out: Spark. There are a few things I want to emphasize about Spark. One is that, for the most part, it's fair to say Spark is a replacement for Hadoop MapReduce, at least in the historical context. Because MapReduce physically reads from and writes to storage, it has a bottleneck; anyone who works with databases knows that reading and writing to disk slows everything down. Spark was designed from the beginning to work in memory as much as possible. Think of all those partitions pinned in memory using a cache: if there's enough space, Spark will hold them there and not spill them back to disk, which increases processing speed a lot. Spark typically performs up to 100 times faster than MapReduce, and that alone is enough to sell it.
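Here's a minimal PySpark sketch of that caching behavior. The input file name and its "status" column are hypothetical; the point is that the first action fills the in-memory cache and later actions reuse it instead of re-reading from disk.

```python
# A minimal sketch of keeping partitions in memory with PySpark.
# The file "web_logs.csv" and its "status" column are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

logs = spark.read.csv("web_logs.csv", header=True)
logs.cache()  # ask Spark to pin the partitions in executor memory

logs.count()                               # first action reads disk, fills the cache
logs.filter(logs.status == "404").count()  # reuses cached partitions, no re-read
```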
But I think there are some other distinguishing factors. Spark was able to benefit from looking at what Hadoop didn't do well. Hadoop did the scale-out well, but it didn't do the best job, for instance, at interfacing: not everybody wants to write in Java, including me. So Spark added generalized APIs and made itself very open, so that anyone who wants to write a language interface to Spark can. Fortunately, Python being the active community that it is, they wrote something called PySpark, which gives us a module with DataFrames and functions that look and act a lot like pandas and the like, so we can now interact with Spark from Python (I'll show a small taste of that in a moment). R also has libraries created for it, so you can use the R programming language, and native to Spark is a language called Scala. A little after the core implementation, Spark SQL was added, which is huge, because now you have structured query language support on top of Spark. So that's a lot of stuff. We also have scaled-out machine learning with MLlib, streaming with Spark Streaming, and GraphX. Pretty cool, pretty exciting.

Now, most of the functionality in Spark really focuses on the data engineering side, its munging and querying facilities. That's the real power; it's what you need most of the time. Because of that, this series will focus almost exclusively (unless I decide to dip a little into other areas) on data engineering, which is all based on Spark SQL. Although it's only one box in the diagram, if you were to represent how important Spark SQL is to Spark, I think it should take up about eighty percent of the upper box. We're going to learn a lot about it; it's what R and Python really leverage. We will dip into the machine learning side as well, scaled-out machine learning, and I'll even get into some other nuances and extra features, like how this might integrate with some cloud services, because I have that in my book, so I'll pull that in. I'm probably not going to cover Spark Streaming yet; I'd like to get through the basics, get you up to speed, and make you a Jedi Knight, as I mentioned in my original orientation video (that's the goal, not a Jedi Master). Spark Streaming is in my plans down the road once we get through the basics.

GraphX is pretty cool too, probably the least used of the four services, but the idea behind GraphX is that you can do network queries. What do I mean by that? Well, you're on LinkedIn, right? On LinkedIn, people are connected to people connected to people. So I might start with Bryan and ask who's connected to Bryan: Bryan's connected to Mary and Sue. Who are they connected to? Mary is connected to Bob, Bill, and Harry; Sue is connected to Jane, Bill, and Bob; and you keep navigating down the network. It can also be used for physical kinds of networks, like actually going from Boston to LA through what are called nodes, which are points, with the lines between them called edges. There are many ways you can use that, and it's getting more and more popular, especially when you think of IoT types of applications. So we've got all these pieces; that's Apache Spark. It's pretty cool, I like it a lot. It's doing massively parallel processing, meaning lots of like things processing at the same time. And most importantly, Databricks uses all of this but adds things around it as well, which is something you'll see more of as we get through the series and start demonstrating things.
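As promised, here's a small taste of that pandas-like PySpark DataFrame API. The data, column names, and values are all made up for illustration.

```python
# Filter, group, and aggregate with PySpark DataFrames -- the same
# verbs a pandas user expects, running on Spark's scale-out engine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

sales = spark.createDataFrame(
    [("Boston", 120.0), ("LA", 80.0), ("Boston", 45.5)],
    ["city", "amount"],
)

(sales.filter(F.col("amount") > 50)
      .groupBy("city")
      .agg(F.sum("amount").alias("total"))
      .show())
```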
The overall architecture in this slide is meant to give you that sense again. The most important thing to reinforce is that you've got things running in parallel: think of people all working concurrently, and you pull it all together at the end. The importance of that can't be overemphasized, because it has impacts you may not even think of. For instance, there's a lot of work going on to keep the data separate. You don't want data from one node shared with another node, in the sense that you don't want duplication: if a city's data is being processed on one node and the data is separated by city, you need to make sure that's all that's in that node, because if a node handles redundant data, it will give you bad results. There are a lot of different things going on here, and when I get deeper into the presentations I'll talk more about that, so keep this architecture in the back of your mind.

The driver context is you. You're in there, and you say, "I want to do something." I think of it a little like a shell prompt you might get in Python, or even a bash prompt. That context connects you to what's called the cluster manager. The cluster manager is essentially responsible for getting the work done. Think of walking up to your Dunkin' Donuts to buy a few things: they ask what you want, and then they tell Mary and Bob and Bill to start making your coffee and get you a donut. That's the cluster manager, coordinating what goes on based on the user's request. Then the worker nodes under the covers do the actual work; you can see the little box labeled cache, keeping data in memory. They do what's needed and give you the result back.

At the bottom you see data sources, and notice the first is HDFS. You'll hear buzzwords like Azure Data Lake Storage Gen1 and Gen2; those emulate HDFS, acting like it but implemented in a much more efficient way, using cloud technology to be even faster than you'd typically expect. You can also have SQL databases, or engines with SQL interfaces, and NoSQL engines; you can access pretty much anything, since there are lots of drivers you can go through in Spark, and hence in Databricks, to get almost any data you can think of.

A really key thing to understand with Spark, because I get this question a lot, is: "Where do I store my data?" Well, that's kind of like having a vacuum cleaner and asking where to put the carpet. Spark is a query engine, a data analytics engine; it's meant to run queries. It isn't a storage engine, so it's not like a relational database, where you store the data and query it all in one. And when people get into data lakes, I think this is a big point of confusion: a data lake is just a place, like a file folder, where you put things. Yes, it may be HDFS under the covers, and it may have features that make it faster, and even partitioned, but in the end it's just the place you put the data. Spark really doesn't care much about that. Spark will just take the data from wherever you tell it to get it, ingest it, and process it massively in parallel.
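To illustrate that separation of engine and storage, here's a hedged sketch of Spark reading from two different kinds of storage through the same read API. The paths, connection URL, and credentials are all placeholders, and the JDBC read assumes the matching database driver is available on the cluster.

```python
# Spark as a query engine over storage it doesn't own: the same API
# reads from a data lake path or a relational database. All names,
# paths, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").getOrCreate()

# Files in a data lake (an HDFS-style URI; the exact scheme depends on your cloud)
events = spark.read.parquet("abfss://container@account.dfs.core.windows.net/events/")

# A relational table over JDBC (assumes the PostgreSQL driver is installed)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())
```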
So again, Spark is not the same as the data; they're separate, and Spark really doesn't care where you put the data or where it resides. Getting it stored is up to you. Don't confuse something like Spark with storage like a data lake; they're really two different things, though having a data lake with optimized storage is certainly a good idea when you're dealing with a lot of data.

Now, when we look at Apache Spark, we've seen this a little already, but I want to talk a bit more about the services you see in these tiers. In the middle is Spark Core. Spark Core is the original service; under the hood it uses a data structure called Resilient Distributed Datasets (RDDs), which we'll see more about. It was a pretty simple service in the beginning, and the other services emerged quickly and are still evolving, but Spark Core is the baseline everything works from. To manage these services, there are several different software systems you can use. We saw YARN, Yet Another Resource Negotiator, which is part of the original Hadoop MapReduce, and guess what: it works fine for managing resources in Spark. This is where I think the thinking that went into Spark was pretty good: it's flexible, extensible, and interchangeable with other components, which is really important when you want something with long-term growth. Mesos is another popular resource manager, and Spark comes with its own standalone scheduler as well, which is also a resource manager.

Above this, you see the four boxes with arrows pointing down. The most important is Spark SQL. It was added somewhat after the fact, for reasons I'll talk about soon, but it gives you structured query language support, so you can query Spark data just as if it were a database, wherever it resides; mind you, there is no physical database in Spark (and by the way, yes, I have a Boston accent and I drop my R's; sorry, I know I do that). You're going to see that Spark SQL supports this, and we'll see more about how it implements it. We have MLlib, which provides massive scale-out for machine learning; Spark Streaming, which allows streaming directly into Spark; and GraphX. We've seen these pieces, but I want to reinforce that Spark is not meant only for batch processing the way MapReduce is. It supports interactive queries, you can architect it in ways that are very responsive, and there's a very wide range of applications it can support. And as I mentioned, you have support for Python, R, Scala, Java, and SQL.
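Here's a minimal sketch of that Spark SQL support: register a DataFrame as a temporary view, then query it with plain SQL. No physical database is involved; the data is made up for illustration.

```python
# Spark SQL in action: a DataFrame becomes a queryable "table" via a
# temporary view, with no database server behind it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

people = spark.createDataFrame(
    [("Mary", "Boston"), ("Bob", "LA"), ("Sue", "Boston")],
    ["name", "city"],
)
people.createOrReplaceTempView("people")

spark.sql("SELECT city, COUNT(*) AS n FROM people GROUP BY city").show()
```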
Now let's take another look. Hopefully the diagram on the right, which I pivoted on its side, looks familiar: that's Spark, that's what it represents. On the left is the Azure Databricks user interface, or what I'll sometimes call the Databricks portal. Actually, I should just say Databricks portal rather than Azure Databricks; this is just where I took the screenshot, and if you were running on AWS, it would look the same. What you see is a user interface, and the IDE we're actually using here is what's called the Databricks notebook. If you've used something like Zeppelin or Jupyter notebooks, it's the same idea, but it's a very powerful interface with a lot of extra enhancements that make it really nice for development.

Some of the tools that are added on include the notebook itself, which I've mentioned, plus a kind of file storage system based on integrated blob storage: in AWS or Azure, Databricks automatically attaches itself to a blob storage account and presents it like a local file system drive, to make your work easier. Another key thing you can't overestimate is how easily you can create Spark clusters on the fly when you need them, pause them, and start them again. Databricks provides an umbrella over any number of Spark clusters, whereas typically with plain Spark you'd create one cluster at a time, do what you need with it, and then create a separate cluster; you don't really have a great way to administer a set of them. So that's also a really nice feature you get with Databricks. There's also a job scheduler. When you work with data, you quickly learn that you want to run things on a schedule; you don't necessarily want to get on at 2 a.m., or on a Sunday night, to run something. The ability to set a program to run at a certain time on given days is really powerful, and fortunately one is built into Databricks.

We also have security enhancements for secure collaboration. The idea is that there are permission settings in various places throughout Databricks that let us say these people can share my notebooks, these people can share the cluster, whether they can run or use the cluster, start the cluster, or even create clusters. All these things are secured, you can set read-only in many places, and it's integrated with Active Directory on Azure, which is really nice; AWS has its own security integration. So you get a lot of nice benefits from that.

The other thing you'll notice, if you look at the little gray cell boxes where your code goes in a notebook, is that one of them says dbutils and fs mount. Those are language extensions that are not in open-source Spark; they're things Databricks adds to give you more functionality. This particular statement reaches out and connects to a separate, external storage account in Azure, so we can then use it as if it were local.

Databricks also does performance optimization: a lot of things happen in Databricks that make it much faster than open-source Spark. And I want to emphasize again that Databricks was designed for the cloud, in a way Spark itself was not. You really see this in practice: recently I went to create an HDInsight Spark cluster and waited, I think, 30 minutes before it came back and said okay, and mind you, this was a very small cluster. Doing the same thing in Databricks typically takes about five minutes. Not only that, but in HDInsight you cannot turn off or pause a cluster; the only way to stop paying for it is to delete it. In Databricks you can just pause it. The definition stays there (in perfect honesty, it's not much different from deleting), but it remembers everything, and you don't lose your data, because the blob storage is separate. Your notebooks are certainly still there, and you can still get to them even with no clusters available. Most importantly, you can start and pause clusters at will.
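For reference, here's a hedged sketch of what that dbutils.fs.mount call looks like in a Databricks notebook on Azure. The account, container, mount point, and secret scope names are all placeholders, and dbutils only exists inside Databricks, not in open-source Spark.

```python
# A sketch of mounting Azure Blob Storage in a Databricks notebook.
# Every name below (account, container, mount point, secret scope) is
# a placeholder; dbutils is a Databricks-only extension.
dbutils.fs.mount(
    source="wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.account.key.myaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

# Once mounted, the external storage behaves like a local path:
df = spark.read.csv("/mnt/mydata/sales.csv", header=True)
```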
This gives you a lot of flexibility over how you run the environment and pay for resources, and it all runs on Spark. To re-summarize what you get with Databricks versus Spark: picture the Spark Core engine, and then what Databricks adds on top. Databricks adds easy maintenance of clusters through a GUI; a simulated file system on top of blob storage; Databricks notebooks, which give you your development environment; and library management, which I didn't mention yet. Very often you'll want to reach out and install custom libraries, maybe special Python or R libraries, and install them on the cluster so you can use them. A lot is already included in the Databricks runtime, so you may not have to (pandas and the typical libraries are already there), but if you find one that isn't, like the psych package for R, which I was using in a demonstration, you can install it; there's a whole GUI for that. Another nice feature is that when you restart clusters, Databricks automatically remembers which libraries you want on the cluster and reinstalls them, which would otherwise be a pretty big maintenance task if you were just using open-source Spark. You also have, as I mentioned, job scheduling, and of course all the security settings. That's the takeaway; I hope it's clear what one gives you versus the other.

And again, since Databricks runs on Spark, it supports Python, R, Scala, Java, and SQL. Java is sort of indirectly supported in Databricks: you can't just put Java code in a Databricks notebook, but since Scala runs on the JVM, any libraries Java has you can leverage through Scala and call in Databricks, which means you can import those libraries too.

Here's one visual that I hope helps. Apache Spark is like the Legos on the left: it's the bare bones, the starter kit. You get what you need and you can build what you want; flexibility, but nothing to jump-start you. Databricks is more like the set on the right: you've got the castle, you've got the walls, all this cool stuff. Not only do you get a jump start quickly, but I also want to emphasize there's a certain fixedness as well. Databricks gives you certain services, and if you're not going to get value from them, say you just want to run a Spark program 24x7, perhaps for IoT or streaming (I talked to a customer once who just ran a Spark cluster constantly to support their production environment), then Databricks probably isn't the best answer. It's not really giving you a lot of benefit if you just want to run a cluster, and it will be cheaper to simply start up a cluster and run it if you don't need any of the other tools.

So, wrapping up: we talked about what Apache Spark is, and we learned that it came out of the history of Apache Hadoop MapReduce and leverages that scale-out architecture, where we use multiple machines concurrently to get work done. And I'll back up, because I know someone will correct me on this: technically, we're running multiple processes concurrently, in parallel; they may or may not be physically separate boxes, but they are separate processes that don't interfere with each other, which is more technically correct.
That's Apache Spark. We learned that Databricks is a company founded by the developers of Apache Spark, with the intention of making Spark much easier to use, enabling people to jump-start and get in quickly, and generally focused on data science, especially large data science team collaboration. We talked about scaling up, scaling out, and Barry the weightlifter, contrasting one machine with parallel processing on potentially many machines. And we talked about the Apache Hadoop project and all the different things in it, which I equated to my kitchen drawer, because it can be confusing; and you're not alone, it is confusing, so don't feel bad if you think, "Wow, there are so many tools, and I don't know what they are." That's why I wanted to walk through them. The only thing you need to worry about for this series is Apache Spark. Finally, I hope I've clarified the difference between Spark and Databricks. If it's a little vague, watch the video again, but if not, don't worry: we're going to go through concrete examples, we'll create the tools we need in the cloud, and we'll do what's needed to see how this works, and I'll be with you, holding your hand the whole way and explaining how it all works. I hope you like it and that it's clarifying things. So like, share, subscribe, all those good things, but most importantly, get value out of this. I hope you do. Thank you.
Info
Channel: Bryan Cafferky
Views: 19,771
Rating: 4.9732442 out of 5
Keywords: Databricks, Spark, data science, machine learning, Python, SQL, big data, analytics, training, tutorial
Id: C496WTDhyFo
Length: 31min 59sec (1919 seconds)
Published: Mon Jan 11 2021