>> In this episode, you'll learn about big data clusters in a nutshell from Ben Weissman. Join us on this episode of
Data Exposed MVP edition. [MUSIC]. >> Hi. I'm Anna
Hoffman and welcome to this episode of Data
Exposed MVP edition. Today I'm joined by Ben Weissman, a data enthusiast from Germany. Ben, thanks for joining us. Can you tell us a little
bit more about what you do? >> Sure. Thank you for having me. Yeah, I'm running a
small consultancy firm here in Nuremberg,
Germany called Solisyon. We're mainly focused on data
warehousing, data integration. We've been doing this
for quite a while. I started back with SQL Server 6.5, so I think you were in elementary school at that point. >> Potentially. >> So yeah, Integration Services, all kinds of data warehousing
and then BI, the full stack. One part of that we're
currently looking at, and that I'm very passionate
about is big data clusters, a new technology that came
with SQL Server 2019. Yeah, I've loved it from the very beginning. >> Awesome. Yeah. I know from working with you throughout
the past few years, you've definitely been on
it since the beginning. So today, you're going to tell
us about it in a nutshell. I'm really interested to see how
you're going to pull this off. So I'm just going to
pass it over to you. >> All right. So a big data cluster, I think Travis Wright came up with that, it's not your grandpa's SQL Server. That's a pretty great description of it. So first of all, it only runs on Linux, which is amazing because
if we just think a couple of years back there
wasn't SQL on Linux at all, and now we've got a feature that only runs in containers and
only runs on Linux. So it all runs on the
Kubernetes platform. So it is a whole new approach to how SQL Server applications get deployed. So that's one thing that's
super exciting about it. I want to give you first a little nutshell introduction to the architecture of the whole thing, and then we're going to get
our hands on it and see how that actually works in real life. So the cool thing is, as complex as this looks, there is one guy in
that whole picture, which is the SQL Server master. That is just a regular
SQL Server 2019 with the only exception
that it runs on Linux, which may not be the typical thing. Behind that, there's a
couple of so-called pools. These pools are what makes big
data clusters so exciting. The first one where basically everything is funneled
through is the compute pool. What the compute pool does is scale out queries, so it boosts SQL Server queries to performance levels that haven't been there before. From there, it either goes to data that sits outside of SQL Server, that's called data virtualization. SQL Server is using PolyBase for that. But again, not your grandpa's PolyBase from SQL Server 2016. Now you can get data from Oracle, Teradata, MongoDB, other SQL Servers, all kinds of stuff, bring it into SQL Server, and query it just like local tables.
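For reference, here is a minimal sketch of what such a data virtualization setup could look like when scripted from Python against the master endpoint. The server, credential, connection, and table names are all hypothetical illustrations rather than part of Ben's demo, and the exact LOCATION format depends on the external source:

```python
# Hypothetical sketch: server, credential, and table names are made up.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=demo;UID=admin;PWD=<password>"
)
conn.autocommit = True
cursor = conn.cursor()

# External data source pointing at an Oracle server
# (assumes a database scoped credential named OracleCred already exists).
cursor.execute("""
CREATE EXTERNAL DATA SOURCE OracleSales
WITH (LOCATION = 'oracle://oracle-host:1521', CREDENTIAL = OracleCred);
""")

# External table mapped to a remote table; no data is copied into SQL Server.
# The LOCATION format varies by source type; check the PolyBase docs.
cursor.execute("""
CREATE EXTERNAL TABLE dbo.RemoteOrders (
    OrderID INT,
    Amount  DECIMAL(18, 2)
)
WITH (LOCATION = 'SALES.ORDERS', DATA_SOURCE = OracleSales);
""")

# From here on, the remote data queries just like a local table.
for row in cursor.execute("SELECT TOP 10 * FROM dbo.RemoteOrders;"):
    print(row)
```

Once the external table exists, any join, filter, or aggregation against it is just regular T-SQL; PolyBase handles pushing the work to the remote system where it can.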
Then there's the data pool up there. The idea of the data pool is, you have not one SQL Server but multiple. In my picture it's two, but
it could be four or eight. The idea is you take your fact data. So think of a big fact table
with 100 million rows. You push it to the data pool with two nodes, and you end up with two identical-looking tables, but each of them only holds 50 million rows. So it's getting split in half. The theory being, well, if it's only half the data, it should be twice as fast. That's the whole idea
of the data pool. That's also something
that has never been there in SQL Server in that way. The other component
is the storage pool. It combines HDFS, so a distributed file system, with
Spark and SQL all in one component. All that we see here when you deploy a big data cluster is being
deployed for you at once. So as complex as it looks, you're only deploying one product, and SQL Server is making sure
everything talks to each other, and you only talk to the master. So you only got one endpoint
that you need to talk to. The only exception to that is the
app pool, the application pool. We're not going to go
into detail with that because it does pretty much
everything and nothing. You could use it to deploy
machine learning models, AI models, stuff like that, Integration Services, jobs, so yeah. It is super powerful, but it doesn't do
much out-of-the-box. So what I would like to do today is, we will go to GitHub and get us some data around COVID-19, which is the reason why we're not sitting in the same studio today, but each of us is sitting at home. We'll get some COVID data from Johns Hopkins and also some population data, which is an open dataset, because that's another thing that
people regularly underestimate, in my opinion, that
there's so much data available out there freely
for everyone to use. We'll get that using Spark and Python, push it to the storage pool, and then to the data pool. These files coming from Johns Hopkins have one issue. They're super up-to-date, they're getting updated daily, but they're very wide, because what Johns Hopkins does is add a column for each
day instead of adding rows. So what we're going to do is, using
a very simple Python function, all in Azure Data Studio and all connected to our big data cluster, we will transform this
to long Parquet files. Parquet is basically like a clustered columnstore index for CSVs, it's just way faster. We're going to take these,
merge them together, and push them to the
data pool for caching. We need a local mapping table because the country names in that population CSV and the way they come from Johns Hopkins, unfortunately, are not identical. Then we're just going to
run a query across them. So we basically take one SQL query that combines data
from the storage pool, from the master instance, and from the data pool, all using regular SQL, because all that data, these CSVs, these Parquet files, and everything in the data pool, for me as an end user looks just like a regular SQL table. >> Awesome. Ben, this
has been great so far. I mean, you've really given us a wide range of capabilities that are available
on big data clusters. I'm really interested to see
how we're going to do all of this by just using SQL
and just connecting to, I think you said we're
just going to connect to the master instance for this. >> Correct. >> Awesome. Cool. Well,
let's take a look. >> All right. So let's jump
right into Azure Data Studio. For those that have not been working with Azure
Data Studio before, first of all, you really should
because it's an amazing tool. It's not taking your Management
Studio or anything away, but it has a lot of features that
Management Studio just doesn't have. One of these features is so-called notebooks. That's a super cool feature because notebooks combine the idea that I can put in text, I can put in graphics, and I can put in code, and they also save the results. So the first notebook
that I've prepared for you,
but it's all on GitHub. So if you want to try out that
demo for yourself, just get them. That first thing we'll just create the database
and everything for you including all the schemas that we're going to need and all the end points. So let me close this. But if you want to get started
and get your hands on this, just download it from GitHub
and run that first notebook and it will prepare and
set up everything for you. This is where it gets interesting. As you can see, Anna, we're still just
connected to my endpoint, to my Ben big data cluster here, which is the same endpoint that
we're going to see over and over again across all of these demos. This is where we're
getting the data from, and I'm just using Spark here, Spark and pandas actually. So I'm reading that CSV file from GitHub, filtering it on 2018 because that's the last year that has population data for every single country. Then I'm just saving this as a CSV file to my HDFS.
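As a rough sketch of that step (the real notebook is in Ben's GitHub repo, so the URL, column names, and HDFS path here are assumptions), the population cell might look something like this in a PySpark notebook:

```python
# Hypothetical sketch: the URL, column names, and HDFS path are assumptions.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # provided by the notebook kernel in practice

POP_URL = "https://example.com/world_population.csv"   # placeholder for the real dataset

pop = pd.read_csv(POP_URL)

# Keep only the 2018 figures, the last year with data for every single country.
pop_2018 = pop[pop["Year"] == 2018][["Country", "Population"]]

# Hand the pandas frame to Spark and write it to HDFS inside the big data cluster.
spark.createDataFrame(pop_2018) \
    .coalesce(1) \
    .write.mode("overwrite") \
    .option("header", True) \
    .csv("/covid/population")
```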
That's one of the other cool things here in Azure Data Studio, because Azure Data Studio actually exposes the HDFS right here for
a big data cluster. So I can basically use
Azure Data Studio as my file browser for my HDFS
within my big data cluster. So this is really my
one-stop shop for everything that's
happening in my BDC. So I just ran that script, which updated my CSV
file on my storage pool. The next one is a bit
more interesting. So let me just open this
on GitHub real quick. What Johns Hopkins does here, and this is from two hours
ago, on a daily basis, they just provide
CSVs globally and for the US for confirmed cases, deaths, and so on. Like I said before, you get it per country, including coordinates, and then they just add a column for every single day for each of these cases. While this might be convenient for them and an easy way
for them to provide it, it's not ideal for SQL. If you wanted to do this using traditional SQL, you would need to write some dynamic SQL and then unpivot it. So what we do instead is use pandas and Spark again. There's a very neat function here
in pandas, which is called melt. What melt does is basically
a dynamic unpivot. If we put that in this function here, I can just call it
for the three files that I'm downloading: recovered,
deaths, and confirmed. What this will do
is, it will download the three files, convert them all into Parquet files, and convert them from wide into long. It also gives me a little preview, just the first five rows of each of these files. So we see, okay, the data is as of June 24th, that's yesterday.
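For reference, a minimal sketch of that wide-to-long step; the file names follow the public Johns Hopkins CSSE repository layout, but treat the URLs, column names, and output paths as assumptions, since Ben's actual function is in his GitHub repo:

```python
# Hypothetical sketch: URLs, column names, and output paths are assumptions.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/")

def melt_jhu(url: str, value_name: str) -> pd.DataFrame:
    wide = pd.read_csv(url)
    wide = wide.rename(columns={"Province/State": "Province",
                                "Country/Region": "Country"})
    wide["Province"] = wide["Province"].fillna("")
    # melt() is the "dynamic unpivot": every date column becomes its own row.
    long = wide.melt(id_vars=["Province", "Country", "Lat", "Long"],
                     var_name="Date", value_name=value_name)
    long["Date"] = pd.to_datetime(long["Date"])
    return long

for name in ["confirmed", "deaths", "recovered"]:
    df = melt_jhu(BASE + f"time_series_covid19_{name}_global.csv", name)
    print(df.head())                            # preview: the first five rows
    # Save the long version as Parquet on the storage pool (HDFS).
    spark.createDataFrame(df).write.mode("overwrite") \
        .parquet(f"/covid/covid_{name}")
```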
If we expand our HDFS, you will see the CSV and the Parquet files. I'm usually saving both CSV and Parquet. Even though I'm only going to be using the Parquet in this case, CSV is just easier to read because it's plain text, it's uncompressed and all that. What you can already see here is that every single file is divided into two parts because I'm running
two nodes on my storage pool, so there are going to be two actual
machines doing the work for me. Since I'm more of a SQL guy, what I can do now is take all this, still connected to my Ben BDC, and just say, "SELECT TOP 10 from my Parquet file, COVID_recovered." I'm querying a Parquet file on my storage pool as if it was a regular SQL table. I can take that data in another query and push it to my data pool, so I'm pushing this to DP, as in data pool, as COVID development. Again, I can just run a SELECT COUNT on this just like on a regular SQL table. So from a user perspective, it really doesn't matter.
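The plumbing behind that, reduced to a rough sketch: one external table over the Parquet files on the storage pool and one round-robin distributed table in the data pool, both created through the same master endpoint. SqlStoragePool and SqlDataPool are the cluster's built-in data sources; the connection details, schema, table, column, and path names here are assumptions rather than Ben's exact definitions:

```python
# Hypothetical sketch: connection details, schema, table, and path names are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)
conn.autocommit = True
cur = conn.cursor()

# External table over the Parquet files sitting on the storage pool (HDFS).
cur.execute("CREATE EXTERNAL FILE FORMAT parquet_ff WITH (FORMAT_TYPE = PARQUET);")
cur.execute("""
CREATE EXTERNAL TABLE dbo.covid_recovered (
    Province NVARCHAR(200), Country NVARCHAR(200),
    Lat FLOAT, Long FLOAT, [Date] DATETIME2, recovered BIGINT
)
WITH (DATA_SOURCE = SqlStoragePool, LOCATION = '/covid/covid_recovered',
      FILE_FORMAT = parquet_ff);
""")

# Round-robin distributed table in the data pool (assumes a dp schema exists),
# then cache the data into it with a plain INSERT ... SELECT.
cur.execute("""
CREATE EXTERNAL TABLE dp.covid_development (
    Country NVARCHAR(200), [Date] DATE, recovered BIGINT
)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);
""")
cur.execute("""
INSERT INTO dp.covid_development
SELECT Country, CAST([Date] AS DATE), SUM(recovered)
FROM dbo.covid_recovered
GROUP BY Country, CAST([Date] AS DATE);
""")
```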
I can even look under the covers. So if I do this, I can see I'm running that query at my data pool. What this will do is run the query at every single node; again, there are two of them. What you will see here is that it's not exactly evenly distributed, because the data pool doesn't do the round-robin distribution record by record, it does it in chunks. But as you can see, we're almost
equally distributed here. So again, when I'm
running this query, I will just have two nodes, each of them doing half the work, and therefore getting way better performance. Then I can do all of these kinds of queries against the data pool. Since I'm in Azure Data Studio, I can also make this a chart and say, "What's my COVID development
for Germany, for example?" But then, that's the really cool stuff. I can go ahead here and run a query that combines my data pool with my mapping table on my master instance and with a CSV. So I'm live querying a CSV against a table that is actually sitting on two distributed SQL Servers against a local table, and it all looks just like a regular SQL Server query, and it comes back in no time.
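A sketch of what such a cross-pool query could look like, with hypothetical table and column names standing in for Ben's actual schema:

```python
# Hypothetical sketch: connection string, table, and column names are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)

query = """
SELECT   c.Country,
         c.[Date],
         c.recovered,
         p.Population,
         1.0 * c.recovered / p.Population AS recovered_share
FROM     dp.covid_development AS c                 -- data pool (distributed)
JOIN     dbo.country_mapping  AS m                 -- master instance (local table)
         ON m.covid_country = c.Country
JOIN     dbo.population_csv   AS p                 -- external table over the CSV in HDFS
         ON p.Country = m.population_country
WHERE    c.Country = 'Germany'
ORDER BY c.[Date];
"""

for row in conn.cursor().execute(query):
    print(row)
```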
Still, for those who say, "Well, T-SQL is not really what I'm doing," I can also connect to the same endpoint and just do the same analysis and the same querying in Python. That's the beauty of BDC. You pick your language, you all talk to the same endpoint, and you can do it all in one tool, whichever way you like it.
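For example, a minimal sketch of pulling the same data into pandas over that one endpoint (connection string and table names are again assumptions, and plotting needs matplotlib installed):

```python
# Hypothetical sketch: connection string and table names are assumptions.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)

df = pd.read_sql(
    "SELECT Country, [Date], recovered FROM dp.covid_development "
    "WHERE Country = 'Germany' ORDER BY [Date];",
    conn,
)

print(df.tail())                       # the latest figures
df.plot(x="Date", y="recovered")       # quick chart, same idea as in Azure Data Studio
```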
But still, since this is a regular SQL Server, it could be exposed
to any kind of BI tool. I could connect to that using Excel, I could connect to that
using Power BI, or whatever, and just keep working and
wrangling with that data as if it was a regular SQL
Server, but it's not. >> Wow, Ben, that's
pretty powerful stuff. You've shown us how you are
getting data from GitHub, leveraging tools like Spark, using just T-SQL, and joining all
these different files together. You've also shown us how we can manage our file system in HDFS from Azure Data Studio. There's a lot of stuff going on here and a lot of capabilities. So thank you for sharing
all this with us. For our viewers, we'll
make sure we put a link to Ben's GitHub in the description. So we encourage you
to go check this out. This is super relevant data
especially in these times. Thank you so much, Ben, for being here. >> Thank you for having
me. It was awesome. >> Awesome. For our viewers, if you liked this video, please give it a like or leave a comment about what you're
doing with big data clusters. We'll see you next
time on Data Exposed. [MUSIC]