>> In this episode, you'll learn about big data clusters in a nutshell from Ben Weissman. Join us on this episode of
Data Exposed MVP edition. [MUSIC]. >> Hi. I'm Anna
Hoffman and welcome to this episode of Data
Exposed MVP edition. Today I'm joined by Ben Weissman, a data enthusiast from Germany. Ben, thanks for joining us. Can you tell us a little
bit more about what you do? >> Sure. Thank you for having me. Yeah, I'm running a
small consultancy firm here in Nuremberg,
Germany called Solisyon. We're mainly focused on data
warehousing, data integration. We've been doing this
for quite a while. I started back with SQL Server 6.5, so I think you were in elementary school at that point. >> Potentially. >> So yeah, Integration Services, all kinds of data warehousing
and then BI, the full stack. One part of that we're
currently looking at, and that I'm very passionate
about is big data clusters, a new technology that came
with SQL Server 2019. Yeah, I've loved it from the very beginning. >> Awesome. Yeah. I know from working with you throughout
the past few years, you've definitely been on
it since the beginning. So today, you're going to tell
us about it in a nutshell. I'm really interested to see how
you're going to pull this off. So I'm just going to
pass it over to you. >> All right. So a big data cluster, I think Travis Wright came up with that, it's not your grandpa's SQL Server. That's a pretty great description of it. So first of all, it only runs on Linux, which is amazing because
if we just think a couple of years back there
wasn't SQL on Linux at all, and now we've got a feature that only runs in containers and
only runs on Linux. So it all runs on the
Kubernetes platform. So it is a whole new approach to how SQL Server applications get deployed. So that's one thing that's
super exciting about it. I want to give you first a little nutshell introduction to the architecture of the whole thing, and then we're going to get
our hands on it and see how that actually works in real life. So the cool thing is, as complex as this looks, there is one guy in
that whole picture, which is the SQL Server master. That is just a regular
SQL Server 2019 with the only exception
that it runs on Linux, which may not be the typical thing. Behind that, there's a
couple of so-called pools. These pools are what makes big
data clusters so exciting. The first one where basically everything is funneled
through is the compute pool. What the compute pool does is scale out queries, so it boosts SQL Server queries to performance levels that haven't been there before. From there, it either goes to data that sits outside of SQL Server, that's called data virtualization. SQL Server is using PolyBase for that. But again, not your grandpa's PolyBase from SQL Server 2016. Now you can get data from Oracle, Teradata, MongoDB, other SQL Servers, all kinds of stuff, bring it into SQL Server, and query it just like local tables.
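For reference, here is a minimal sketch of what such a data virtualization setup could look like when scripted from Python against the master endpoint. The server, credential, connection, and table names are all hypothetical illustrations rather than part of Ben's demo, and the exact LOCATION format depends on the external source:

```python
# Hypothetical sketch: server, credential, and table names are made up.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=demo;UID=admin;PWD=<password>"
)
conn.autocommit = True
cursor = conn.cursor()

# External data source pointing at an Oracle server
# (assumes a database scoped credential named OracleCred already exists).
cursor.execute("""
CREATE EXTERNAL DATA SOURCE OracleSales
WITH (LOCATION = 'oracle://oracle-host:1521', CREDENTIAL = OracleCred);
""")

# External table mapped to a remote table; no data is copied into SQL Server.
# The LOCATION format varies by source type; check the PolyBase docs.
cursor.execute("""
CREATE EXTERNAL TABLE dbo.RemoteOrders (
    OrderID INT,
    Amount  DECIMAL(18, 2)
)
WITH (LOCATION = 'SALES.ORDERS', DATA_SOURCE = OracleSales);
""")

# From here on, the remote data queries just like a local table.
for row in cursor.execute("SELECT TOP 10 * FROM dbo.RemoteOrders;"):
    print(row)
```

Once the external table exists, any join, filter, or aggregation against it is just regular T-SQL; PolyBase handles pushing the work to the remote system where it can.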
Then there's the data pool up there. The idea of the data pool is, you have not one SQL Server but multiple. In my picture it's two, but
it could be four or eight. The idea is you take your fact data. So think of a big fact table
with 100 million rows. You push it to the data pool with two nodes, and you end up with two identical-looking tables, but each of them only holds 50 million rows. So it's getting split in half. The theory being, well, if it's only half the data, it should be twice as fast. That's the whole idea
of the data pool. That's also something
that has never been there in SQL Server in that way. The other component
is the storage pool. It combines HDFS, so a distributed file system, with
Spark and SQL all in one component. All that we see here when you deploy a big data cluster is being
deployed for you at once. So as complex as it looks, you're only deploying one product, and SQL Server is making sure
everything talks to each other, and you only talk to the master. So you only got one endpoint
that you need to talk to. The only exception to that is the
app pool, the application pool. We're not going to go
into detail with that because it does pretty much
everything and nothing. You could use it to deploy
machine learning models, AI models, stuff like that, Integration Services, jobs, so yeah. It is super powerful, but it doesn't do
much out-of-the-box. So what I would like to do today is, we will go to GitHub and get us some data around COVID-19, which is the reason why we're not sitting in the same studio today, but each of us is sitting at home. We'll get some COVID data from Johns Hopkins and also some population data, which is an open dataset, because that's another thing that
people regularly underestimate, in my opinion, that
there's so much data available out there freely
for everyone to use. We'll get that using Spark and Python, push it to the storage pool, and then to the data pool. These files coming from Johns Hopkins have one issue. They're super up-to-date, they're getting updated daily, but they're very wide, because what Johns Hopkins does is add a column for each
day instead of adding rows. So what we're going to do is, using
a very simple Python function, all in Azure Data Studio and all connected to our big data cluster, we will transform this
to long Parquet files. Parquet is basically like a clustered columnstore index for CSVs, it's just way faster. We're going to take these,
merge them together, and push them to the
data pool for caching. We need a local mapping table because the country names in that population CSV and the way they come from Johns Hopkins, unfortunately, are not identical. Then we're just going to
run a query across them. So we basically take one SQL query that combines data
from the storage pool, from the master instance, and from the data pool, all using regular SQL, because all that data, these CSVs, these Parquet files, and everything in the data pool, for me as an end user looks just like a regular SQL table. >> Awesome. Ben, this
has been great so far. I mean, you've really given us a wide range of capabilities that are available
on big data clusters. I'm really interested to see
how we're going to do all of this by just using SQL
and just connecting to, I think you said we're
just going to connect to the master instance for this. >> Correct. >> Awesome. Cool. Well,
let's take a look. >> All right. So let's jump
right into Azure Data Studio. For those that have not been working with Azure
Data Studio before, first of all, you really should
because it's an amazing tool. It's not taking your Management
Studio or anything away, but it has a lot of features that
Management Studio just doesn't have. One of these features is so-called notebooks. That's a super cool feature because notebooks combine the idea that I can put in text, I can put in graphics, and I can put in code, and they also save the results. So the first notebook
that I've prepared for you,
but it's all on GitHub. So if you want to try out that
demo for yourself, just get them. That first thing we'll just create the database
and everything for you including all the schemas that we're going to need and all the end points. So let me close this. But if you want to get started
and get your hands on this, just download it from GitHub
and run that first notebook and it will prepare and
set up everything for you. This is where it gets interesting. As you can see, Anna, we're still just
connected to my endpoint, to my Ben big data cluster here, which is the same endpoint that
we're going to see over and over again across all of these demos. This is where we're
getting the data from, and I'm just using Spark here, Spark and pandas actually. So I'm reading that CSV file from GitHub, filtering it on 2018 because that's the last year that has population data for every single country. Then I'm just saving this as a CSV file to my HDFS.
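As a rough sketch of that step (the real notebook is in Ben's GitHub repo, so the URL, column names, and HDFS path here are assumptions), the population cell might look something like this in a PySpark notebook:

```python
# Hypothetical sketch: the URL, column names, and HDFS path are assumptions.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # provided by the notebook kernel in practice

POP_URL = "https://example.com/world_population.csv"   # placeholder for the real dataset

pop = pd.read_csv(POP_URL)

# Keep only the 2018 figures, the last year with data for every single country.
pop_2018 = pop[pop["Year"] == 2018][["Country", "Population"]]

# Hand the pandas frame to Spark and write it to HDFS inside the big data cluster.
spark.createDataFrame(pop_2018) \
    .coalesce(1) \
    .write.mode("overwrite") \
    .option("header", True) \
    .csv("/covid/population")
```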
That's one of the other cool things here in Azure Data Studio, because Azure Data Studio actually exposes the HDFS right here for
a big data cluster. So I can basically use
Azure Data Studio as my file browser for my HDFS
within my big data cluster. So this is really my
one-stop shop for everything that's
happening in my BDC. So I just ran that script, which updated my CSV
file on my storage pool. The next one is a bit
more interesting. So let me just open this
on GitHub real quick. What Johns Hopkins does here, and this is from two hours
ago, on a daily basis, they just provide
CSVs globally and for the US for confirmed cases, deaths, and so on. Like I said before, you get it per country, including coordinates, and then they just add a column for every single day for each of these cases. While this might be convenient for them and an easy way
for them to provide it, it's not ideal for SQL. If you wanted to do this using traditional SQL, you would need to write some dynamic SQL and then unpivot it. So what we do instead is use pandas and Spark again. There's a very neat function here
in pandas, which is called melt. What melt does is basically
a dynamic unpivot. If we put that in this function here, I can just call it
for the three files that I'm downloading: recovered,
deaths, and confirmed. What this will do
is, it will download the three files, convert them all into Parquet files, and convert them from wide into long. It also gives me a little preview, just the first five rows of each of these files. So we see, okay, the data is as of June 24th, that's yesterday.
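For reference, a minimal sketch of that wide-to-long step; the file names follow the public Johns Hopkins CSSE repository layout, but treat the URLs, column names, and output paths as assumptions, since Ben's actual function is in his GitHub repo:

```python
# Hypothetical sketch: URLs, column names, and output paths are assumptions.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/")

def melt_jhu(url: str, value_name: str) -> pd.DataFrame:
    wide = pd.read_csv(url)
    wide = wide.rename(columns={"Province/State": "Province",
                                "Country/Region": "Country"})
    wide["Province"] = wide["Province"].fillna("")
    # melt() is the "dynamic unpivot": every date column becomes its own row.
    long = wide.melt(id_vars=["Province", "Country", "Lat", "Long"],
                     var_name="Date", value_name=value_name)
    long["Date"] = pd.to_datetime(long["Date"])
    return long

for name in ["confirmed", "deaths", "recovered"]:
    df = melt_jhu(BASE + f"time_series_covid19_{name}_global.csv", name)
    print(df.head())                            # preview: the first five rows
    # Save the long version as Parquet on the storage pool (HDFS).
    spark.createDataFrame(df).write.mode("overwrite") \
        .parquet(f"/covid/covid_{name}")
```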
If we expand our HDFS, you will see the CSV and the Parquet files. I'm usually saving both CSV and Parquet. Even though I'm only going to be using the Parquet in this case, CSV is just easier to read because it's plain text, it's uncompressed and all that. What you can already see here is that every single file is divided into two parts because I'm running
two nodes on my storage pool, so there are going to be two actual
machines doing the work for me. Since I'm more of a SQL guy, what I can do now is take all this, still connected to my Ben BDC, and just say, "SELECT TOP 10 from my Parquet file, COVID_recovered." I'm querying a Parquet file on my storage pool as if it was a regular SQL table. I can take that data in another query and push it to my data pool, so I'm pushing this to DP, as in data pool, as COVID development. Again, I can just run a SELECT COUNT on this just like on a regular SQL table. So from a user perspective, it really doesn't matter.
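The plumbing behind that, reduced to a rough sketch: one external table over the Parquet files on the storage pool and one round-robin distributed table in the data pool, both created through the same master endpoint. SqlStoragePool and SqlDataPool are the cluster's built-in data sources; the connection details, schema, table, column, and path names here are assumptions rather than Ben's exact definitions:

```python
# Hypothetical sketch: connection details, schema, table, and path names are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)
conn.autocommit = True
cur = conn.cursor()

# External table over the Parquet files sitting on the storage pool (HDFS).
cur.execute("CREATE EXTERNAL FILE FORMAT parquet_ff WITH (FORMAT_TYPE = PARQUET);")
cur.execute("""
CREATE EXTERNAL TABLE dbo.covid_recovered (
    Province NVARCHAR(200), Country NVARCHAR(200),
    Lat FLOAT, Long FLOAT, [Date] DATETIME2, recovered BIGINT
)
WITH (DATA_SOURCE = SqlStoragePool, LOCATION = '/covid/covid_recovered',
      FILE_FORMAT = parquet_ff);
""")

# Round-robin distributed table in the data pool (assumes a dp schema exists),
# then cache the data into it with a plain INSERT ... SELECT.
cur.execute("""
CREATE EXTERNAL TABLE dp.covid_development (
    Country NVARCHAR(200), [Date] DATE, recovered BIGINT
)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);
""")
cur.execute("""
INSERT INTO dp.covid_development
SELECT Country, CAST([Date] AS DATE), SUM(recovered)
FROM dbo.covid_recovered
GROUP BY Country, CAST([Date] AS DATE);
""")
```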
I can even look under the covers. So if I do this, I can see I'm running that query at my data pool. What this will do is run the query at every single node; again, there are two of them. What you will see here is that it's not exactly evenly distributed, because the data pool doesn't do the round-robin distribution record by record, it does it in chunks. But as you can see, we're almost
equally distributed here. So again, when I'm
running this query, I will just have two nodes, each of them doing half the work, and therefore getting way better performance. Then I can do all of these kinds of queries against the data pool. Since I'm in Azure Data Studio, I can also make this a chart and say, "What's my COVID development
for Germany, for example?" But then, that's the really cool stuff. I can go ahead here and run a query that combines my data pool with my mapping table on my master instance and with a CSV. So I'm live querying a CSV against a table that is actually sitting on two distributed SQL Servers against a local table, and it all looks just like a regular SQL Server query, and it comes back in no time.
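A sketch of what such a cross-pool query could look like, with hypothetical table and column names standing in for Ben's actual schema:

```python
# Hypothetical sketch: connection string, table, and column names are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)

query = """
SELECT   c.Country,
         c.[Date],
         c.recovered,
         p.Population,
         1.0 * c.recovered / p.Population AS recovered_share
FROM     dp.covid_development AS c                 -- data pool (distributed)
JOIN     dbo.country_mapping  AS m                 -- master instance (local table)
         ON m.covid_country = c.Country
JOIN     dbo.population_csv   AS p                 -- external table over the CSV in HDFS
         ON p.Country = m.population_country
WHERE    c.Country = 'Germany'
ORDER BY c.[Date];
"""

for row in conn.cursor().execute(query):
    print(row)
```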
Still, for those who say, "Well, T-SQL is not really what I'm doing," I can also connect to the same endpoint and just do the same analysis and the same querying in Python. That's the beauty of BDC. You pick your language, you all talk to the same endpoint, and you can do it all in one tool, whichever way you like it.
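For example, a minimal sketch of pulling the same data into pandas over that one endpoint (connection string and table names are again assumptions, and plotting needs matplotlib installed):

```python
# Hypothetical sketch: connection string and table names are assumptions.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bdc-master.example.com,31433;DATABASE=covid;UID=admin;PWD=<password>"
)

df = pd.read_sql(
    "SELECT Country, [Date], recovered FROM dp.covid_development "
    "WHERE Country = 'Germany' ORDER BY [Date];",
    conn,
)

print(df.tail())                       # the latest figures
df.plot(x="Date", y="recovered")       # quick chart, same idea as in Azure Data Studio
```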
But still, since this is a regular SQL Server, it could be exposed
to any kind of BI tool. I could connect to that using Excel, I could connect to that
using Power BI, or whatever, and just keep working and
wrangling with that data as if it was a regular SQL
Server, but it's not. >> Wow, Ben, that's
pretty powerful stuff. You've shown us how you are
getting data from GitHub, leveraging tools like Spark, using just T-SQL, and joining all
these different files together. You've also shown us how we can manage our file system in HDFS from Azure Data Studio. There's a lot of stuff going on here and a lot of capabilities. So thank you for sharing
all this with us. For our viewers, we'll
make sure we put a link to Ben's GitHub in the description. So we encourage you
to go check this out. This is super relevant data
especially in these times. Thank you so much, Ben, for being here. >> Thank you for having
me. It was awesome. >> Awesome. For our viewers, if you liked this video, please give it a like or leave a comment about what you're
doing with big data clusters. We'll see you next
time on Data Exposed. [MUSIC]