(upbeat music) - Hello, and welcome back
to Microsoft Mechanics Live! Coming up next, we're gonna take a look at the next leap forward in SQL Server, and we're gonna show you
big data capabilities that are now built into SQL
with HDFS and Apache Spark, as well as our new approach
for data virtualization that allows you to easily integrate data across multiple data sources without even needing to
move it or perform ETL. So, very cool stuff. And to join me, I've got Travis Wright from the SQL Server team. Give a big, warm welcome.
- Hey. (audience applauds) Thanks, Jeremy. Glad to be back on the show. - So we've actually done a lot in terms of advances
recently with SQL Server, all the way from the
platform abstraction layer, the PAL, as I like to call it, that enables you to run
SQL Server on Linux, to the machine learning that's now built into the box in SQL Server. But with all the data that's
out there in big data silos, adding these capabilities
that we're doing now in terms of the integration
and the data virtualization to SQL Server is, in a lot of ways, way more useful and an even bigger milestone for a lot of people. Can you tell us more about that?
- Yeah, we know that integrating data across all these different
data silos is hard. The amount of data out there is just enormous, so we really started our
journey on data integration back with SQL Server 2016, when we introduced the PolyBase feature, and that allows you to
issue a query to SQL Server and SQL Server will turn around and then go and query data out of HDFS in like a Cloudera or
Hortonworks cluster, for example. And now, with SQL Server 2019, we're making it even easier for you to integrate all of your data without having to move it around. And we have a brand new architecture that's designed to give
you huge improvements in performance and scalability through caching and parallel processing, as well as having an elastically scalable compute and storage layer for SQL Server.
- Very cool. Why don't we explain how this
all works architecturally? - All right, sounds good. We do this in two ways. First, we're deeply integrating the SQL Server parallel processing engine with big data components from the open source
community like Spark and HDFS that are now included in
the box with SQL Server. They are a part of SQL Server. That was the big announcement this week. And then secondly, we're going to provide data
virtualization connectors for common data sources like
Oracle, Teradata, MongoDB, SQL Server, and a generic ODBC connector to make it even easier
for you to integrate all of these data sources
through SQL Server without having to move your
data around through ETL. - Which is really great 'cause big data is now part of SQL Server. Now you don't have to move your data or create separate reporting databases. You can keep it all in place. - Right. So instead of moving your
data, transforming it, and creating duplicate data
and all that kind of stuff, we've really focused our efforts around data integration
on data virtualization now instead of data movement, and that allows you to
keep your data in place and yet, still be able to access it from a single point with SQL Server. And the nice thing here is
that we can really boost the performance of these queries by caching that data and partitioning it across multiple SQL Server instances that really kind of make
those queries run really fast. And you can even combine data from multiple data sources into this cache, so you can have a join, for example, that goes across data that's coming into that cache from Oracle, from another SQL Server, and from HDFS.
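To make that concrete, here's a minimal sketch of what such a cross-source query can look like once the external tables exist. The table and column names (dbo.inventory over Oracle, dbo.web_clickstreams over HDFS, dbo.sales as a local table, and the item_id, item_name, and sale_amount columns) are hypothetical placeholders, not the demo's actual schema:

```sql
-- Hypothetical cross-source join: dbo.inventory is an external table over Oracle,
-- dbo.web_clickstreams is an external table over CSV files in HDFS, and dbo.sales
-- is an ordinary local table. PolyBase lets a single T-SQL query span all three.
SELECT TOP (10)
    i.item_name,
    COUNT_BIG(*)       AS click_count,
    SUM(s.sale_amount) AS total_sales
FROM dbo.web_clickstreams AS c
JOIN dbo.inventory        AS i ON c.item_id = i.item_id   -- virtualized from Oracle
JOIN dbo.sales            AS s ON s.item_id = i.item_id   -- local SQL Server data
GROUP BY i.item_name
ORDER BY click_count DESC;
```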
- So it's way more secure. You're not making another persistent copy of that data. There's less maintenance, fewer ETL jobs; you don't have to do ETL in this case. It's more efficient storage 'cause you're not storing it twice. And the data's queryable live against the virtual cache, so it's not gonna go stale or out of date. It's gonna stay accurate, right?
- Right, exactly. A lot of benefits there. So, let's make this real
by doing a demo here. I'm gonna show you being able to query across multiple data sources here. So I'm running here inside
of the new Azure Data Studio, which was formerly known as SQL Operations Studio. It's our cross-platform database tool that runs on macOS, Windows, and Linux, and it's open source. And here, I'm connected to an HPE lab where we have an HPE cluster of machines that has the SQL Server 2019
big data cluster running on it. We're gonna go to the Sales database here and I'm just gonna right-click on that. I'm gonna click Create External Table. This is gonna bring up a
new wizard that we have that makes it easy for you to do this. I'm just gonna select
the Oracle type here. We'll have more types over here over time. And then I can click on Next here. I already have a database master key, so I'm gonna skip that. I'm gonna give this a name. We're gonna pull in the Oracle Inventory database from Oracle. I'm gonna grab the server name. We're gonna go into the XE. We're gonna create a
database credential here. We're gonna go to SYSTEM and admin123. Now, what this does is it actually reaches
out to that Oracle server and it discovers the schema
inside of that database and I can then select the table that we wanna synchronize over. So, we just click on this
inventory table here, for example. We can even preview the table. We can see the schema of that table here, we can do some column mapping if we wanted to, and we can maybe even change the name of the table that we wanna bring into SQL Server. Click Next, click Create, and now we have an external table that shows up inside of
our Tables list here, and we can now go and query that table.
- [Jeremy] And there we go, very cool, very cool.
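Under the covers, the wizard is generating T-SQL roughly like the following. This is a hedged sketch based on SQL Server 2019 PolyBase syntax; the server address and the column list are illustrative placeholders rather than the exact objects from the demo:

```sql
-- A database master key must already exist (the demo skipped this step).
-- The credential holds the Oracle login used for the connection.
CREATE DATABASE SCOPED CREDENTIAL oracle_inventory_cred
    WITH IDENTITY = 'SYSTEM', SECRET = 'admin123';

-- The external data source points at the Oracle server.
CREATE EXTERNAL DATA SOURCE oracle_inventory
    WITH (LOCATION = 'oracle://oracle-server-name:1521',
          CREDENTIAL = oracle_inventory_cred);

-- The external table maps the remote Oracle table into the Sales database.
-- Column names and types here are illustrative, not the real Oracle schema.
CREATE EXTERNAL TABLE dbo.inventory
(
    item_id   INT,
    item_name NVARCHAR(100),
    quantity  INT
)
WITH (LOCATION = 'XE.SYSTEM.INVENTORY', DATA_SOURCE = oracle_inventory);

-- From here on, it queries like any other table.
SELECT TOP (10) * FROM dbo.inventory;
```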
- Let's move on to creating an external table over HDFS.
- Okay, let's do it.
- Let's do that. I've got a query here that we can run. And the nice thing here is that we have a built-in external data source called the SQL storage pool, and what this does is it allows us to query the data inside of HDFS using a SQL Server instance that's actually co-located
on every HDFS DataNode. It's just built-in. You don't
have to do anything special. You just define the read schema that you want us to apply to the files that sit in a particular
directory or to a particular file. So now, we can create this table here. Connect to our SQL here. Once we have that table created, we can then just query over that table just like we would any other type of table and right now, what we're doing is we're actually reaching down into HDFS where we have millions of
records in this CSV file that sit inside of this directory and we're bringing it
back up to SQL Server, just like you query any other
type of table in SQL Server.
- And there it is.
- There it is.
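The T-SQL behind that demo looks roughly like this. It's a sketch assuming the built-in SqlStoragePool data source of a SQL Server 2019 big data cluster; the HDFS directory path, file format options, and column list are illustrative placeholders:

```sql
-- Describe how the CSV files should be parsed.
CREATE EXTERNAL FILE FORMAT csv_file
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',',
                          STRING_DELIMITER = '"',
                          FIRST_ROW = 2));

-- The read schema applied over every CSV file in the /web_logs HDFS directory.
-- SqlStoragePool is the storage-pool data source inside the big data cluster.
CREATE EXTERNAL TABLE dbo.web_clickstreams
(
    click_date DATE,
    user_id    BIGINT,
    item_id    INT,
    page_url   NVARCHAR(400)
)
WITH (DATA_SOURCE = SqlStoragePool,
      LOCATION = '/web_logs',
      FILE_FORMAT = csv_file);

-- Millions of rows sitting in HDFS now come back like any other table,
-- and they can be joined directly against the Oracle external table above.
SELECT TOP (100) c.*
FROM dbo.web_clickstreams AS c
JOIN dbo.inventory AS i ON c.item_id = i.item_id;
```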
- Very cool. So, I can see how this would be hugely valuable, especially when you're dealing with compliance requirements. You don't wanna move the data outside of that Oracle database or, in this case, you wouldn't wanna move any of the data at all. You'd be able to query it here; it effectively looks local to SQL. And the nice thing is you mentioned that Spark and HDFS, they're really in the box. What does it look like though, and how does that part of it work?
- In that demo we just
saw of querying HDFS, that's part of SQL Server. You can also query that data and manipulate it using the Spark engine that's also co-located on that DataNode. So now, you have sort
of a shared data lake between SQL and Spark where you have all that
data sitting in HDFS. So now, what I wanna show you is another demo where we're going to use notebooks in Azure Data Studio to interact with the Spark runtime.
- Okay, let's do it.
- So first off, I wanna show
you a notebook experience. It's built into Azure Data Studio here, and this is where I can come in and I can write some code
in Python or Scala or R and I can execute that either locally here on my machine using the Python 3 kernel underneath the covers or I can submit jobs into
Spark to have it run there. Now, what I'm showing you here is I have this file called Artists.txt. It's a CSV file, but it's got some pretty
crazy data structures here. It's got a bunch of stuff up here that defines columns, but in a weird way. There's a bunch of whitespace.
- [Jeremy] So it's nearly unstructured even though it's a CSV.
- [Travis] Yeah, I mean, this is a mess, right?
- [Jeremy] Yep.
- [Travis] We've got this research technology coming out of Microsoft Research called PROSE that allows you to basically
make sense of this data by running some code against it and it will generate code,
which you can then run against those types of files and generate a schematized version of it. So to do this, I'm just gonna
run this PROSE library here. This is included inside of the Azure Data Studio notebook experience and inside of the Spark
runtime and inside of SQL 2019. It generates this Python code for me that when I go down here and I run this, it will load the data from
that file into a data frame and, what did I do wrong here? Oh, I missed the pandas at the top. - [Jeremy] Maybe the top line? You gotta have the panda. - [Travis] Gotta have the panda. - [Jeremy] Otherwise,
the panda will be sad. - [Travis] There we go, include the panda. There we go.
- There it is. - So, the data is now structured. We can actually make sense of it and this makes it easier
for SQL Server to read it. Now, the cool thing is I can
take this same Python code, go over to the Spark runtime that's running on top of HDFS, and start to apply this to
files that are over there. So here's an HDFS browse experience where I can actually browse the HDFS sitting inside of the
SQL 2019 big data cluster and we can see that I've got that Artists.txt
file already there. Now, let's physically
go and create a function that will be available
to us now inside of Spark and we can run the same thing over that Artists.txt file
inside of the Spark cluster and you can see that it now
schematizes all of this. And now I can, if I wanted to, save those back as a CSV file that's in a nice, beautiful
format, for example. So this is how you can
use something like Spark to do all of your data
preparation tasks at scale across all of your Spark infrastructure.
- Much, much better. Now that we've prepped the data and we've got it all into a good format, can we analyze it?
- Yeah, definitely. That's the whole point of this, right? So, we wanna get to where we can analyze it. And this is where you wanna
be able to combine together kind of the different data
sets from Oracle, from HDFS, and then feed that into your
machine learning services that are built in to SQL Server where you can run Python or R
scripts to train your models. You can store that model
back inside of SQL Server, to operationalize it, and then just as part
of a stored procedure, you can actually score your data against that model, maybe as part of a transaction or a batch processing job. So, let's take a look
at another demo here. I've got an eCommerce site scenario here that we're gonna look at and we're going to use
machine learning services to predict a pattern of behavior from the clickstream data of somebody who visits our eCommerce site. So, let's pivot over and take a look at that now.
So on this one, we've got another script here that we've already created ahead of time. Let's bring this up now. First of all, we're going to kick off the model training. When I kick off the model training, this is gonna call this stored procedure here called train book category visitor. It's gonna produce a model for me. You can see the output from the production of that model.
Well, let's go take a look at what that actually looks like here. So, what this stored procedure does is it defines a SQL query that will go and get some training data out of my database. And then it will call some R script that will be doing some featurization of this using a logistic regression algorithm. And then, once we run this sp_execute_external_script right here, it will execute this R code against the training data we're sending in and will produce a model, which we then store inside of the database by inserting it into the sales_models table here.
Once we've done that, we can then go back out here and we can do our predictions against this data by using this stored procedure here. You'll see that based upon all the clickstream behavior that we have sitting in HDFS in raw files, we pull that up, we score it against the model, and we can actually predict, based upon somebody's clickstream behavior, whether or not we should suggest to them to go visit the book category, for example. So this allows us to bring together data from our inventory system in Oracle and our HDFS data that has the clickstream data in it in raw format, in CSV files, and then feed that into machine learning services inside of SQL Server, produce that model, and now we can score user behavior against it.
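To give a feel for the pattern those two stored procedures follow, here's a condensed, hypothetical sketch: sp_execute_external_script trains a logistic regression model in R, the serialized model is inserted into a sales_models table, and a second procedure loads it back and scores incoming clickstream rows. The table names, columns, and R code are placeholders, not the demo's actual scripts:

```sql
-- Training: run R inside SQL Server, then persist the serialized model.
CREATE OR ALTER PROCEDURE dbo.train_book_category_visitor
AS
BEGIN
    DECLARE @model VARBINARY(MAX);

    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            # logistic regression over the featurized clickstream training data
            fit <- glm(visited_book_category ~ ., data = InputDataSet, family = binomial)
            trained_model <- as.raw(serialize(fit, connection = NULL))',
        @input_data_1 = N'SELECT * FROM dbo.clickstream_training_data',
        @params = N'@trained_model VARBINARY(MAX) OUTPUT',
        @trained_model = @model OUTPUT;

    INSERT INTO dbo.sales_models (model_name, model)
    VALUES (N'book_category_visitor', @model);
END;
GO

-- Scoring: load the stored model and predict over new clickstream rows.
CREATE OR ALTER PROCEDURE dbo.predict_book_category_visitor
AS
BEGIN
    DECLARE @model VARBINARY(MAX) =
        (SELECT TOP (1) model
         FROM dbo.sales_models
         WHERE model_name = N'book_category_visitor');

    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            fit <- unserialize(raw_model)
            OutputDataSet <- data.frame(
                InputDataSet,
                suggest_book_category = predict(fit, InputDataSet, type = "response") > 0.5)',
        @input_data_1 = N'SELECT * FROM dbo.web_clickstreams',
        @params = N'@raw_model VARBINARY(MAX)',
        @raw_model = @model;
END;
```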
- Very cool, so we've seen huge architectural advances here in terms of this direction for SQL, being able to really parse and analyze big data without having to move it or do a lot of ETL work against it. This is really huge.
- Yeah, it's big. (speakers laughing) Sorry for the pun, but yeah,
it's big, this is huge, and this is really like
a different dimension for SQL Server to scale in, where you get this caching ability, you get elastic scalability of your performance, with tiered compute and storage scalability, all built on top of this
native architecture. So it's really, if you think about it, SQL Server is not just a
database really anymore. It's really more of an integration and complete analytics platform that you can use to integrate all of this data together.
- Right. Again, when we think about cross-platform, it's not only about operating
SQL inside of Windows or Linux or multiple operating system types, but also against other databases. So, if somebody wants to learn more about these capabilities and really try it and go hands-on for themselves and connect to Oracle and
start using Spark and HDFS, where do they learn more? What should they actually be doing right now to get hands-on?
- Yeah, exciting news. We announced the SQL Server
2019 preview this week. You can go and get that for SQL
Server on Windows and Linux. And if you're interested in the
big data cluster part of it, we encourage you to sign up for this early adoption
program at the link here where we can then assign you to work with the program
manager on our team and then we can really
get hands-on with you as you try to evaluate this big data cluster technology and we can learn from you, kind
of how that experience goes. - Awesome, great to have you
on the show again, Travis. Thank you. Thanks for coming, and that's all the time
we have for today's show. - Thanks, everybody.
- Goodbye for now. (audience applauds)
(upbeat music)