(upbeat music) - Hello, and welcome back
to Microsoft Mechanics Live! Coming up next, we're gonna take a look at the next leap forward in SQL Server, and we're gonna show you
big data capabilities that are now built into SQL
with HDFS and Apache Spark, as well as our new approach
for data virtualization that allows you to easily integrate data across multiple data sources without even needing to
move it or perform ETL. So, very cool stuff. And to join me, I've got Travis Wright from the SQL Server team. Give a big, warm welcome.
- Hey. (audience applauds) Thanks, Jeremy. Glad to be back on the show. - So we've actually done a lot in terms of advances
recently with SQL Server, all the way from the
platform abstraction layer, the PAL, as I like to call it, that enables you to run
SQL Server on Linux, to the machine learning that's now built into the box in SQL Server. But with all the data that's
out there in big data silos, adding these capabilities
that we're doing now in terms of the integration
and the data virtualization to SQL Server is, in a lot of ways, way more useful and an even bigger milestone for a lot of people. Can you tell us more about that?
- Yeah, we know that integrating data across all these different
data silos is hard. The amount of data out there is just enormous, so we really started our
journey on data integration back with SQL Server 2016, when we introduced the PolyBase feature, and that allows you to
issue a query to SQL Server and SQL Server will turn around and then go and query data out of HDFS in like a Cloudera or
Hortonworks cluster, for example. And now, with SQL Server 2019, we're making it even easier for you to integrate all of your data without having to move it around. And we have a brand new architecture that's designed to give
you huge improvements in performance and scalability through caching and parallel processing, as well as having an elastically scalable compute and storage layer for SQL Server.
- Very cool. Why don't we explain how this
all works architecturally? - All right, sounds good. We do this in two ways. First, we're deeply integrating the SQL Server parallel processing engine with big data components from the open source
community like Spark and HDFS that are now included in
the box with SQL Server. They are a part of SQL Server. That was the big announcement this week. And then secondly, we're going to provide data
virtualization connectors for common data sources like
Oracle, Teradata, MongoDB, SQL Server, and a generic ODBC connector to make it even easier
for you to integrate all of these data sources
through SQL Server without having to move your
data around through ETL. - Which is really great 'cause big data is now part of SQL Server. Now you don't have to move your data or create separate reporting databases. You can keep it all in place. - Right. So instead of moving your
data, transforming it, and creating duplicate data
and all that kind of stuff, we've really focused our efforts around data integration
on data virtualization now instead of data movement, and that allows you to
keep your data in place and yet, still be able to access it from a single point with SQL Server. And the nice thing here is
that we can really boost the performance of these queries by caching that data and partitioning it across multiple SQL Server instances that really kind of make
those queries run really fast. And you can even combine data from multiple data sources into this cache, so you can have a join, for example, that goes across data that's coming into that cache from Oracle, from another SQL Server, and from HDFS.
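To make that concrete, here's a minimal sketch of what such a cross-source query can look like once the external tables exist. The table and column names (dbo.inventory over Oracle, dbo.web_clickstreams over HDFS, dbo.sales as a local table, and the item_id, item_name, and sale_amount columns) are hypothetical placeholders, not the demo's actual schema:

```sql
-- Hypothetical cross-source join: dbo.inventory is an external table over Oracle,
-- dbo.web_clickstreams is an external table over CSV files in HDFS, and dbo.sales
-- is an ordinary local table. PolyBase lets a single T-SQL query span all three.
SELECT TOP (10)
    i.item_name,
    COUNT_BIG(*)       AS click_count,
    SUM(s.sale_amount) AS total_sales
FROM dbo.web_clickstreams AS c
JOIN dbo.inventory        AS i ON c.item_id = i.item_id   -- virtualized from Oracle
JOIN dbo.sales            AS s ON s.item_id = i.item_id   -- local SQL Server data
GROUP BY i.item_name
ORDER BY click_count DESC;
```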
- So it's way more secure. You're not making another persistent copy of that data. There's less maintenance, fewer ETL jobs; you don't have to do ETL in this case. It's more efficient storage 'cause you're not storing it twice. And the data's queryable live against the virtual cache, so it's not gonna go stale or out of date. It's gonna stay accurate, right?
- Right, exactly. A lot of benefits there. So, let's make this real
by doing a demo here. I'm gonna show you being able to query across multiple data sources here. So I'm running here inside
of the new Azure Data Studio, which was formerly known as SQL Operations Studio. It's our cross-platform database tool that runs on macOS, Windows, and Linux, and it's open source. And here, I'm connected to an HPE lab where we have an HPE cluster of machines that has the SQL Server 2019
big data cluster running on it. We're gonna go to the Sales database here and I'm just gonna right-click on that. I'm gonna click Create External Table. This is gonna bring up a
new wizard that we have that makes it easy for you to do this. I'm just gonna select
the Oracle type here. We'll have more types over here over time. And then I can click on Next here. I already have a database master key, so I'm gonna skip that. I'm gonna give this a name. We're gonna pull in the Oracle Inventory database from Oracle. I'm gonna grab the server name. We're gonna go into the XE. We're gonna create a
database credential here. We're gonna go to SYSTEM and admin123. Now, what this does is it actually reaches
out to that Oracle server and it discovers the schema
inside of that database and I can then select the table that we wanna synchronize over. So, we just click on this
inventory table here, for example. We can even preview the table. We can see the schema of that table here, we can do some column mapping if we wanted to, and we can maybe even change the name of the table that we wanna bring into SQL Server. Click Next, click Create, and now we have an external table that shows up inside of
our Tables list here, and we can now go and query that table.
- [Jeremy] And there we go, very cool, very cool.
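Under the covers, the wizard is generating T-SQL roughly like the following. This is a hedged sketch based on SQL Server 2019 PolyBase syntax; the server address and the column list are illustrative placeholders rather than the exact objects from the demo:

```sql
-- A database master key must already exist (the demo skipped this step).
-- The credential holds the Oracle login used for the connection.
CREATE DATABASE SCOPED CREDENTIAL oracle_inventory_cred
    WITH IDENTITY = 'SYSTEM', SECRET = 'admin123';

-- The external data source points at the Oracle server.
CREATE EXTERNAL DATA SOURCE oracle_inventory
    WITH (LOCATION = 'oracle://oracle-server-name:1521',
          CREDENTIAL = oracle_inventory_cred);

-- The external table maps the remote Oracle table into the Sales database.
-- Column names and types here are illustrative, not the real Oracle schema.
CREATE EXTERNAL TABLE dbo.inventory
(
    item_id   INT,
    item_name NVARCHAR(100),
    quantity  INT
)
WITH (LOCATION = 'XE.SYSTEM.INVENTORY', DATA_SOURCE = oracle_inventory);

-- From here on, it queries like any other table.
SELECT TOP (10) * FROM dbo.inventory;
```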
- Let's move on to creating an external table over HDFS.
- Okay, let's do it.
- Let's do that. I've got a query here that we can run. And the nice thing here is that we have a built-in external data source called the SQL storage pool, and what this does is it allows us to query the data inside of HDFS using a SQL Server instance that's actually co-located
on every HDFS DataNode. It's just built-in. You don't
have to do anything special. You just define the read schema that you want us to apply to the files that sit in a particular
directory or to a particular file. So now, we can create this table here. Connect to our SQL here. Once we have that table created, we can then just query over that table just like we would any other type of table and right now, what we're doing is we're actually reaching down into HDFS where we have millions of
records in this CSV file that sit inside of this directory and we're bringing it
back up to SQL Server, just like you query any other
type of table in SQL Server.
- And there it is.
- There it is.
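The T-SQL behind that demo looks roughly like this. It's a sketch assuming the built-in SqlStoragePool data source of a SQL Server 2019 big data cluster; the HDFS directory path, file format options, and column list are illustrative placeholders:

```sql
-- Describe how the CSV files should be parsed.
CREATE EXTERNAL FILE FORMAT csv_file
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',',
                          STRING_DELIMITER = '"',
                          FIRST_ROW = 2));

-- The read schema applied over every CSV file in the /web_logs HDFS directory.
-- SqlStoragePool is the storage-pool data source inside the big data cluster.
CREATE EXTERNAL TABLE dbo.web_clickstreams
(
    click_date DATE,
    user_id    BIGINT,
    item_id    INT,
    page_url   NVARCHAR(400)
)
WITH (DATA_SOURCE = SqlStoragePool,
      LOCATION = '/web_logs',
      FILE_FORMAT = csv_file);

-- Millions of rows sitting in HDFS now come back like any other table,
-- and they can be joined directly against the Oracle external table above.
SELECT TOP (100) c.*
FROM dbo.web_clickstreams AS c
JOIN dbo.inventory AS i ON c.item_id = i.item_id;
```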
- Very cool. So, I can see how this would be hugely valuable, especially when you're dealing with compliance requirements. You don't wanna move the data outside of that Oracle database or, in this case, you wouldn't wanna move any of the data at all. You'd be able to query it here; it effectively looks local to SQL. And the nice thing is you mentioned that Spark and HDFS, they're really in the box. What does it look like though, and how does that part of it work?
- In that demo we just
saw of querying HDFS, that's part of SQL Server. You can also query that data and manipulate it using the Spark engine that's also co-located on that DataNode. So now, you have sort
of a shared data lake between SQL and Spark where you have all that
data sitting in HDFS. So now, what I wanna show you is another demo where we're going to use notebooks in Azure Data Studio to interact with the Spark runtime.
- Okay, let's do it.
- So first off, I wanna show
you a notebook experience. It's built into Azure Data Studio here, and this is where I can come in and I can write some code
in Python or Scala or R and I can execute that either locally here on my machine using the Python 3 kernel underneath the covers or I can submit jobs into
Spark to have it run there. Now, what I'm showing you here is I have this file called Artists.txt. It's a CSV file, but it's got some pretty
crazy data structures here. It's got a bunch of stuff up here that defines columns, but in a weird way. There's a bunch of whitespace.
- [Jeremy] So it's nearly unstructured even though it's a CSV.
- [Travis] Yeah, I mean, this is a mess, right?
- [Jeremy] Yep.
- [Travis] We've got this research technology coming out of Microsoft Research called PROSE that allows you to basically
make sense of this data by running some code against it and it will generate code,
which you can then run against those types of files and generate a schematized version of it. So to do this, I'm just gonna
run this PROSE library here. This is included inside of the Azure Data Studio notebook experience and inside of the Spark
runtime and inside of SQL 2019. It generates this Python code for me that when I go down here and I run this, it will load the data from
that file into a data frame and, what did I do wrong here? Oh, I missed the pandas at the top. - [Jeremy] Maybe the top line? You gotta have the panda. - [Travis] Gotta have the panda. - [Jeremy] Otherwise,
the panda will be sad. - [Travis] There we go, include the panda. There we go.
- There it is. - So, the data is now structured. We can actually make sense of it and this makes it easier
for SQL Server to read it. Now, the cool thing is I can
take this same Python code, go over to the Spark runtime that's running on top of HDFS, and start to apply this to
files that are over there. So here's an HDFS browse experience where I can actually browse the HDFS sitting inside of the
SQL 2019 big data cluster and we can see that I've got that Artists.txt
file already there. Now, let's physically
go and create a function that will be available
to us now inside of Spark and we can run the same thing over that Artists.txt file
inside of the Spark cluster and you can see that it now
schematizes all of this. And now I can, if I wanted to, save those back as a CSV file that's in a nice, beautiful
format, for example. So this is how you can
use something like Spark to do all of your data
preparation tasks at scale across all of your Spark infrastructure.
- Much, much better. Now that we've prepped the data and we've got it all into a good format, can we analyze it?
- Yeah, definitely. That's the whole point of this, right? So, we wanna get to where we can analyze it. And this is where you wanna
be able to combine together kind of the different data
sets from Oracle, from HDFS, and then feed that into your
machine learning services that are built in to SQL Server where you can run Python or R
scripts to train your models. You can store that model
back inside of SQL Server, to operationalize it, and then just as part
of a stored procedure, you can actually score your data against that model, maybe as part of a transaction or a batch processing job. So, let's take a look
at another demo here. I've got an eCommerce site scenario here that we're gonna look at and we're going to use
machine learning services to predict a pattern of behavior from the clickstream data of somebody who visits our eCommerce site. So, let's pivot over and take a look at that now.
So on this one, we've got another script here that we've already created ahead of time. Let's bring this up now. First of all, we're going to kick off the model training. When I kick off the model training, this is gonna call this stored procedure here called train book category visitor. It's gonna produce a model for me. You can see the output from the production of that model.
Well, let's go take a look at what that actually looks like here. So, what this stored procedure does is it defines a SQL query that will go and get some training data out of my database. And then it will call some R script that will be doing some featurization of this using a logistic regression algorithm. And then, once we run this sp_execute_external_script right here, it will execute this R code against the training data we're sending in and will produce a model, which we then store inside of the database by inserting it into the sales_models table here.
Once we've done that, we can then go back out here and we can do our predictions against this data by using this stored procedure here. You'll see that based upon all the clickstream behavior that we have sitting in HDFS in raw files, we pull that up, we score it against the model, and we can actually predict, based upon somebody's clickstream behavior, whether or not we should suggest to them to go visit the book category, for example. So this allows us to bring together data from our inventory system in Oracle and our HDFS data that has the clickstream data in it in raw format, in CSV files, and then feed that into machine learning services inside of SQL Server, produce that model, and now we can score user behavior against it.
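To give a feel for the pattern those two stored procedures follow, here's a condensed, hypothetical sketch: sp_execute_external_script trains a logistic regression model in R, the serialized model is inserted into a sales_models table, and a second procedure loads it back and scores incoming clickstream rows. The table names, columns, and R code are placeholders, not the demo's actual scripts:

```sql
-- Training: run R inside SQL Server, then persist the serialized model.
CREATE OR ALTER PROCEDURE dbo.train_book_category_visitor
AS
BEGIN
    DECLARE @model VARBINARY(MAX);

    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            # logistic regression over the featurized clickstream training data
            fit <- glm(visited_book_category ~ ., data = InputDataSet, family = binomial)
            trained_model <- as.raw(serialize(fit, connection = NULL))',
        @input_data_1 = N'SELECT * FROM dbo.clickstream_training_data',
        @params = N'@trained_model VARBINARY(MAX) OUTPUT',
        @trained_model = @model OUTPUT;

    INSERT INTO dbo.sales_models (model_name, model)
    VALUES (N'book_category_visitor', @model);
END;
GO

-- Scoring: load the stored model and predict over new clickstream rows.
CREATE OR ALTER PROCEDURE dbo.predict_book_category_visitor
AS
BEGIN
    DECLARE @model VARBINARY(MAX) =
        (SELECT TOP (1) model
         FROM dbo.sales_models
         WHERE model_name = N'book_category_visitor');

    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            fit <- unserialize(raw_model)
            OutputDataSet <- data.frame(
                InputDataSet,
                suggest_book_category = predict(fit, InputDataSet, type = "response") > 0.5)',
        @input_data_1 = N'SELECT * FROM dbo.web_clickstreams',
        @params = N'@raw_model VARBINARY(MAX)',
        @raw_model = @model;
END;
```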
- Very cool, so we've seen huge architectural advances here in terms of this direction for SQL, being able to really parse and analyze big data without having to move it or do a lot of ETL work against it. This is really huge.
- Yeah, it's big. (speakers laughing) Sorry for the pun, but yeah,
it's big, this is huge, and this is really like
a different dimension for SQL Server to scale in, where you get this caching ability, you get elastic scalability of your performance, with tiered compute and storage scalability, all built on top of this
native architecture. So it's really, if you think about it, SQL Server is not just a
database really anymore. It's really more of an integration and complete analytics platform that you can use to integrate all of this data together.
- Right. Again, when we think about cross-platform, it's not only about operating
SQL inside of Windows or Linux or multiple operating system types, but also against other databases. So, if somebody wants to learn more about these capabilities and really try it and go hands-on for themselves and connect to Oracle and
start using Spark and HDFS, where do they learn more? What should they actually be doing right now to get hands-on?
- Yeah, exciting news. We announced the SQL Server
2019 preview this week. You can go and get that for SQL
Server on Windows and Linux. And if you're interested in the
big data cluster part of it, we encourage you to sign up for this early adoption
program at the link here where we can then assign you to work with the program
manager on our team and then we can really
get hands-on with you as you try to evaluate this big data cluster technology and we can learn from you, kind
of how that experience goes. - Awesome, great to have you
on the show again, Travis. Thank you. Thanks for coming, and that's all the time
we have for today's show. - Thanks, everybody.
- Goodbye for now. (audience applauds)
(upbeat music)