Architecting robust Big Data Solutions with Azure

Captions
Cool. Good morning and welcome to this session. My name is Bhanu Prakash and I am a program manager in the big data product group that builds Azure HDInsight, Data Lake, and other big data tools. Today I'm going to talk about how to build robust big data solutions, and to me that means two things. One is why you want to build a big data solution in the first place — what business problems you want to solve with it. The second is how you build a robust end-to-end architecture for it. I'll focus more on the latter, the architecture, but I'll start with the business use cases, because I think it's always important to understand what business problems you are going to solve with a big data solution.

So let's get started. We'll start by understanding what a data lake is and how it fits into your data strategy. We'll talk about architecting a Lambda solution for big data — how many of you here have heard of the Lambda architecture? Okay, quite a few, I would say about a tenth, so I will spend some time defining it. And thirdly, what we often hear from the field and from customers is: "You have three services that almost do the same thing — which one should I use?" So I will walk through the different services that are available and which one you should use for which scenario.

Let's talk about the evolution of big data. Originally, big data started as a solution to large batch problems: things like ETL, done overnight — processing petabytes of data every night, getting the results, and repeating that on a regular interval. But the reality is that people started to use, and aspire to use, big data the same way they would any of their on-premises solutions — more real-time scenarios, doing sentiment analysis and so on. And once you had batch and real time, it evolved into predictive analytics, so business users could not only do batch processing of the data but also real-time processing at the same time, and use tools like R and other technologies to do predictive analytics with machine learning.

Before going into the architecture, as I said, I want to start with some business use cases and what people are doing in real life. Virginia Tech was using petabytes of data to do genome sequencing. When they were doing this on-premises, it used to take them two weeks to sequence one genome. When they moved to Azure and burst to HDInsight, they were able to do 20 genome sequences per day. Not only that — they were able to scale their compute nodes up when the processing requirement was large, and scale down or shut down the cluster afterwards, and that helped them save cost. If your requirement on day one is 20 nodes, you scale from 10 to 20 nodes; the next day, or on Sunday, when you don't want to do any compute, you just shut down the cluster and restart it the day after. That saved them a lot of money.

Other companies have been using, or plan to use, big data more from a BI point of view. In this case, Carnegie Mellon was using big data solutions to reduce energy costs.
A lot of real estate operators — universities, malls, strip malls, and large companies like Microsoft — have also been thinking about how they can use big data to reduce energy costs. In this case, just by using SQL, Power BI, and HDInsight, Carnegie Mellon was able to cut costs by 30%, and that's without even using streaming or machine learning. All of these organizations figure that even a 10 or 20% reduction is a significant saving.

Now let's talk about companies using big data for streaming. I talked about batch, I talked about BI, and now I'm talking about the streaming part, which is more like real time. If you go to some restaurants, there is a tablet on the table. You browse the tablet to play games or go through the menu to find appetizers, desserts, and so on. What's happening in the background is that the data you are generating is being streamed back to a compute engine in real time, so that machine learning APIs can assign a rank or a score, and then send data back to your tablet with promotional offers. Say you've ordered the main course and you and the kids are looking at the dessert menu — it might send you an offer like desserts for the whole family for $20, or something like that. That's all powered by the machine learning APIs, and through that, companies like Chili's have been able to improve customer satisfaction. So that's one way to use streaming and machine learning to improve customer satisfaction.

So what would an aspirational solution look like when you combine all the pieces — batch, real-time streaming, machine learning — into one end-to-end solution? We have a demo I want to show before going into the architecture, to show the possibilities that big data tools can enable. So let me switch to that.

Woodgrove is a financial company, and I'm going to talk about two personas. One is Jeff, who is a financial advisor, and the other is his boss, Rob, who manages a portfolio of financial advisors who report to him. Jeff logs on to the app and sees how his clients are doing, gets real-time information through news and alerts, and makes business decisions based on that. Jeff is out for lunch when he gets an alert that Contoso's earnings dipped because of irregularities found in an emissions test. He wants to know whether any of his clients are impacted by this news. He logs on to the app, and in the app you can see real-time stock data coming in through Event Hubs, news videos, the stocks themselves, and the different sectors. Since Contoso is an automotive-sector company, he looks at which clients have investments in the automotive sector, and he sees Ben Andrews at the top — a client who has invested in that sector and appears to be negatively impacted. He clicks on Ben's profile and sees the CRM data: his age, his phone number, the last meeting, the last call.
He starts looking at all the data — Ben's investments and some KPIs such as performance versus target, the churn score, and the risk score. The churn score and risk score on the right are driven by machine learning; the other figures are real-time streaming data coming in; and then there is the CRM data I mentioned. He clicks on the churn score, which opens a Power BI report where he can analyze what's going on. That BI report is fed from a SQL Database or a SQL Data Warehouse. He can see churn analytics, share volume, real-time stock trend, and so on. When he drills into total transactions, he sees that Ben has the highest volume of transactions among all his clients — an indication that Ben may be moving away from being his client because of all the news he has seen. So Jeff clicks on the customer churn view and makes a note that he will talk to his manager and then discuss with his client, to make sure he can retain Ben. He goes back, and now Ben is flagged in both places, so whenever Jeff logs into the app on his phone or on the web, he remembers he needs to talk to him.

That was one persona; let's talk about the other. Rob, the manager Jeff is going to talk to, signs in and sees the data that matters to him: the streaming data, the stock quotes, the top performers, the bottom performers. At the same time, he wants to make sure the system stays compliant, and all the alerts he sees are again powered by machine learning APIs running in the background. There is an alert called suspicious trading. He clicks on it because he wants to make sure his direct reports and their clients are not doing anything illegal. He goes to the Power BI report and looks at suspicious trades, individual stocks, and so on. On the chart, he selects Michael Alan for a particular security, and it shows that Michael made a lot of high-value transactions just before the stock went to its peak — a fairly clear indication that he may have had insider information and that's why he traded so heavily. So Rob clicks Alert Compliance so his compliance team can investigate and make sure the system stays compliant. Then he goes back.

This was an example of how a big data solution can be built so that everything is embedded in the business application your business users already use, tailored to their personas — you don't have to go to different tools.

So let's go back to the architecture, and talk about how to build this the right way, with the right architecture and the right tools, to enable a solution similar to the demo you just saw.
Oftentimes when I get on a call with customers or with our field, the questions fall into three categories. The first one is: "I'm not sure if my architecture is correct — can you take a look?" Part of the reason is that they aren't sure whether to use Event Hubs or Kafka, SQL DB or SQL Data Warehouse, Data Lake Analytics or HDInsight, Stream Analytics or Spark Streaming. That leads to the second category of question, and I'll talk through all these different services and which one you should use. And thirdly, there are situations where a customer builds the architecture based on whatever knowledge they have, tests it, and two weeks before production they come back and say, "Hey, I built the solution and we're going live in two or three weeks — can you please check whether my architecture is right?" So we'll walk through how to answer these different questions and how they help define the architecture you want to build. Frankly, these questions are what lead to the core principles for building a big data architecture on Azure. The first is: decouple the data bus — we'll go into detail on that. The second is: build a Lambda architecture for a robust big data solution. The third is: choose the right tools and services. And the fourth is: understand the importance of managed services — whether you want IaaS, PaaS, or SaaS. We'll talk about all of these.

So, decouple the data bus. In the on-premises world, if you look at people running Hadoop, the steps involved are: first, scope the cluster size — whether you need two nodes, ten nodes, and so on. Once you have that, you get the open source bits, or bits from a vendor like Cloudera. Then you build it and configure it. You have to make sure all the services are up and running, and then administer and monitor it continuously; if things go wrong, you have to get the VMs and nodes back up. Only then do you load data and run queries so users can get meaningful information out. Just buying the hardware and getting it approved inside your company can take a couple of months, and once you're at that stage, if something goes wrong you have to go back to stage one — re-estimate how many nodes you need — and repeat all the steps.

So let's say you start with two nodes, and at this point storage and compute live on the same machines. Day one: zero CPU utilization, zero storage consumed. After six months, because data has been added continuously, the storage is almost full — around 90% — while compute sits at maybe 25% on average; you do have queries running, but storage is the constraint. So you think, let me add more nodes, and you add two more. Now storage is back down to 40 or 50%, and compute drops along the same lines. Things are fine for a while, and then you get more users and more queries. By day 400, compute is hitting 100% at peak and about 75% on average, but storage is fine — storage is not the bottleneck this time. And you think, I need more nodes because I can't do any more processing, so you go ahead and double it again.
And now your storage is sparsely used again, and compute is fine too. The point of this story is that because storage and compute are in the same box, you can't scale them independently — you have to add both even when only one of them is the bottleneck.

So what "decouple the data bus" means is that in a typical big data pipeline you have separate storage and separate compute. In the previous example, when storage was the bottleneck you were adding storage and compute together because they were part of the same box, and when compute was the bottleneck you were again adding both. With an architecture where storage and compute are separate, you can scale each of them up or down independently. We'll also talk about how the data gets ingested, through bulk ingestion and event ingestion, and of course it has to be consumed by business users. Here, note that discovery and visualization are different things. By discovery I mean enabling business users to find the data — making sure the paths are documented, or using tools like Informatica or a data catalog. Visualization is enabling business users to run interactive reports, just like the demo where Jeff and Rob went into a Power BI report and made decisions in real time based on machine learning output — so visualization is powered by tools like Power BI, Informatica, or Tableau.

Let's put product names on this. For bulk ingestion we'll talk a little about Azure Data Factory and how it helps move data, and we'll spend a fair amount of time on Azure Event Hubs and Kafka. For storage, we'll go in depth on Data Lake Store. For compute, we'll talk about HDInsight, Data Lake Analytics, Stream Analytics, and Machine Learning. And for visualization we'll talk about Power BI, SQL DB, and SQL Data Warehouse, and which one you should use.

So, Data Lake Store. Data Lake Store is generally available — until last year it was in public preview, and I think in November we made it generally available. It's available in a couple of regions in the US. Data Lake Store is something we built from the ground up, based on the learnings from the internal storage that powers a lot of applications within Microsoft. There are no limits to scale: you can literally store petabyte-sized files and millions of files in the same storage account. The throughput is good. You can store data in its native format, and I'll talk about that in a bit. It has enterprise-grade access controls — role-based access control, a firewall so that only a defined set of IPs can access the store, and encryption at rest. And when I talk about scalability, you'll see a performance difference, especially when your data is big, compared to other similar tools doing the same job.
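As a concrete illustration of the "store first, attach compute later" idea, here is a minimal sketch of landing raw files in Data Lake Store from plain Python using the azure-datalake-store package, with no cluster involved at all. The store name, tenant, and service principal values are placeholders you would replace.

```python
# Minimal sketch: write raw files into Azure Data Lake Store (Gen1)
# independently of any cluster, using the azure-datalake-store package.
# Store name, tenant, and service principal values are placeholders.
from azure.datalake.store import core, lib, multithread

TENANT_ID = "<your-aad-tenant-id>"                 # placeholder
CLIENT_ID = "<your-service-principal-id>"          # placeholder
CLIENT_SECRET = "<your-service-principal-secret>"  # placeholder
STORE_NAME = "mydatalakestore"                     # placeholder ADLS account name

token = lib.auth(tenant_id=TENANT_ID, client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
adls = core.AzureDLFileSystem(token, store_name=STORE_NAME)

# Land raw data in its native format; compute (HDInsight, ADLA) attaches later.
adls.mkdir("/raw/sensors/2017/01/24")
multithread.ADLUploader(adls,
                        lpath="sensors-20170124.json",
                        rpath="/raw/sensors/2017/01/24/sensors-20170124.json",
                        overwrite=True)

print(adls.ls("/raw/sensors/2017/01/24"))  # verify the file landed
```

Any compute engine — HDInsight, Data Lake Analytics, or a Spark job — can later be pointed at those same paths when it is needed, which is exactly the decoupling being described.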
So, Azure HDInsight — how many of you here know about Azure HDInsight? Okay, a few. Azure HDInsight is a PaaS kind of service for open source Hadoop and Spark. Unlike the on-premises example I was describing, where you have to purchase hardware, install all the services, and so on, here you just go to the Azure portal and select the type of cluster you want — Hive; Spark, which we added I think last year; Kafka, which is in public preview; Storm; and so on, plus R Server. You select the cluster type, give a few inputs like how many worker nodes you want and what VM type you need — D series or A series and so on — and within a few minutes your cluster is up and running. If your needs change — say you want it just for batch processing — you run it, delete it, and recreate it when you need it again. If you want it for real time, let it run with the right number of nodes. If you want to scale it up or down, you just move the slider to the right number of nodes. It has support for Visual Studio, IntelliJ, and your existing development environments. We are constantly evolving it — HDInsight has been generally available for a couple of years, and we recently shipped secure Hadoop, so it has enterprise features like monitoring, it supports Apache Ranger, and it has Active Directory integration for authentication, so you get the right security built into the Azure HDInsight cluster. And because you don't pay for hardware — you pay only for the compute, and separately for the storage, whether that's Azure Blob Storage or Data Lake — you can lower your total cost of ownership: you pay only for the compute time you actually use. If you only need two nodes, you pay for two nodes, not ten or however many you started from. I mentioned we also added R Server, which has roughly 1,000x more data capacity and 50x better performance than open source R. Kafka is in public preview and will go GA soon. And it supports the right set of BI tools that you need.

Now, Data Lake Analytics. I was describing HDInsight as a PaaS service where you don't have to take care of the VMs, there's no patching, and so on. Azure Data Lake Analytics is more like SaaS, I would say — more like a job service. In this case you don't even spin up a cluster. You literally go to the Azure portal or to Visual Studio, write your query in a SQL-like language with a mash-up of C# (U-SQL), and submit it, and we take care of running that query on a shared, managed cluster. Based on how much parallelism you want, you choose the level of parallelism — four vertices, ten vertices — and that determines how fast your query executes. You pay for the time the query runs times the level of parallelism you chose. The good thing about both Data Lake Analytics and Azure HDInsight is that we provide the best SLA in the industry: it's not just your VM instances, it's the end-to-end cluster that we make sure is up and running 99.9% of the time. If a VM goes down or a service goes down, we make sure it comes back up.
And we have an on-call team to support that. In fact, if you're familiar with Hadoop, there is something called the NameNode. In traditional Hadoop the default is a single NameNode, but in HDInsight we've made it highly available, so if one goes down it fails over to the other.

Coming back to Data Lake Analytics: it has role-based access control, and it federates across different data sources. What I mean by that is that when you run a query in Data Lake Analytics, you don't have to copy data from a database into Data Lake Store, or from the Data Warehouse into Data Lake Store. You run the query in Data Lake Analytics and we take care of integrating across the different sources and augmenting the data, without creating duplicate copies of it — which is something you wouldn't want anyway.

Let's talk about a common user scenario: ETL, which has been one of the classic uses of Hadoop and of tools like HDInsight or ADLA. What I've seen is that some customers do the ETL at the sources where the data comes in, before feeding that data to SQL Data Warehouse. My recommendation is to stop doing that, because you can use the parallelism offered by HDInsight and Data Lake Analytics to do the ETL there, instead of doing it at the sources or burning your Data Warehouse and database capacity on it. If you run your job on 10, 20, or 100 Hadoop nodes, you can use that same distributed, parallel compute power for your ETL jobs. What I mean is: load your data into Data Lake Store or Blob Storage, let HDInsight or Data Lake Analytics do the ETL, and store the results back into Data Lake Store or other storage for machine learning and further processing. You don't have to spend your existing data warehouse or other tools on that.

So let me show what it could mean to move from on-premises to Azure, and what a simple architecture might look like. You load the data into Azure Data Lake Store, and you have compute on top — say a couple of HDInsight clusters, or processing through ADLA, Azure Data Lake Analytics. Then through Azure Data Factory you push that data into SQL DB, which is the data source that drives Power BI. Over time your storage needs grow, so your data and the Data Lake Store grow, and at the same time your compute needs grow, so Azure Data Lake Analytics and HDInsight grow too. One other thing you'll notice is that where you were using SQL Database before, you're now using SQL Data Warehouse — I'll talk about which one to use, but roughly, when your storage needs grow beyond about a terabyte, you move to SQL Data Warehouse. The good thing about this architecture is that each piece of the puzzle can be swapped for an equivalent service that supports a more scalable scenario without impacting the rest of the system. In this case you moved from SQL DB to SQL Data Warehouse, and Power BI, which was working against SQL DB, now works against SQL Data Warehouse without any issue. The same goes for Data Lake Store: it was feeding data to SQL DB, and now it feeds SQL Data Warehouse, and it works fine.
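To make that pattern concrete, here is a minimal sketch of "do the ETL on the cluster, keep the warehouse for serving" as a PySpark job on an HDInsight Spark cluster. The adl:// paths, column names, and SQL connection details are all placeholders, and the same logic could equally be written as a U-SQL script for ADLA.

```python
# Minimal PySpark ETL sketch: raw CSV in the lake -> cleaned/aggregated data
# back to the lake -> a small serving table pushed to Azure SQL DB for Power BI.
# Paths, account names, columns, and connection details are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-etl").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .csv("adl://mydatalakestore.azuredatalakestore.net/raw/trades/2017/01/"))

# Basic cleanup, then an aggregate small enough to serve interactively.
clean = (raw.dropna(subset=["symbol", "price"])
            .withColumn("price", F.col("price").cast("double")))
daily = clean.groupBy("symbol", "trade_date").agg(
    F.sum("price").alias("total_value"),
    F.count("*").alias("trade_count"))

# The curated result stays in the lake, in a columnar format, for reuse.
daily.write.mode("overwrite").parquet(
    "adl://mydatalakestore.azuredatalakestore.net/trusted/trades_daily/")

# Only the serving-sized result goes to SQL DB / SQL Data Warehouse.
# Requires the SQL Server JDBC driver on the cluster classpath.
(daily.write.format("jdbc")
 .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=servingdb")
 .option("dbtable", "dbo.trades_daily")
 .option("user", "etl_user")
 .option("password", "<secret>")
 .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
 .mode("overwrite")
 .save())
```

The heavy lifting happens on compute you can scale or shut down, and only the small serving-sized result lands in the database behind your reports.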
You added more nodes, so you were able to scale your compute independently. You were able to scale your storage independently. You were able to scale your presentation layer from SQL DB to SQL Data Warehouse. That is rule number one of building a robust solution: scale your components without impacting the rest of the system. And in the near future we will be adding support for PolyBase, which will parallelize loading data into the data warehouse, so it gets there much faster than it would through Azure Data Factory — your business users don't have to click and then wait two or five minutes; it will be close to real time and much faster.

Now, the Lambda architecture I said we'd talk about. It was first described, I think, five or six years back (by Nathan Marz), and the idea is that you load your data once and do both batch processing and real-time processing in the same environment. You get the best of both worlds in one place — you don't do batch over here and real time over there; it's all in the same solution.

So how do you go about building a Lambda architecture? Generally speaking, it has four layers. You start with data consumption, which is the bulk ingestion or the event ingestion. Then you go through preparation and analysis, which is the transformation, extraction, and so on — it has a speed layer and a batch layer, both equally important; this is the compute part, plus the storage, all kept separate. The batch layer is what I was describing earlier: using ADLA or HDInsight to do batch processing. You use HDInsight to process your queries, and if a query takes an hour, you can scale up your nodes and do the same query in, say, 50 minutes — it isn't truly real time, but in Azure, in the cloud world, it becomes what I'd call "real-ish" time. On top of that is the speed layer: for IoT-style scenarios where data arrives at a very high rate, you add tools that can process and store data at that rate, and you combine the speed and batch layers, just like in the demo we saw. And then there's the presentation layer, which is how you consume the data through Power BI, Tableau, and other tools, with something like SQL Data Warehouse as the backend data for that presentation.

Let's talk about the requirements behind the Lambda architecture. Scalability — we've covered several examples: you can scale HDInsight from 10 to 20 nodes just by moving a slider or doing a manual resize; similarly for your storage; and for Azure Data Lake Analytics you just go to Visual Studio or the portal and change the slider, raising or lowering the number of vertices. Basically, each piece, each component, should scale up or down independently, and that helps you save and manage cost.
Resiliency is the ability to come back when something breaks, and Hadoop and the other components have resiliency built in. HDInsight has more of it than plain Hadoop — the NameNode is highly available and failover happens automatically, and the 99.9% uptime applies not just to a VM but to the whole end-to-end cluster. It's very important to have resilient services backing this architecture. Latency matters too: especially with Kafka, data arrives at a very fast rate and the system has to support low latency for that. Extensibility — by that I mean what we saw earlier: being able to swap the database for a data warehouse, or switch from HDInsight to ADLA, and so on. The system should let you swap in the right services or tools for your purpose without affecting the end-to-end scenario. And of course it has to be maintainable, in the sense that it has to be upgradable, and the VMs have to be patchable, so you need to choose the right tools for that. For example, you could run Hadoop in an IaaS manner — take IaaS VMs and install Hadoop on them — or you could use the HDInsight service directly. With HDInsight, the OS patching and the 99.9% availability are already taken care of; with IaaS you have to do extra work to get to that level; and on-premises you have to do literally everything yourself. So decide what level of manageability you need and how you want to maintain it, and pick the right set of tools accordingly.

Let's start with the first layer, data consumption — the layer where data is ingested into the system. Consider event ingestion: data arriving at a very high rate from business apps, web apps, custom apps, devices, and sensors — a lot of IoT scenarios are being enabled these days, and the data comes in very fast. Here you have two major options. One is Azure Event Hubs, a really simple-to-use service for collecting events from different sources. The other is Kafka — open source Kafka that you can run as a managed PaaS service, currently in public preview — and you can use either of these to ingest the data. Then, for processing and analytics on that stream, there are three options. The first is Stream Analytics, which is again a sort of job service, a SaaS-style service, for processing data as it arrives at a high rate. Then there are open source Storm and Spark Streaming, which we'll also talk about. Once the data is processed, it goes to Power BI for consumption, and you also want to make sure the data lands in ADLS for future machine learning or other processing.

So let's talk about ingestion technology — it looks like there are two options, Kafka and Event Hubs.
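As a reference point for how little code the application side needs, here is a minimal sketch of pushing events into Event Hubs with the azure-eventhub Python SDK (shown with the current v5-style API, which is newer than what existed at the time of this talk). The connection string, hub name, and event payload are placeholders.

```python
# Minimal sketch: push simulated device/app events into Azure Event Hubs.
# Connection string, hub name, and payload shape are placeholders.
import json
import time

from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>"  # placeholder
producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name="stockticks")

batch = producer.create_batch()  # optionally create_batch(partition_key="MSFT") to pin a partition
for i in range(100):
    tick = {"symbol": "CNTSO", "price": 42.0 + i * 0.01, "ts": time.time()}
    batch.add(EventData(json.dumps(tick)))

producer.send_batch(batch)  # events are now available to Stream Analytics, Spark, etc.
producer.close()
```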
You'll come across different tables comparing and contrasting these services, and mine uses shading to show which one I'd lean toward. The reason I've shaded Azure Event Hubs favorably compared to Kafka is that with Event Hubs you don't have to manage a cluster — it's a PaaS service, you don't go and create a cluster at all, whereas with Kafka you do, even though it's offered as a managed service. Event Hubs is very simple to use: you're looking at around 20 throughput units, the cost is low, you pay for your usage of the stream, and the record size limit is 256 KB. Both are on par in the sense that they can handle millions of events per second. But if you need something extremely scalable that can process many billions of events per day — connected-cars scenarios, for example — go with Kafka. If you're just getting started, start with Event Hubs, which is pretty simple, and when your needs change and you need event processing at a much larger scale, then think about Kafka.

That covers ingestion; now the stream processing side. The data is ingested through Event Hubs or Kafka and then goes to stream processing, where there are three options: Stream Analytics, Storm, and Spark Streaming. Stream Analytics is in the same spirit as ADLA — except ADLA is more of a batch thing, and this is real-time streaming — a SaaS-style service for processing. It's very simple and easy to use: you get windowing aggregations out of the box, and you don't have to keep a cluster up and running the way you do for Storm and Spark. If you're just starting with stream processing, start with ASA, which is simple to use, and you can always move to Spark Streaming or Storm later. One thing I'd point out about Spark and Storm is that they are open source technologies, so the ecosystem is huge — if you have questions, you can usually find answers on Stack Overflow and similar forums, rather than asking a question yourself and waiting a while for an answer. Spark and Storm are also what you'd consider for very highly scalable scenarios, and you can manage costs with them, but from an ease-of-use perspective you should start with something like Stream Analytics.

So how do Event Hubs and Stream Analytics handle scale? When events arrive, Event Hubs partitions them by PartitionId, and when Stream Analytics consumes them, you can partition the query by PartitionId as well. That way, instead of running one single query, you're essentially running multiple queries in parallel, and the throughput of the pipeline goes up.
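Stream Analytics expresses its windowed aggregations in a SQL dialect; for comparison, here is roughly what the same kind of speed-layer job looks like in Spark Structured Streaming reading from Kafka. This is a sketch only — it assumes a newer Spark version with the spark-sql-kafka package available on the cluster, and the broker, topic, and paths are placeholders.

```python
# Sketch of the speed layer in Spark Structured Streaming: read ticks from a
# Kafka topic, count them per symbol over 1-minute windows, and land the
# results in the lake. Requires the spark-sql-kafka package on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

ticks = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
         .option("subscribe", "stockticks")                  # placeholder topic
         .load())

parsed = ticks.select(
    F.get_json_object(F.col("value").cast("string"), "$.symbol").alias("symbol"),
    F.col("timestamp"))

counts = (parsed
          .withWatermark("timestamp", "2 minutes")
          .groupBy(F.window("timestamp", "1 minute"), "symbol")
          .count())

query = (counts.writeStream.outputMode("append")
         .format("parquet")
         .option("path", "adl://mydatalakestore.azuredatalakestore.net/speed/stock_counts/")
         .option("checkpointLocation", "adl://mydatalakestore.azuredatalakestore.net/checkpoints/stock_counts/")
         .start())
query.awaitTermination()
```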
One thing you might want to do is use machine learning and streaming at the same time — which is great, and it's exactly what we saw in the demo. But one caveat is that performance can drop: the data is being streamed and processed by Stream Analytics, and before it reaches the Data Lake Store or the application, you're calling ML APIs to assign a rank or a score, and the ML service has to process each event. That can cut the rate at which events reach the Data Lake or the application by two or three times. The recommendation is: if you really want to score inline, take advantage of partitioning by PartitionId and run multiple queries in parallel; otherwise, let the data land in Azure Data Lake Store, do the machine learning in a batch manner, and feed the results into the application from there.

Now the data store. Data Lake Store — we touched on this already: there are no limits to scale, you can have petabyte-sized files and millions of files in the same storage account, the data can stay there as long as you want, the right security is built in, and it scales seamlessly from kilobytes to petabytes. Some of the technical requirements that drove the design of Data Lake Store: as I said, it was built from learnings we had in-house. On security, it has the right level of enterprise controls baked in — it prevents unauthorized access, lets you restrict access to the right set of IPs, and has file- and folder-level access control. It's highly scalable, it supports real time, batch, and machine learning alike — one source for all the different types of processing you need — it supports low latency, and reliability is built in.

Let's talk about the data types you can store in a data lake. With big data you want to store the data exactly as it arrives from the different sources — you don't want to change its structure, because you want to keep everything in whatever format it comes in. Start with unstructured data: logs, pictures, and so on. It also supports semi-structured data — JSON, XML, Avro, and various schema formats — and structured data, like CSV and other structured formats.

When you're using Hive, one recommendation is to save the data in a columnar format. The reason is that big data tables are very wide, and with a row-based format you have to scan the entire row. Even if you're only selecting column A and column B, you end up scanning those very wide rows, and that takes a lot of time — imagine scanning every column of every row versus just the two columns you care about. That's why the recommendation is columnar. The downside is that you do have to convert the data into that format, but once you have, queries are much faster. There are two formats available: the ORC file format, which Microsoft built together with Hortonworks and the open source community — it's best for Hive and supports ACID insert, update, and delete — and the Parquet format, which you can use for mixed Hive and Spark workloads.
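As a small sketch of that conversion, assuming placeholder paths and columns, here is the one-time rewrite of a wide CSV into a columnar format with PySpark, followed by a query that only touches two columns (df.write.format("orc") would target ORC for Hive instead of Parquet).

```python
# Sketch: convert wide, row-oriented CSV into a columnar format so queries
# that touch only a couple of columns don't pay to scan every field of every row.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("to-columnar").getOrCreate()

wide = (spark.read.option("header", "true")
        .csv("adl://mydatalakestore.azuredatalakestore.net/raw/transactions/")
        .withColumn("amount", F.col("amount").cast("double")))

# One-time conversion cost...
wide.write.mode("overwrite").parquet(
    "adl://mydatalakestore.azuredatalakestore.net/trusted/transactions_parquet/")

# ...after which a projection like this reads just the two columns it needs.
cols = (spark.read.parquet(
            "adl://mydatalakestore.azuredatalakestore.net/trusted/transactions_parquet/")
        .select("customer_id", "amount"))
cols.groupBy("customer_id").sum("amount").show()
```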
So, choosing between Azure Data Lake Store and Blob Storage. One thing about Data Lake Store is that it is not available in all regions yet. If you're in, say, Europe or Asia Pacific, the option available is Azure Blob Storage; if you're in the US, some regions have Data Lake Store. Both are generally available, and we have hundreds of customers using Blob Storage. If you need petabyte-scale file sizes and that many files, you can use multiple Blob Storage accounts to deal with it, and the good thing is that for compute tools like HDInsight or Azure Data Lake Analytics, performance is not impacted even if you spread the data across multiple storage accounts. So use multiple accounts if your scale demands it; if not, one or two accounts will serve the purpose. The other thing to say is that Data Lake Store is the storage we expect most customers to be on in the future. Today most customers are on Blob Storage, but over time we expect them to move to Data Lake Store because of the scalability benefits; Blob Storage has some bandwidth limits and so on. But again, if you use multiple storage accounts, performance from the compute side doesn't suffer.

We often talk with customers about how to store data for enterprise robustness, and one recommendation is to organize it by organizational and application folder structure. A question customers ask is: "I'm using this Data Lake Store account and I have multiple business units that want to store data — HR, e-commerce, finance. How do I bill them separately?" I don't have a better answer than: use different storage accounts. That way HR has its own account, e-commerce has its own, and you can bill them separately — and as I said, the number of storage accounts doesn't hurt performance; the compute can still reach the data just fine.

Beyond that, you'll of course have different folder structures for applications, business units, and so on, and you want the data partitioned at a granular level — by year, by month, by day, down to whatever level of partitioning you need. The main reason is that when you run queries in Hive, you then scan only the files and folders you're actually interested in instead of scanning everything, so the right structure also helps performance.

The other thing I mentioned is data discovery, for business users who are not so tech-savvy but need access to the right data. Always make sure the paths are documented correctly. I've seen customers use acronyms or crazy names, and new employees who don't know those acronyms have a very hard time — and that's how you end up with dark data, data that is not being used.
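Here is a small sketch of that year/month/day layout with PySpark (placeholder paths and column names); Hive external tables partitioned the same way get the same pruning benefit.

```python
# Sketch: lay curated data out by year/month/day so engines like Hive or Spark
# only scan the folders a query actually needs (partition pruning).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-layout").getOrCreate()

events = spark.read.parquet("adl://mydatalakestore.azuredatalakestore.net/trusted/events/")

(events
 .withColumn("year", F.year("event_time"))
 .withColumn("month", F.month("event_time"))
 .withColumn("day", F.dayofmonth("event_time"))
 .write.mode("overwrite")
 .partitionBy("year", "month", "day")  # becomes /year=2017/month=1/day=24/ folders
 .parquet("adl://mydatalakestore.azuredatalakestore.net/trusted/events_by_day/"))

# This filter is satisfied from the folder names alone, so only one day is read.
jan24 = (spark.read.parquet("adl://mydatalakestore.azuredatalakestore.net/trusted/events_by_day/")
         .where("year = 2017 AND month = 1 AND day = 24"))
print(jan24.count())
```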
So even though you have the option of tools like Informatica, data discovery tools, Data Catalog and so on, the smart way is to make sure things are properly documented — retrofitting after the fact is difficult. When I talk about getting the structure of files and folders right, you want to build it right the first time. You can fix it later, but you have to set aside the time, and it costs you — so get it correct up front.

Now, big data storage for enterprises. You want the right level of isolation for the data for different needs. For example, you need an isolated area for raw data: all the raw data that comes in from the different sources lives there, and no write permissions are given to developers or business users. Then you have staging, where data is copied to a staging environment for developers and testers to use as their test and dev environment. Once the data is vetted and trusted, it goes to the trusted area, where it is used only for production purposes. And then you have a sandbox environment, where data is copied for data scientists and business users to experiment — trial and error in different ways to find the right BI use cases — and once they find that one in a hundred that works, they move it back through staging so it can be promoted to the trusted area and run in production.

Next, some access characteristics for different classifications of data — hot, warm, cold — across different dimensions. The reason I wanted to show this is that it's not just volume or item size alone that decides whether data is hot, warm, or cold. It may be that retrieving the data takes a long time and you're fine with that — that's cold data. It's the volume, the item size, the latency, and other factors together that categorize the data. Then I've put those characteristics on a Venn diagram, because you want to choose the right storage based on the characteristics you actually need. Don't put all your data into Blob Storage just because you happen to have a Blob Storage account — look at the scenario and put the data in the right store. That may mean you end up with multiple stores, but it also means you're building the right, robust solution that serves its purpose at the lowest cost. So, for example, if you need something highly structured, go with SQL DB or SQL Data Warehouse; if you need very low latency, think about a cache or a NoSQL database; and for cold, low-structure data, start with something like HDFS or Azure Data Lake Store.
To categorize this a little further in terms of tools: if you're looking for structured, relational storage, start with the Data Warehouse; for complex event processing, you use Stream Analytics and the tools we'll discuss shortly; for non-relational data, you use a NoSQL database. On the warehouse side, if your data is under about a terabyte you use SQL DB, and for highly scalable scenarios you use SQL Data Warehouse. We've covered most of this already, so I won't repeat it, but one thing I'll point out about Spark Streaming is that you'd use it for its very low latency, and if you're already using Spark it's an all-in-one solution — I'll go a little deeper into it in the next 15 minutes. From a processing point of view: if you want high SQL compatibility, that's SQL Data Warehouse; if you can tolerate high query latency, that's MapReduce on HDInsight; and for low query latency, it's HDInsight with Spark.

On big data pipelines, one thing I'll say is that customers often think: "Let me build this 40-node cluster. I'll use it for ETL, which runs overnight for eight hours, and during the day I'll use the same cluster for user queries." Then they ask, "For this use case, what's the right number of nodes?" The right answer is: why just one cluster? Why pay for a 40-node cluster for two different scenarios when you could serve the same purpose with two different clusters? Your ETL may need 40 nodes, but then you're keeping those same 40 nodes for the rest of the day, and remember — even when compute nodes are idle, you're paying for them. Why keep a 40-node cluster when the user queries might only need four or eight nodes for the rest of the day? The recommendation is: use 40 nodes for the ETL at night, for the hours you actually need them, and then shut that cluster down; and keep an eight-node cluster just for the user queries, for the hours you need it. When the 40-node cluster is shut down, you're not paying for it, so you pay only for what you actually consume in each form. And the good thing is that both clusters can access the same data — it's not that the 40-node ETL cluster needs its own copy and the query cluster needs another. They share the same storage, and that way the one data set serves both scenarios.
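To put rough numbers on that, here's a back-of-the-envelope comparison with a purely hypothetical per-node-hour rate (real HDInsight pricing varies by VM size and region):

```python
# Back-of-the-envelope comparison: "one big cluster all day" versus
# "big ETL cluster at night + small query cluster by day".
# The per-node-hour rate is hypothetical; plug in real pricing for your VM size/region.
RATE = 0.50  # hypothetical $ per node per hour

one_cluster = 40 * 24 * RATE                # 40 nodes running 24 hours
split = 40 * 8 * RATE + 8 * 16 * RATE       # 40 nodes for 8h of ETL + 8 nodes for 16h of queries

print(f"Single 40-node cluster, 24h:              ${one_cluster:.2f}/day")
print(f"40-node ETL (8h) + 8-node queries (16h):  ${split:.2f}/day")
print(f"Daily savings:                            ${one_cluster - split:.2f}")
```

With these made-up numbers that's $480 versus $224 a day, and both clusters read the same files in the lake, so nothing has to be copied.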
Now that we've talked about all the different pieces, let's talk about interaction — how business users consume the data. One of Gartner's predictions was that by 2017 a lot of users would want to consume data through an interactive layer using self-service tools, and you do see that happening. There are two options for the backend data behind Power BI: SQL Database and SQL Data Warehouse. SQL DB, as I said, is for data under about a terabyte. SQL Data Warehouse, meanwhile, has all the properties you saw for HDInsight or ADLA compute: you can scale it up and down, you can pause it in seconds, and you can scale storage and compute independently. You can run the T-SQL queries you're already familiar with, and the PolyBase integration — with ADLS support coming in the near future — makes it even more powerful. It's basically running SQL on SQL: the backend is SQL Data Warehouse and you run SQL on top of it.

So, Data Lake and Data Warehouse. Some of you might be thinking: I've heard about the data lake, I've heard about the data warehouse — which one should I use? The reality is that if you go back to that base Lambda architecture, with the presentation layer on one side and the data lake on the other, they are complementary — the data lake complements the data warehouse. It goes back to what I said earlier: use the parallel processing offered by tools like ADLA or HDInsight for the ETL, and don't use the data warehouse for that. The data warehouse just serves data to your presentation layer, so you don't scale it for ETL jobs — you scale it based on the number of users you need to support interactively. The data lake is there to hold the raw data; you run the ETL over the data in the Data Lake Store and put the results back there, and once everything is structured you pass it on to the data warehouse. Again, that might mean having multiple tools, but it means building a robust architecture in the cheapest possible way. So on the data lake side, with the right compute tools like HDInsight or ADLA, you're ingesting, cleaning, and cataloging the data and doing machine learning, and then you push it to the data warehouse or SQL DB for structured analysis — serving as the backend data for Power BI or Tableau and for quick aggregations. With PolyBase support, the data gets loaded at a very high rate using parallelism, and the two sides — the data sitting on Hadoop and the data warehouse — integrate very nicely.
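As a sketch of that "SQL on SQL" serving layer, this is what an interactive tool (or your own report backend) would do against the warehouse — ordinary T-SQL over ODBC. The server, database, credentials, and table are placeholders, and the Microsoft ODBC driver is assumed to be installed.

```python
# Sketch: query the serving layer (SQL DB / SQL Data Warehouse) with plain T-SQL.
# Server, database, credentials, and table name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=servingdw;"
    "UID=report_user;PWD=<secret>")

cur = conn.cursor()
cur.execute("""
    SELECT symbol, SUM(total_value) AS value, SUM(trade_count) AS trades
    FROM dbo.trades_daily
    WHERE trade_date >= '2017-01-01'
    GROUP BY symbol
    ORDER BY value DESC
""")
for symbol, value, trades in cur.fetchmany(10):
    print(symbol, value, trades)
conn.close()
```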
We haven't said much about Spark yet, so let's use this time to go into the details. It's the newest darling of the big data world, and its promise is having everything in the same environment: it has Spark SQL, it has Spark Streaming, it has machine learning libraries, it has GraphX, and it is very interactive — highly interactive, all-in-one tooling. We'll look at how it compares to the other tools. It's available in Azure as a managed PaaS service on HDInsight: just like any other cluster, there's a Spark cluster type, supported on Linux. I think the latest version is 2.0.1, which is much faster than 1.6 and has a lot of bug fixes. It carries the same SLA as any other HDInsight cluster, and it's 100% open source — taken from Apache, using the Hortonworks Data Platform base, just like the other HDInsight cluster types. You get the 99.9% SLA guarantee and the right certifications to keep the cluster compliant, without having to do that work yourself. It has a lot of community support, being open source, and you'll see development happening at a very fast rate, precisely because of that promise of doing everything in one environment — so why not take advantage of it? It has Scala support, Jupyter Notebooks and Zeppelin, and for development environments it supports IntelliJ and Eclipse, with programming in Scala, Python, and so on.

One thing to consider when you contrast it with other services: if the all-in-one approach works for your scenarios, that's fine — I'd compare it to SQL Server on-premises, where if you already have the license and SSIS, Analysis Services, and the database work for you, go ahead and use them. The same applies here. But keep in mind that it may not be as robust as some of the purpose-built solutions. Take Spark SQL: SQL is not a first-class citizen in Spark the way it is in ADLA, because ADLA was built with SQL in mind from the start — although, being open source, development there is moving fast. Spark Streaming, compared with Stream Analytics, doesn't have the same robustness from an ease-of-use point of view. And for machine learning, it may not give you the same level of parallelism you'd get with R Server, which again is available on HDInsight. So decide whether you want the purpose-built tools or whether the all-in-one environment works fine for you. The great thing about Spark, as I just said, is interactivity — on small data sets it is highly, highly interactive; everything feels like it's happening in real time. For large data sets, because it relies on in-memory computing, you need SSD caching and so on; and, going back to old-school techniques, machine learning can use sampling — there is built-in sampling support out of the box for training data sets, so if you're struggling with very large data sets, sampling can help combat that.
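That interactive feel is easiest to see from a notebook. Here is a minimal PySpark sketch (placeholder path and columns) of registering a table and exploring it with Spark SQL from Jupyter or Zeppelin on an HDInsight Spark cluster:

```python
# Sketch: interactive, ad-hoc exploration with Spark SQL from a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-exploration").getOrCreate()

trades = spark.read.parquet("adl://mydatalakestore.azuredatalakestore.net/trusted/trades_daily/")
trades.createOrReplaceTempView("trades_daily")

# Ad-hoc Spark SQL; on modest data sizes results come back in seconds.
spark.sql("""
    SELECT symbol, AVG(total_value) AS avg_daily_value
    FROM trades_daily
    GROUP BY symbol
    ORDER BY avg_daily_value DESC
    LIMIT 10
""").show()
```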
So what does it look like when you put it all together? You have data coming in from on-premises and from different places; it goes into Azure Data Lake Store or Blob Storage; you do compute with HDInsight or Azure Data Lake Analytics; on top of that you do machine learning, which feeds the publish side — Azure SQL Database or SQL Data Warehouse — and that in turn feeds Power BI, so business users and data analysts can get the meaningful information they need.

Now let's add more complexity. This is the Lambda architecture we talked about at the beginning, with the ingestion layer, the speed layer, the batch layer, and the presentation layer. Now you have not just batch data but also streaming events coming in through Event Hubs or Kafka. All the data lands in Data Lake Store or Azure Blob Storage through the tools we discussed — ADF, or Event Hubs, or Kafka. You do batch compute using HDInsight or ADLA, with machine learning on top of that. Then there's the streaming layer, where data arrives at high throughput for IoT and other scenarios: you run tools like Stream Analytics or Spark Streaming to analyze that real-time data, add machine learning on top, and feed the data back into Azure Data Lake Store so you can do batch processing on that same data later. And from all these sources, the data flows into Azure SQL Database or SQL Data Warehouse, which you use as the backend for Power BI or Tableau reports, so the business users can make meaningful decisions from it.

What does this mean from a manageability perspective? You'll have seen this chart at a lot of conferences, but since we're talking about building a robust big data solution, the point I want to make is that you want to be toward the greener, right-hand side rather than the left. Take HDInsight: it sits at platform-as-a-service, versus ADLA, which is more like software-as-a-service or somewhere in between, versus Hadoop on-premises, which is the first, all-blue column. With Hadoop on-premises you have to buy the hardware, install Hadoop, manage it, administer it — the whole thing. With HDInsight, everything down to the runtime on the VM is taken care of: you pick the number of nodes and the cluster type, click, and the cluster starts in a few minutes; you don't worry about OS patching and so on — you only worry about your data and the application you're using it for. Azure Data Lake Analytics falls between PaaS and SaaS, or toward SaaS: you don't even manage a cluster — you're responsible only for the query, and ADLA runs it for you on the back end.

The last thing I want to point out is one more example, along the lines of the demo we saw. One of the retail stores has video sensors in the store. As customers walk in, data is captured through the video sensors, and machine learning APIs identify things like male or female and the number of people in the store, collecting some demographic data. Based on that, the store runs promotions: if there are 50 people, or 20 people in a certain age group, male versus female, here's the right promotion to run — it all happens on the back end through compute and machine learning, and the promotion then runs in the store. Similar to the Chili's example I talked about.
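Those scenarios boil down to calling a scoring API on each event. Purely as an illustration — the endpoint URL, key, and payload shape below are hypothetical, not a documented contract — the call looks something like this:

```python
# Purely illustrative sketch of calling a machine-learning scoring web service
# (for example, one published from Azure Machine Learning) on an event before
# it reaches the app. Endpoint, key, and payload shape are hypothetical.
import json
import requests

SCORING_URL = "https://example-region.services.azureml.net/score"  # hypothetical endpoint
API_KEY = "<scoring-api-key>"                                      # hypothetical key

event = {"customer_id": 1234, "visits_last_30d": 3, "sector_exposure": "automotive"}

resp = requests.post(
    SCORING_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    data=json.dumps({"Inputs": {"input1": [event]}}),  # payload shape is an assumption
    timeout=5)
resp.raise_for_status()
print(resp.json())  # e.g. a churn score or risk score to attach to the event
```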
So how does it all fit in? It goes back to the architecture we talked about at the beginning: your events are ingested, you have the bulk ingestion, the data goes into Data Lake Store, you have different options for compute — machine learning, ADLA, HDInsight — and for visualization you use SQL DB and Power BI, while making sure the data doesn't go dark and the paths are documented. So: we talked about the data lake; we talked about the Lambda solution and how to architect around the concepts it defines; and we talked about services that look similar and how to choose which one to use. Here are some IT Pro resources — fairly general pointers. With that, I have a minute for any questions. Thank you, and I appreciate you coming to listen to this talk. >> [APPLAUSE]
Info
Channel: Microsoft Tech Summit
Views: 5,201
Rating: 5 out of 5
Id: lNQRtNF2dcA
Length: 71min 34sec (4294 seconds)
Published: Tue Jan 24 2017