Cool. Good morning and
welcome to this session. My name is Bhanu Prakash and I am
a program manager in the big data product group that builds
Azure HDInsight, data lake, and other big data tools. Today I'm going to talk about how
to build robust big data solutions. And to me, this means two things. One is basically, why you want
to build a big data solution. What are the business problems
you want to solve using that? And then secondly,
how you want to build a robust architecture of the big
data end-to-end solution. So, I'll focus more on the latter
side which is the architecture. But I'll also start with
the business use cases of why you want to build a robust big
data solution, because I think it's always important to understand
what are the business problems you are going to solve by
a big data solution. So let's get started. So we'll start with understanding
what is a data lake and how it fits in your data strategy. We'll talk about architecting
the Lambda solution for big data. How many of you here have heard
of the Lambda architecture? Okay, quite a few, I would say about a tenth. So, I will
give some time to define that. And then thirdly, oftentimes what we hear from
the field or customers is, hey, you have got three services
which almost do the same thing. And which one should I use? So I will talk about the different services that are available and which one you should use for which scenario. So let's talk about
the evolution of big data. So originally, big data started as a solution
to solve large batch problems. So the things to do,
like ETL, do it overnight. Do processing of petabytes
of data every night. And then get the results and then do the same thing like over
a regular interval of time. But the reality is people started to
use and aspire to use big data in a way as they would to any of
their on-premises solutions. So more like real time solutions
doing sentiment analysis and so on. And then when you had batch and
real time, it evolved to doing more like
predictive analytics, so that business users could not only
do batch processing of the data but also real time processing
at the same time, and use some of the tools like R or other technologies to do predictive analytics with machine learning. So before going into
the architecture, as I said I wanted to start with
some of the business use cases and what people are doing in real life. Let's start with some
of the examples. So Virginia Tech was using petabytes of data to do genome sequencing. And what happened was, when they were doing this on-premises, it used to take them two weeks to sequence one genome. And when they moved to Azure, with the bursting capability of HDInsight, they were able to do 20 genome sequences per day. Not only that, with that they
were able to scale their compute nodes when they wanted, when
the processing requirement was huge, and scale it down or
shut down the cluster. And that helped them to save cost. So, if your requirement is to use 20 nodes on day one, scale it from 10 to 20 nodes,
and the next day, or on Sunday, you don't want to do any compute,
just shut down the cluster and then restart the next day. So that helped them
save a lot of money. Other companies have been using or plan to use big data more
from a BI point of view. In this case, Carnegie Mellon was using big data solutions to
save or reduce the energy cost. A lot of real estate owners, things like universities, malls, strip malls, and large companies like Microsoft, have also been thinking about how they can use big data to reduce energy costs. So in this case, what Carnegie
Mellon was able to do just by using SQL, Power BI, and HDInsight, they
are able to save 30% of the cost. And that's without even streaming or
machine learning. And all of these companies have been thinking that if they could save even 10 or 20%, that would be a significant cost saving. Let's talk about the companies that
have been using big data solution to streaming. I talked about batch,
I talked about BI and then I'm talking of the streaming
part, which is more like real time. So if you go to some
of the restaurants, what you'll see is,
on the table there is this tablet. And you start browsing
the tablet to play games or go to the menus to find
appetizers or desserts and so on. And what's happening in
the background is that data that you are seeing is being streamed back
to the compute engine in real time. So that machine learning
APIs can understand or assign a rank or score. And then send that data back to
your tablet to come up with some promotional offers. So let's say you are looking
at the dessert menu, you order the main clause. You are looking at the dessert menu
and the kids are looking at it. It will send you a promotion offer
that will say, hey for $20 dessert, for all your family or
something like that. And that's all powered by
the machine learning APIs. And through that,
companies like Chili's, they have able to improve
the customer satisfaction. So, this is one way to
sort of use streaming and machine learning to improve
the customer satisfaction. So, what would an aspirational demo
look like when you combine all the pieces, the batch, the real-time streaming, the machine learning? What would it look like as an end-to-end solution? So we have a demo that I wanted
to show before going into the architecture to see the
possibilities that big data tools can enable. So let me just. So Woodgrove is a financial company,
and I'm going to talk
about two personas. One is Jeff,
who is a financial advisor. And then we'll talk about his boss, Rob, who manages a portfolio of financial advisors who report to him. So in this case, Jeff will log on
to the app and then see how his clients are doing, get some real
time information through news, through alerts and so on. And then make some business
decisions based on that. So Jeff is out for lunch. And then he gets an alert that Contoso's earnings dipped due to irregularities found in a carbon dioxide test. So he wants to know, because of this news, are there any clients
that are impacted by this news? So he logs on to the app, and
then in the app, you'll see there is this real-time stock data that is coming in through Event Hubs. And then, there are these news
videos and then the stocks. And he also sees different sectors. So since Contoso, in this case,
is automotive-sector company, he tries to see which are all
the clients that are impacted or have investments in
this automotive sector. And then he sees Ben Andrews at the top, who has invested and who seems to be negatively impacted based on the investments that he has made. So, he clicks on his profile,
and then he sees the CRM data about his age, his phone
number, the last meeting, the last call. And he starts seeing all the data, his investments and some of the KPIs in terms of performance versus target, the churn score, the risk score. So all the things that
you see on the right, the churn score and the risk score. These are based on machine learning
and then all of the things that you are seeing, these are the real time
streaming data that's coming in. And then these are the CRM
data that I talked about. So, he clicks on the churn score and then this goes and
opens up a Power BI report. In this report, he wants to dig in and analyze what's going on. And this BI report is being fed by a SQL Database or a SQL Data Warehouse. And then he tries to,
he can see like the Churn Analytics, the Share Volume,
Real-time Stock Trend, and so on. So, when he drills down into the total transactions, what he sees is that Ben is the guy with the maximum share of the total transactions among all his clients. Things like that are an indication that Ben may be moving away from being his client because of all the news that he has seen. So, he goes and clicks on the customer churn chart and makes a note that he will talk to his manager and then discuss with his client to make sure he can retain Ben as a client. He clicks OK and then goes back. And now he sees that Ben is
marked with a flag in both places, to make sure he remembers to talk to him whenever he logs into the app through his phone or the web app. So this was one of the personas. Let's talk about the other persona. So Rob, his manager,
signs in, and he gets the data which is important to him. So here's the streaming data, the stock quotes; he sees all those things like the top performers, the bottom performers, and at the same time he wants to make sure that the system is compliant. And all the alerts,
that he's seeing is, again, powered by
the machine learning APIs. The machine learning algorithms run in the background, and there's something called suspicious trading. So he clicks on that; he wants to make sure his direct reports and the clients that they have are not doing anything illegal. He goes to the Power BI report, and he looks at suspicious trades,
individual stocks and so on. On this he sees a chart. And then let's say he selects Michael Alan for this particular stock. And it shows that Michael has done a lot of high-value transactions just before the stock went to its peak. So, that was sort of a clear indication that maybe he had insider information, and that is the reason he did a lot of transactions. So he goes ahead, clicks on
Alert Compliance, to make sure his compliance team investigates, to
make sure the system is compliant. Then he goes back. And this was sort of
an example where we talked about how a system, a big data
solution can be built in a way that it has everything built into the application. You don't have to go to different things; everything is built into the business application that your business users are already using. So let's go back to
the architecture. So let's talk about how we want
to build this in the right way, with the right architecture,
with the right tools, that will enable a solution similar
to the demo that you just saw. So oftentimes when I get on a call with customers or our field, the questions I get fall into three categories. The first one is, hey, I'm not sure if my architecture is correct, can you take a look? And part of the reason for that is they are not sure if they have to use Event Hubs or Kafka, SQL DB or SQL Data Warehouse, Data Lake Analytics versus HDInsight, Stream Analytics or Spark Streaming. And that leads to the second category of questions, where I talk about going through all these different services and which one you should use. And then thirdly, there are situations where a customer goes off the knowledge or information they have. They start building their architecture and testing it, and just before production they come back and say, hey, I built the solution and we are going live in production in two or three weeks, can you please take a look at whether my architecture is right or not? And so, we'll walk through
how we want to answer these different questions and how that helps to define what architecture you want to build. And frankly speaking, these questions help us define the core principles for building a big data architecture within Azure. So the first one is decouple the data bus, and we'll talk in detail about it. The second one is to build the Lambda architecture for a robust big data solution. The third is choosing the right tools and services. And the fourth is to understand the importance of managed services, whether you want to use IaaS or PaaS or SaaS. So we'll talk about that. So, decouple the data bus. Let's talk about it.
In the on-premises world, if you look at people using Hadoop, the steps involved would be to first scope the cluster node size, whether you need two nodes, ten nodes, or so on. And once you have that, you get the open source bits or the bits from Cloudera. And then you build it,
you configure it. You have to make sure all
the services are up, running. You have to then administer it,
monitor it continuously. If things go wrong,
you have to make sure the VMs and the nodes are up and running. And then you later deliver data and
then queries. A user add queries to take
meaningful information. But this, to again build it to get
the hardware and also getting it approved from your company it would
take like a couple of months or so. And once you are at this stage,
if something goes wrong you have go back to stage one just to meet
the size, how many notes you need. And then do all the steps. So let's say you start
with two nodes and at this point the storage and
compute isn't the same thing. So there is zero CPU utilization,
zero storage consumed. And then after six months, you'll see that the storage, because data has been continuously added, is almost full. Compute is at about 25% on average, so you have compute headroom even though there are a lot of queries, but the storage is almost 90% filled. So you think, let me add more nodes, so you add two more nodes. And then storage is down to around 40% or 50%, and compute stays around the same lines. So now you're doing fine, and then you start having more users and more queries. So on day 400, compute maxes out at 100% peak and 75% average, but the storage is fine; storage is not the bottleneck. And then you think, okay,
I need to add more nodes, because I can't do
any more processing. So you go ahead and
then double it again. And now your storage is also sparsely used, and your compute is also fine. But this was just to illustrate the fact that because storage and compute are in the same boxes, you can't scale them independently. You have to add both parts even though only one of them is the bottleneck. So, what this means is in a typical
big data pipeline when I say decouple the data-bus, you have
to have a separate store and a separate compute. We'll talk about some of the storage solutions. What you saw in the previous example was that when storage was the bottleneck, you were adding storage and compute together, because they were part of the same box. When compute was the bottleneck, you were again adding storage and compute together. With this sort of architecture, where storage and compute are separate, you can scale them up or down independently. And so we'll talk about
how the data is ingested, through bulk ingestion and
the event ingestion. And then of course, it has to be
consumed by the business users. In this case, note that discovery and visualization are different. By discovery, what I mean is basically how you enable the business users to access the data, by making sure the paths are documented or by using tools like Informatica or Data Catalog. And then visualization is where you enable the business users to run interactive reports, just like the demo that we saw, where Jeff and Rob went to the Power BI report and, based on real-time and machine learning information, were able to make decisions. So visualization is powered through Power BI, Informatica, or Tableau. So let's look at the product
examples of how this would fit in. For bulk ingestion, we'll talk a little bit about Azure Data Factory and how it helps in moving the data, and we'll spend a fair amount of time talking about Azure Event Hubs and Kafka. And then for the storage, we'll go in depth talking
about the Data Lake Store and then we'll talk about the compute
using different applications. We'll talk about HDInsight, we'll talk about Data Lake Analytics, we'll talk about Stream Analytics and Machine Learning. And then for visualization we'll
talk about Power BI and SQL DB and SQL Data Warehouse and how and
which one you should use. So Data Lake Store. Data Lake Store is
generally available. Until last year it was in public preview; I think in November we made it generally available. It's available in a couple
of regions in the US. So, basically, Data Lake Store is something that we built from the ground up, based on the learnings that we had from the internal storage that powers a lot of internal applications within Microsoft. There are no limits to scale. What I mean by that is you could
literally store petabyte file size and millions of files in
the same storage account. The throughput is good. You could store data in
its native format, and I'll talk about that
in a little bit. It has got all the enterprise-grade
access controls, the role-based access controls. You can set up a firewall to
make sure a defined set of IPs could only access
the Data Lake Store. There is encryption at rest. And when I talk about scalability, you'll see a performance difference, especially when the size of your data is big, compared to any other similar tool that does the same job. So, Azure HDInsight. How many of you
here know about Azure HDInsight? Okay, a few. So, Azure HDInsight is a PaaS kind of service for open source Hadoop and Spark. What I mean by that is, in the example I was showing, you have to purchase hardware and do everything, install the services and so on. In this case, what you have to do is just go into the Azure portal and select the type of cluster that you want: whether it's Hive; Spark, which we added, I think, last year; Kafka, which is in public preview; Storm and so on; or R Server. So you select the type
of the cluster, give some inputs like how
many worker nodes you want, what type of VM you need,
D series or A series and so on. And within a matter of few minutes
you get your cluster up and running. And so if your needs change, you
want to use it let's say just for batch processing, run it and
then delete it, and then recreate it when you want it. If you want to use it for real time, let it run with the right amount of nodes. If you want to scale it up or down, you can do that just by going to the slider and setting the right number of nodes. It has support for Visual Studio, IntelliJ, and all sorts of your existing development environments. We are constantly evolving;
HDInsight has been generally available for a couple of years, and we recently shipped secure Hadoop. So it has got all the enterprise features like monitoring, and it also supports Ranger. It has Active Directory authentication integration, so you could have the right set of security built into the Azure HDInsight cluster. And the fact that you don't have to
pay for the hardware, you are only paying for the compute and then for
the storage you are paying separately for whether you are using
Azure Blob Storage or Data Lake. You can save your total cost of
the ownership, because you will be paying only for the time of
the compute that you are using. If you want to have just two nodes,
you are just paying for two nodes and not ten or the number
of nodes that you started from. So I talked about the fact that we also added R Server, which has like 1,000x more data capacity and 50x more performance compared to open source R. And Kafka is in public preview, so we'll be going GA with Kafka soon. And it has support for the right
set of BI tools that you need. So, Data Lake Analytics. I was talking about HDInsight, which is more like a PaaS service where you don't have to take care of the VMs, no patching and so on. Azure Data Lake Analytics is something more of a SaaS, I would say, or more like a job service. In this case, you don't have to even spin up a cluster. What you have to do is literally go to the Azure portal or go through Visual Studio. And then you write your query using a SQL-like language with a mash-up of C#. And then you deploy that query, and we take care of running it on a shared cluster which is managed for you. And then based on how much parallelism you want, you choose that level of parallelism, like four vertices or ten vertices, and that will define how fast your query will be executed. And then you pay for the time
the query runs and the level of parallelism that you have chosen. The good thing about both Data Lake Analytics as well as Azure HDInsight is that we provide the best SLA in the industry, which means it's not only your VM instances but the end-to-end cluster that we make sure is up and running 99.9% of the time. So if your VM goes down or
if service goes down, we make sure it's up and running. And we have an on-call team,
to help support that. In fact,
if you're familiar with Hadoop, there is something called the NameNode. In traditional Hadoop, the default is one NameNode. But in HDInsight we have made it highly available, so if one goes down it fails over to the other one. And so, talking about Data Lake Analytics, it has role-based access control. It federates across
the different data sources. So what I mean by that is
when you run your query in Data Lake Analytics, you don't
have to copy data from database to Data Lake Storage, or Data Warehouse
to Data Lake Storage or somewhere. You run the query on
the Data Lake Analytics cluster and then we take care of integrating
across different sources, augmenting the data, and there aren't any duplicate copies of the data, because that is something you wouldn't want to do. So, let's talk about some of
the user scenarios here. Even before big data came in, some of you might be familiar with one of the classic scenarios for Hadoop and other tools like HDInsight or ADLA, and that is doing ETL. So what I have seen is, some of the customers
are using ETL at the sources where data come in before feeding that
data to SQL Data Warehouse. My recommendation would be to stop doing that, because you could use the parallelism offered by HDInsight and Data Lake Analytics to do the ETL right there, instead of doing it at the sources and then using the Data Warehouse and database to do the ETL. Because if you are using Hadoop to run your job on 10, 20, or 100 nodes, you can use that same distributed, parallel computing power to do your ETL jobs. What I mean by that is, you load your data into Data Lake Store or Blob storage, let HDInsight or Data Lake Analytics do the ETL, and then you store it back to Data Lake Store or other storage to do machine learning and other processing on it. You don't have to strain your existing data warehouse or other tools to do that for you.
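To make that pattern concrete, here is a minimal, hypothetical sketch of lake-side ETL using PySpark, the kind of job you could run on an HDInsight Spark cluster (or express as a Data Lake Analytics job instead). The storage account, paths, and column names are made-up placeholders, not anything from the demo.

```python
# Minimal lake-side ETL sketch (hypothetical paths and columns).
# The cluster's parallelism does the heavy lifting instead of the data warehouse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-etl").getOrCreate()

# Raw zone: data lands here as-is from the sources.
raw = spark.read.csv(
    "adl://mylake.azuredatalakestore.net/raw/sales/",
    header=True, inferSchema=True)

# Cleansing and shaping run distributed across the worker nodes.
curated = (raw
           .dropDuplicates()
           .filter(F.col("amount") > 0)
           .withColumn("sale_date", F.to_date("sale_timestamp")))

# Curated zone: columnar output, ready for the warehouse, BI, or machine learning.
curated.write.mode("overwrite").parquet(
    "adl://mylake.azuredatalakestore.net/curated/sales/")
```

Only the curated, structured result would then be pushed into SQL DB or SQL Data Warehouse to serve the presentation layer.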
So, let me tell you what it could mean if you were to move from on-premises to Azure, and what a
simple architecture might look like. So here, you load the data into
Azure Data Lake Store, and then you have your compute. Let's say you have added two HDInsight clusters, or you're doing processing through ADLA, Azure Data Lake Analytics. And then through ADF you are pushing that data to SQL DB, which drives Power BI, which is to say it is the source of the data for Power BI. Now what happens is your
storage needs scale, so your data grows and Data Lake Store grows, and at the same time your compute needs also grow, so Azure Data Lake Analytics and HDInsight also grow. And then, one other thing you will see is that in the past you were using SQL Database, and now you are using SQL Data Warehouse. And I'll talk about which one to use, but you move to SQL Data Warehouse when your storage needs grow beyond something like 1 terabyte. And the good thing about this architecture is that each piece of the puzzle can be swapped with another equivalent service that supports the more scalable scenario, without impacting the rest of the system. So in this case,
you're allowed to move from SQL DB to SQL Data Warehouse, and
Power BI which was working with SQL DB now works with SQL Data
Warehouse without any issue. The same for Data Lake Store. It was feeding data to SQL DB and now it's feeding data to SQL Data
Warehouse and it works fine. And then, you added more nodes, so you're able to scale your compute independently. You're able to scale your
storage independently. You're able to scale
your presentation layer, which is SQL DB to
SQL Data Warehouse, and that is rule number one of
building a robust solution. That is, scale your components without impacting
the rest of the system. And then, in the near future, we will be adding support for PolyBase, which will allow you to parallelize the feeding of data into the data warehouse. That will make sure it gets into the data warehouse much faster than it would with Azure Data Factory, and that way your business users don't have to click and then wait for two or five minutes; it will all be much closer to real time and much faster. So, Lambda architecture. I talked about the fact
that we'll talk about this. It was first described, I think, five or six years back. What it says is, you load your
data, you do batch processing, and then you do real-time
processing in the same environment. So you are getting the best of both
worlds in the same environment. You don't have to do batch there and
then realtime here. You're all doing it
in the same solution. So how do you go about building
architecture lambda architecture? So generally speaking,
it's in four layers. You start with the data consumption which is the bulk ingestion or
the event ingestion. And then, you go through the preparation and analysis layers, which is more like transformation, extraction, and all that. It has a speed layer and it has a batch layer, and both are equally important. This is sort of the compute part of it, compute as well as the storage,
which are all separated. In the batch,
what I was saying earlier about using your ADLA or HDInsight to do batch processing, that is what the batch layer refers to. You use HDInsight and do processing of your queries; let's say it takes one hour to run the queries, you can just scale up your nodes and do the same query in, like, 50 minutes. And compared to the on-premises world it will feel like real time, but in Azure, or in the cloud world, it will be, I would say, real-ish time. And then,
on the top which is the speed layer, when you talk about IoT scenarios
where the data is coming at a very high rate, you add the speed layer. So the tools that will help process
and store the data at that rate. And you combine both speed and batch layer just like the example
of the demo that we saw. And then, at the right is the presentation, which is how you consume the data through Power BI, Tableau, and other tools, and you make sure a tool like SQL Data Warehouse is sort of the back-end data for the presentation. So let's talk about some of
the concepts of the requirements for the Lambda architecture. Scalability. We talked about this sort
of different examples, so we talked about the fact that
you can scale your HDInsight nodes from 10 nodes to 20 nodes
just by going through a slider, by doing a manual resize and so on. Similarly, you are able to do it for your storage, or for Azure Data Lake Analytics, just by going to Visual Studio or the portal and changing the slider, adding more vertices or lowering the number of vertices, and so on. So basically, each piece or each component should scale up or scale down, and that helps
you to save and manage cost. Resiliency is basically the ability for something to come back if something breaks, and both Hadoop and the other components have resiliency built into them. For HDInsight, especially as compared to plain Hadoop, it has more resiliency built in: the fact that we have made the NameNode highly available, and there is automatic failover that happens. The VM instances will be up and running 99.9% of the time, and that's not just for a VM but for the whole end-to-end cluster. And so, it is very important to make
sure you have a resilient service that supports this architecture. When you talk about latency, especially in the case of Kafka, the data has to come in at a very fast rate and the system has to support low latency for that. Extensible: by that, what I mean is the fact that you are able to swap the database with a data warehouse, or change from, like, HDI to ADLA and so on. So the system should be able to swap in the right set of services or tools for the purposes that you need, without affecting the end-to-end scenario. And then, of course, it has to be maintainable, in the sense that it has to have the ability to upgrade, or to apply the patches to the VMs. And then, you have to make sure that
you are choosing the right set of tools to do that. For example, let's say you have the option to run Hadoop in an IaaS manner, so you go and use the IaaS VMs and then install Hadoop on them; the other option is you directly go ahead and use the HDInsight service. With HDInsight, the OS patching and all, making sure it's available 99.9% and all that, is already taken care of. But with IaaS you have to do some of those things yourself to make sure it gets to that level, and if you are going fully on-premises, you have to do literally everything. So you have to decide what level of manageability you need and how you want to maintain it, so the right set of tools
needs to be there. So let's start with the first
part, which is data consumption. So we talked about the fact that it's basically the first layer, where the data is ingested into the system. So let's talk about the event ingestion. You have data at a very high rate that's coming through business apps, or web apps, or your custom apps, devices, sensors; a lot of IoT scenarios are being enabled these days. So the data is basically
coming at a very high rate. And then, so here,
you have two major options. One is Azure Event Hubs. And it is a really simple-to-use
service that allows you for getting the events
from different sources. And then, Kafka which is the open
source Kafka which is something that you can manage as a past service and
the fact that it's in public figure so you can use either of this
services to inject the data. And then, for processing or doing analytics on that,
we have three options. The first is Stream Analytics, and we'll talk about all of these; it is again sort of a job service, a SaaS service, through which you can do processing of the data when it comes in at a very high rate. And then you have open source Storm, and then Spark Streaming, which we shall talk about. And then once the data is processed, it goes to Power BI for consumption. And then, you also want to make sure that the data also lands in ADLS for future machine learning or other processing. So let's talk about
ingestion technology. So it looks like there are two
options, Kafka and Event Hubs. You come across different tables as we compare and contrast the services, and I [INAUDIBLE] that shows the different shading factors. In this case, the reason I have made Azure Event Hubs [INAUDIBLE] compared to Kafka is the fact that with Event Hubs you don't have to manage a cluster; as a PaaS service, you don't have to go ahead and create a cluster and so on. In Kafka, you have to do that. But Event Hubs is a managed service, so it's a very simple-to-use service. You are looking at a throughput of 20 throughput units, the cost is low, you pay for the usage of that stream, and the record size is 256 KB. Both of these are on par in
the fact that they can process millions of events per second. But if you are looking for something with very highly scalable needs that can process multiple billions of events per day, especially connected-car scenarios, you go with Kafka. But if you are looking
to just start with, you start with Events Hub
which is pretty simple. But when your needs change and you're looking for processing of events at a larger scale, then you should think about using Kafka. So we talked about the ingestion,
let's talk about the other components of
the streaming analytics. So we talked about the fact that the data is getting ingested into Event Hubs or Kafka, and now it goes to stream processing. There are three options for this: Stream Analytics, Storm, and Spark Streaming. Stream Analytics is like ADLA; well, ADLA is more of a batch thing, and this is more for real-time streaming. So you do processing using a SaaS-based sort of service. It's very simple and easy to use, you get things like windowing aggregations out of the box, and you don't have to have a cluster up and running like in the case of Storm and Spark. If you're going to just start
using any of these stream processing options, start with ASA, which is pretty simple to use, and you can always move to Spark Streaming or Storm. One of the things that I would
point out in the case of Spark and Storm is the fact that these
are open source technologies, so the ecosystem is pretty huge. And what I mean by that is, if you have any questions or anything like that, you can basically find answers on Stack Overflow and similar forums, versus asking the questions yourself and waiting a while to get answers. And then, with Spark and Storm, I would also say you would think about using them for highly scalable scenarios. You can manage the costs and all that, but from an ease-of-use perspective, you should start with something like Stream Analytics.
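As a rough illustration of the Spark option, here is a hedged sketch using Spark's Structured Streaming API to read from a Kafka topic and do a simple one-minute windowed count; the broker address, topic, and sink are placeholders, and it assumes the Kafka connector package is available on the cluster. Stream Analytics would express the same windowing in its own SQL-like query language instead.

```python
# Hedged sketch: windowed aggregation over a Kafka stream with Spark Structured Streaming.
# Broker, topic, and sink are hypothetical; requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Events arrive continuously from the ingestion layer (Kafka in this sketch).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "stock-ticks")
          .load())

# Count events per one-minute window; a real job would parse the payload first.
ticks = events.selectExpr("CAST(value AS STRING) AS raw", "timestamp")
counts = ticks.groupBy(F.window("timestamp", "1 minute")).count()

# Write the rolling results out; in a pipeline the sink could be the lake or a database.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```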
So how do Event Hubs and Stream Analytics handle the scalability scenarios? What Event Hubs does is,
when the events are coming in, it partitions them by PartitionId. And then when Stream Analytics consumes them on the egress side, you can partition the query by PartitionId. That way, instead of running one single query, you are essentially running multiple queries in parallel, and the throughput increases. Now, you may want to use some machine learning and
streaming at the same time. Which is good, just like the example
where we saw like streaming happening and
then machine learning and so on. But one of the caveats, or the downside, is that the performance might go down, because now your data is being streamed and processed by Stream Analytics, and before it goes to the Data Lake Store or the application, you are using ML APIs to assign a rank or a score, and the ML has to process it. So it can decrease the rate at which the events are being ingested into the Data Lake or the application by two or three times. And the recommendation there is,
if you really want to use it, then take advantage of the fact that you can partition by PartitionId and have multiple queries. Or the other option is that you let it go to the Azure Data Lake Store, do the machine learning in a batched manner, and then let it feed into the application. So let's talk about the data store. Data Lake Store, we sort of touched on this: the fact
that there are no limits to scale. You can have petabyte file size, millions of files in
the same storage account. It can be there for
as long as you want. The right set of security is built into it. And it seamlessly scales from kilobytes to petabytes. So some of the technical
requirements that were chosen while designing Data Lake Store: as I said, it was built from some of the learnings that we had in-house. There's security: it has the right level of enterprise security built into it, so it prevents unauthorized access, makes sure the right set of IPs have access to it, and it has file- and folder-level access control. It's highly scalable and it supports real time, batch, and machine learning; it's like the sole source for all the different types of processing that you would need. It supports low latency, and
reliability is built into it. So, I talked about the fact that we'd talk about the data types that can be stored in the Data Lake, so here is sort of a perspective on that. When you talk about big data, you have to let the data come in as it is coming from the different sources. You don't want to change the structure and all that, because you want to have all the data that comes in, in whatever format it's in. So start with the unstructured: you're storing the logs, you're storing the pictures, and all that. It also has support for semi-structured data, basically JSON, XML, Avro, and different schema formats. And then structured data, basically CSV and other structured formats. So when you are using Hive, one of the recommendations is to
save the data in a columnar format. One of the reasons I say that is, when you are talking about big data, the tables are so wide that if you are using a row-based format, you would have to scan the whole row. Even if you are doing something like select column A, column B, and so on, because you're scanning such a wide row, it takes a lot of time. So imagine having to scan so many wide rows versus just scanning the two columns that you're interested in. That's why the recommendation is to save the data in a columnar format. The downside is, definitely, you'll
have to convert the data into that format. But from a query perspective, it's pretty fast once you have that. And there are two formats
available for that. One is the ORCFile format that Microsoft built in conjunction with Hortonworks and the open source community. It is best for Hive, and it allows all the ACID properties: insert, update, delete. There is also the Parquet format, which you could use for mixed Hive and Spark workloads.
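As an illustration of that recommendation, here is a small hypothetical sketch that converts row-oriented raw files into ORC and Parquet with Spark; the paths and column names are placeholders.

```python
# Hypothetical sketch: convert row-oriented raw files into columnar formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-columnar").getOrCreate()
df = spark.read.csv(
    "adl://mylake.azuredatalakestore.net/raw/events/",
    header=True, inferSchema=True)

# ORC suits Hive (including ACID-style tables); Parquet suits mixed Hive and Spark workloads.
df.write.mode("overwrite").orc("adl://mylake.azuredatalakestore.net/curated/events_orc/")
df.write.mode("overwrite").parquet("adl://mylake.azuredatalakestore.net/curated/events_parquet/")

# A query over two columns now reads just those column stripes, not whole rows.
spark.read.parquet("adl://mylake.azuredatalakestore.net/curated/events_parquet/") \
    .select("col_a", "col_b").show()
```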
So let's talk about choosing between Azure Data Lake Store and Blob Storage. One thing about Azure Data Lake Store is that it is not
available in all the regions. So if you are someone in let's say,
Europe, or Asia Pacific, the option available
is to use Azure Blob Storage. And if you are starting in the US, there are some regions which have Azure Data Lake Store. The thing is, both of them
are generally available, but we have hundreds of customers
who are using Blob storage. And so if you are looking for
something like petabyte file size and that many number of files, you could have like multiple Blob
storage accounts to deal with that. The good thing is that for compute tools like HDInsight or Azure Data Lake Analytics, the performance is not impacted even if you have multiple storage accounts. So you can have multiple storage accounts if your scalability needs are that high. But if not, one or two accounts would basically serve your purpose. The other thing to talk about
here is the fact that Data Lake Store is the storage where we expect most of the customers to be in the future. Right now, most of the customers are using Blob storage, but we expect, over a period of time, to move them to Azure Data Lake Store because of all the scalability benefits and all that. Blob will have some bandwidth limits and so on, but at the same time, if you are using multiple storage accounts, the performance, from a compute sort of perspective,
would not suffer. So, oftentimes, we talk with customers about how you want to store data for enterprise robustness. And one of the recommendations is that you store the data by organizational folder structure and application folder structure. And so one of the questions that customers ask is, hey, I'm using this Azure Data Lake Store account, and I have multiple business units that want to store data, my HR, e-commerce, and finance departments. How do I go about billing them separately? I don't have a better answer to that other than saying, hey, go ahead and use different storage accounts. In that way you make sure different business units have different accounts, like HR has its own storage account and e-commerce has its own storage account, and at the same time you are able
to bill them separately. And as I talked about, it does not impact the performance no matter how many storage accounts you have; you can still follow the same approach to the data. The other thing is, of course, that
you have different folder structures for applications and
business events and so on. You want to make sure the data, at a granular level, is partitioned based on the year, the month, and the day or time, and decide what level of partitioning you want to go down to. The prime reason for that is that when you are running queries, like Hive queries, you only scan the files and the folders that you are really interested in, versus scanning the whole thing. So that helps you have the right structure for the data, and at the same time it helps performance.
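As a rough sketch of that layout, assuming a hypothetical date column and paths, Spark can write the curated data into year/month/day folders so that a query with a date filter only touches the matching folders:

```python
# Hypothetical sketch: write data partitioned into year/month/day folders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-layout").getOrCreate()
sales = spark.read.parquet("adl://mylake.azuredatalakestore.net/curated/sales/")

(sales
 .withColumn("year", F.year("sale_date"))
 .withColumn("month", F.month("sale_date"))
 .withColumn("day", F.dayofmonth("sale_date"))
 .write.mode("overwrite")
 .partitionBy("year", "month", "day")
 .parquet("adl://mylake.azuredatalakestore.net/curated/sales_by_day/"))

# A filter on the partition columns prunes whole folders instead of scanning everything.
spark.read.parquet("adl://mylake.azuredatalakestore.net/curated/sales_by_day/") \
    .filter("year = 2017 AND month = 3").count()
```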
The other thing I talked about in the presentation is data discovery. For business users who are not so
tech savvy and want to have access to
the right sort of data. Always make sure the paths are documented correctly. I've seen examples where customers have used acronyms or crazy names, and if there are new employees in the company who are not used to those acronyms, they'll have a hard time. And that's what you call dark data, data that is not being used. So even though you have the option to use tools like Informatica, Data Discovery, Data Catalog and all, the smart way of doing it is to make sure it's properly documented. And retrofitting after
the fact is difficult. So when I'm talking about making
sure you have the right set of structures, for
files, folders, and so on, you want to build
it right the first time. If you don't do it the first time,
you can do it later as well. But you have to give enough time,
and it is a cost for you. So you want to make sure
it's correct the first time. So big data storage for enterprises. So you want to make sure that you
have the right level of isolation for the data for
the different needs. For example,
you have to have an isolated zone for the data which is raw. So for all the raw data that comes in, there are no write permissions given to developers or business users and so on. It is just a raw area for the data
to come in from different sources. Then you have the staging where it's
copied to a staging environment for developers and test. They are using it for
test and dev environment. And once the data is vetted and trusted, it goes into production in the trusted area, where the data is only used for production purposes. And then you have a sandbox
environment where you have the data which is copied
to the sandbox environment for like data scientist and
business users. So they are doing hit-and-trial in different ways to create the right sort of BI use cases. And once they find that one out of a hundred, they move it back to staging, so it can be pushed
to the trusted or vetted region to be able to run
in the production environment. So these are some of the access characteristics for the different classifications of data, like hot, warm, and cold, across different dimensions. The reason I wanted to show it is so you know that it's not just volume or item size or any one of these that defines whether data is hot, warm, or cold. It may be that it takes a long time to get the data and you're fine with that, and that's like cold data. So it's the volume, the size, the latency, and different factors that help to categorize the data. And then what I've done is
I have put those different characteristics on a Venn diagram. And the reason for doing it is that you want to make sure you are using the right storage based on the characteristics that you need. You don't want to put all the data into one place just because you happen to have a Blob storage account. See what the scenario is and put the data in the right storage. That may mean you have multiple stores, but that also means you are building the right, robust solution that will serve the right
reason at the cheapest cost. So let's say in this case, if you're
looking for something, let's say, high in structure, you would go
with SQL DB or SQL Data Warehouse. But if you are looking for
something with very low latency, then you would think about using a cache or a NoSQL database. And if you are looking at something like cold data that is low in structure, you would start with something like HDFS or Azure Data Lake Store. So, to categorize this a little further in terms of tools: if you are looking for structure and a relational database, start with the Data Warehouse. For complex event processing, you use Stream Analytics and all that; we'll talk about it soon. And then for non-relational, you use a NoSQL database. And then for the Data Warehouse, so
let's say if your data size needs are less than one terabyte, you use SQL DB, and if you are using it for a highly scalable scenario, you want to use SQL Data Warehouse. And then we talked about this, so
I don't want to go into it again. But one of the things I want to
point out, with Spark Streaming, you are basically using it for the
fact that it has very low latency. And if you are already using Spark,
it's like a one in all solution. And I'll share, go little bit deep
into it in the next 15 minutes. So this is something from
a processing point of view like different tools. So if you're looking for
high SQL compatibility, it's SQL Data Warehouse versus
like a high query latency. It would be like MapReduce
in the case if Insight or a low query latency it will be
like as Insight with Spark. So, in terms of the big
data pipelines, one of the things i would tell is,
customers will always go and think about hey,
let me build this 40 node cluster. And I would use it for like ETL and the ETL processing will run
overnight for eight hours. And then in the daytime I would
use that same cluster for doing user processing. And so they come back and say, hey,
I'm looking for this use case. What is the right number of
nodes that I would use for these scenarios? And the right answer is,
why just one cluster? Why do you want to spend
like 40 node cluster for two different scenarios,
when you could serve the same purpose by using two
different cluster types? And what I mean by that is, your ETL will be done with 40 nodes, but then you are literally keeping those same 40 nodes for the rest of the day. And remember, even when the compute nodes are running but you are not doing any processing, you are paying for them. So why would you want to use that 40-node cluster when, for the user queries, you may end up using only four nodes or eight nodes for the rest of the day? So the recommendation for that case would be: use 40 nodes for ETL during the night, run it for the time that you need, then shut it down. And then for user queries you just have the eight
nodes that you want to use. The fact is, when you shut down the 40-node cluster you are basically not paying for it, and you have the eight nodes just for the user queries that you want to use them for. And in that way you are paying only for the usage that you are consuming in its different forms. The good thing is both can
access the same data. So it's not like the 40-node ETL cluster has to have one copy of the data and the user query cluster has to have another copy; they can both access the same data. And in that way, it can help serve both scenarios. So, now we talked about
all the different things. Let's talk about interaction like
how the user, business users, will consume that data. So, on 2017 and one of the predictions from Gartner was
that by 2017, a lot of the user will like to use the interactive layer
using a self service tool and so on. And so
you don't always see it happening. There are two options for that. SQL Database and SQL Data Warehouse, that would serve as
the backend data for Power BI. What I mean by that is, when you are looking at using these two options, SQL DB, as I said, is more for when you're looking at data less than one terabyte. SQL Data Warehouse also has all the properties that you saw for HDInsight or ADLA compute: the fact that you can scale it up, scale it down, you can pause it in seconds, and you can scale the storage and the compute independently. And all these characteristics
are part of the SQL data warehouse So you could run like T-SQL queries
that you are already familiar with. And the fact that it has integration with PolyBase, which will be coming for ADLS in the near future, will make it even more powerful. So it's basically running SQL on SQL, that is, the backend being SQL Data Warehouse and then you running SQL on top of it. So let's talk about Data Lake and
Data Warehouse. Some of you might be thinking, hey, I heard about Data Lake, I heard about Data Warehouse, now which one should I use? And the reality is, if you go back to that base Lambda architecture, where you saw the presentation layer and then you saw the Data Lake, you can use them as complementary; the Data Lake is complementary to the Data Warehouse. When I say that, it goes back to the things that I discussed earlier, where you want to use the parallel processing offered by tools like ADLA or HDInsight to do the ETL. You don't want to use the Data Warehouse for doing all that ETL; the Data Warehouse just serves as the data for your presentation layer. So you don't want to scale it to
do ETL jobs, you want to scale it based on the number of users you
want to support for interactivity. Because Data Lake is
there to serve the data, which is the raw data, and if you have to do ETL using the data in the Data Lake Store, you can put the results back into it. And once everything is structured, you then pass it back to the data warehouse. And again, I repeat, with that,
it might mean having multiple tools, but it just means building out a robust architecture in the cheapest possible way. So we talked about this: on the Data Lake side, using the Data Lake and the right sort of compute tools, like HDInsight or ADLA, you are ingesting, you are cleaning the data, cataloging, you are doing machine learning and all that, and then you are pushing it to the data warehouse or SQL DB to do structured analysis, serving as the backend data for Power BI or Tableau and doing quick aggregations and so on. The fact that PolyBase support will be there means the data can be ingested at a very fast rate using parallelism, and you can use it for both sides, where the data sits in Hadoop on one side and in the data warehouse on the other, and they integrate in a nice manner. So we did not talk
a lot about Spark. So let's use this time
to go into details. It's like the newest, I'll say, darling of the big data world, and it has the promise of having everything in the same environment. So it has Spark SQL. It has support for Spark Streaming. It has machine learning libraries. It has GraphX. And the thing is, it is very interactive; it's highly interactive, with everything in one tool. So we'll talk about
how it compares to other tools. It is available in Azure as a Spark cluster that is managed as a PaaS service on HDInsight. So just like when you go and create a cluster, there is a cluster type Spark, which is supported on Linux. I think the latest is 2.0.1, which is much faster than 1.6, with a lot of bug fixes. It has the same SLA offer that you would have for any other HDInsight cluster. It is 100% open, taken from Apache, and we are using the Hortonworks Data Platform base for Spark, just like the other HDInsight cluster types. You get the 99.9% SLA guarantee, and it has got the right sort of certifications to make sure the cluster is compliant and all that; you don't have to do anything extra. It has a lot of community
support, being open source. You will see development happening at a very fast rate, all because of the fact that it has the promise of doing everything in the same environment. So why would you not try to take advantage of that? It has support for Scala, we have Jupyter Notebooks support and Zeppelin, and for the development environment it has support for IntelliJ and Eclipse, and you can work through programming languages like Scala, Python, and so on.
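To give a feel for that all-in-one promise, here is a small hypothetical sketch of the kind of thing you might run in a Jupyter notebook on an HDInsight Spark cluster, where one Spark session serves both SQL queries and MLlib; the table, paths, and feature columns are made up for illustration.

```python
# Hypothetical notebook sketch: Spark SQL and MLlib in one Spark session.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-all-in-one").getOrCreate()

trades = spark.read.parquet("adl://mylake.azuredatalakestore.net/curated/trades/")
trades.createOrReplaceTempView("trades")

# Spark SQL over the same data...
spark.sql("""
    SELECT symbol, SUM(volume) AS total_volume
    FROM trades
    GROUP BY symbol
    ORDER BY total_volume DESC
""").show(10)

# ...and machine learning (MLlib) in the same session, e.g. a simple clustering.
features = VectorAssembler(inputCols=["volume", "price"],
                           outputCol="features").transform(trades)
model = KMeans(k=3, featuresCol="features").fit(features)
```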
So, I talked about this; one of the things you want to consider is how it contrasts with other services. Let's say Spark SQL:
the fact that you have everything available and if it works for
the scenarios, that's fine. I would consider it relative to SQL
Server on-premises, like when you have the license, you want to use
it for SSIS or analysis service or SQL database, and it works for
you, go ahead and use that. And the same thing
applies here as well. But the thing to consider here is it may not be as robust as some of
the general purpose-built solutions. So let's say you talk
about Spark SQL. SQL in this case is not like
the first-class citizen in Spark compared to ADLA or
because when those were built, they were built with more SQL
in mind compared to Spark SQL. But again, being an open source things are,
development are happening there. Spark Streaming, when you compare it
with stream analytics, it does not have the same sort of robustness as
compared to zero stream analytics. When you think it from
ease of point of use. And then when you talk about
machine learning, it may not about the same level of parallelism as
you would see on the open source R, which is, again,
available on HDInsight. So you want to make sure, you want
to use the purpose-built tools or whether this works fine for you, given that you are using the same environment for everything. The good thing about this, as I just talked about, is interactivity. Especially on small datasets, it's highly, highly interactive; it will feel like all those things are happening in real time. For large datasets, because it uses in-memory computing, you need to have SSD caching and so on. And then, going back to
the old school, there is the fact that machine learning uses sampling and so on; it has training data sets. So there is in-built sampling supported out of the box that you can use, especially if you are having issues because of large data sets. That can help combat those issues. So what does it look like
putting it all together? You have the data coming in from your on-premises
while at different places. And then it goes into Azure Data
Lake Store or the blob store. And then you do compute
using HDInsight or Azure Data Lake Analytics. And then on top of that, you are
doing machine learning, which then feeds to the published side,
which is the Azure SQL Database, or SQL Databases, or
SQL Data Warehouse. And then you use it to feed
the data for Power BI so business users or
data analysts could use it for the meaningful information
that they want to use it for. Let's add more complexity to it; this is sort of the Lambda architecture that we talked about initially, when we were talking about the ingestion layer, the speed layer, the batch layer, and the presentation. So right now you have not just the batch data that's coming in, you also have the streaming events that are coming in through Event Hubs or Kafka. And then all your data is going to either Data Lake Store or Azure Blob Storage through the different tools that we talked about, ADF or Event Hubs or Kafka. And then you are doing things
like compute using HDInsight or you're doing compute using ADLA. And that is like batch processing, on top of that you're
doing machine learning. And then you are going to the
streaming layer, where the data is coming in at a high throughput rate for IoT or other scenarios. You are running tools like Stream Analytics or Spark Streaming to do analysis on that real-time data. And then you are adding machine learning on top of that, and feeding the data back to Azure Data Lake Store so that you can do some batch processing on that same data. And through all these
different sources, the data is going to Azure SQL Data Warehouse or Azure SQL Database, which you would then use as the backend data for Power BI or Tableau reports, for the business users to make meaningful decisions from it. So what does it mean from
the manageability prospect there? So this is some of the last
part I was talking about, from a manageability perspective. You would have seen this chart
in a lot of the conferences. But one of the things that
I wanted to point out, since you are talking about building
a robust big data solution, is that you want to be more towards the greener side, towards the right side, than towards the left. So let's take the example of HDInsight. It would fall in as platform as a service, versus, say, ADLA, which would be more like software as a service or in between, versus Hadoop on-premises, which is like the first column, all blue. So in the first case,
Hadoop on-premises, you have to buy the hardware, you have to install Hadoop, you have to manage it, administer it, and do the whole thing. Versus if you're using HDInsight, all the things like the runtime are taken care of by HDInsight and will be there on the VMs. You just pick the number of nodes and the type of cluster, then click, and it will start the cluster in a few minutes, and you don't have to worry about the OS patching and all that, so everything is taken care of. But you have to worry about
the data and the application for which you are using it. Then there is Azure Data Lake Analytics, which I would say falls more in between PaaS and SaaS, or towards SaaS, where you don't even have to manage a cluster; you are just responsible for running the query, and ADLA runs it for you on the back end. And then the last thing I want to point out, with one last example: just like the demo that we saw, in one of the retail stores, what it has is video sensors in the stores. So as customers walk into the store, that data is captured through the video sensors. And through machine learning APIs it will identify the person as male or
in the room, and all that. And so it will collect some
of the demographic data and based on that it will run
promotions in the store. So if you have, like, 50 people or 20 people in this age group, male versus female, and this is the right sort of promotion I want to run, it all happens in the back end through the compute and the machine learning, and then that promotion is run in the store. Similar to some of the Chili's
examples that I talked about. So how does it all fit in? It goes back to the architecture that we talked about in the beginning: your events are being ingested, and then you have the bulk ingestion. Data goes into Data Lake Store, you have different options for compute: machine learning, ADLA, HDInsight. And then for visualization, you are using SQL DB or Power BI, and you have to make sure the data does not become dark data and has the right paths. So we talked about the data lake. We talked about
the Lambda solution and how we want to architect using
the concepts that it defines. We talked about some of the services that are sort of similar, and you also want to make sure which service you want to use. Here are some of
the IT Pro resources, and some general pointers which are fairly generic. With that, I have a minute for any questions. I want to thank you, and I appreciate you coming to listen to this talk. >> [APPLAUSE]