DP-900 Microsoft Azure Data Fundamentals in 4 hours

Captions
Hello and welcome to this virtual session, we're glad you could join us today. Before we get started, there are a few housekeeping items we'd like to go over. Firstly, you can resize the webinar windows to suit your viewing preferences: you can maximize, minimize, and drag the windows to your preferred size, and if you look at the bottom middle of your screen you can click on the widgets you'll need to get the most out of this virtual experience. Secondly, Microsoft specialists are on hand to answer your questions in real time, so feel free to type your questions into the Q&A window and we'll answer them as soon as we can. Lastly, we've provided some additional resources to supplement your learning; you can access them by clicking the links in this section. Without further ado, I'll hand over to our speakers.

Welcome to the Microsoft Virtual Training Day for the Data Fundamentals course. In this training we're going to take you through the fundamentals of data and database concepts. We'll look at core data concepts, teach you what relational data is and what non-relational data is, and take a look at data analytics solutions in Azure, so that at the end of this training you'll have the skills to both provision and deploy data services in Azure. In module one we'll look at core data concepts, some of the roles and responsibilities in data, relational data concepts, non-relational data concepts, and data analytics concepts in Azure as well. So let's get started with module one.

So who am I? My name is Michelle and I'm an Azure technical trainer at Microsoft, which means my job is to help upskill people to the cloud. Before this I was an Azure specialist and an Azure TSP, so I've been doing Azure for quite a while. Outside of Microsoft, training, teaching, and working, I like to travel and I'm a big foodie, so I follow where my nose takes me. I also do a bit of coding and writing in my spare time, so if you're interested you can find some of my blogs on my social media. In the top right corner you'll see a big Azure Data Fundamentals badge: if you sit the DP-900 exam and pass it, this is the certification you'll get, the Azure Data Fundamentals certification, along with a badge you can use to show off.

Now let's take a look at the course agenda. We've got four modules in this course. In module one we'll explore core data concepts: roles and responsibilities in data, relational data concepts, non-relational data concepts, and concepts to do with data analytics. In module two we'll explore relational data in Azure in more depth: the relational data offerings in Azure, how to provision and deploy them, and how to query relational data in Azure. In module three we'll explore non-relational data in Azure, looking at how to provision, deploy, and manage non-relational data services. And finally we'll wrap up with module four, where we'll explore modern data warehousing and data analytics in Azure.
In module four we'll look at what a modern data warehouse looks like, data ingestion in Azure, some of the data storage and processing capabilities in Azure, and finally what Microsoft Power BI does.

Lesson one: explore core data concepts. What is data? By definition, data is a collection of facts, such as numbers, descriptions, and observations, used in decision making. You can classify data as structured, semi-structured, or unstructured. Structured data is data that adheres to a schema, so all of the data has the same fields or properties; structured data can be stored in a database table with rows and columns. Semi-structured data doesn't fit as neatly into tables, rows, and columns; instead it uses tags and keys that organize it in some sort of hierarchy, and a typical example is data stored in JSON format, or JavaScript Object Notation. Unstructured data does not follow any schema or structure, nor does it need to be organized in any particular way, so it can be anything: video files, image files, text files, or audio files.

When we talk about data we typically talk about transactional versus analytical data stores. A transactional data store, or transactional system, is often what most people consider to be the primary function of business computing: it records transactions. These could be financial transactions, such as the movement of money between bank accounts in a banking system, or transactions that are part of a retail system, like tracking the payments for goods and services a customer has paid for. You can think of a transaction as a small, discrete unit of work. An analytical system, on the other hand, is designed to support business users who need to query data and get a big-picture overview of the organization. Analytical systems are concerned with capturing raw data and using it to generate insights, which the organization can then use to make business decisions. For example, a manufacturing company might want to find trends in the industry and determine which products to launch next to maximize profitability, and an analytical system can support that decision-making process.

Transactional systems are often very high volume, sometimes handling millions of transactions in a single day, so the data being processed needs to be accessible very quickly. The work performed by a transactional system is often referred to as online transactional processing, or OLTP. To support this fast processing, the data in a transactional system is often divided into pieces. For example, in a relational system, each table involved in a transaction contains only the columns necessary to perform that transactional task. In a bank transfer example, a table that holds information about the funds and accounts might contain the customer account number and the current balance, while other tables not involved in the transfer operation hold information like the customer's name and address, and maybe the account history. Splitting the tables out into separate groups of columns like this is called normalization. Normalization enables a transactional system to cache the information required to perform transactions in memory and really speed up throughput.
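As a hedged sketch of the normalization idea just described, the bank-transfer example might be split into tables like the following; the table and column names are assumptions for illustration, not taken from the session.

-- Hypothetical normalized OLTP tables for the bank-transfer example.
-- Only the columns needed for a transfer live in Accounts; other
-- customer details sit in a separate table.
CREATE TABLE Accounts (
    AccountNumber  INT            NOT NULL PRIMARY KEY,
    CurrentBalance DECIMAL(18, 2) NOT NULL
);

CREATE TABLE CustomerDetails (
    CustomerID    INT          NOT NULL PRIMARY KEY,
    AccountNumber INT          NOT NULL,  -- links back to Accounts
    Name          VARCHAR(100) NOT NULL,
    Address       VARCHAR(200) NULL
);

-- A transfer only has to touch the small Accounts table,
-- which keeps the transaction fast.
UPDATE Accounts SET CurrentBalance = CurrentBalance - 100 WHERE AccountNumber = 1001;
UPDATE Accounts SET CurrentBalance = CurrentBalance + 100 WHERE AccountNumber = 1002;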
While normalization enables fast throughput for transactions, it can make querying quite complex: queries involving normalized tables frequently need to join the data together across several tables, and this can make things quite difficult for business users who need to examine the data.

Most analytical data processing systems need to perform similar tasks: data ingestion, data storage, data transformation, data querying, and data visualization. This is the pipeline of services used to build an analytical system. Data ingestion is about capturing raw data. This data could come from devices measuring environmental information like temperature and pressure, from point-of-sale devices recording the items being purchased by a customer in a supermarket, from financial systems recording the movement of money between bank accounts, or from weather stations; some of it can also come from your OLTP systems or relational databases on premises. To process and analyze this data we must first store it in a repository of some sort, and this repository could be a file store, a document database, or a relational database. Then we have to transform and process the data, because the raw data might not be in a format suitable for querying. The data could contain anomalies that should be filtered out, and it might require some transformation; for example, dates or addresses might need to be converted into a standard format. After the data is ingested into a data repository, you may also want to do some cleaning operations to remove questionable or invalid data, or perform aggregations such as calculating profits, margins, or other key performance indicators (KPIs) used to evaluate how your organization is growing and performing against the industry or the market.

Once we've processed and transformed the data and made it ready to be used, we can start querying and doing analytics on it. During the data querying process we might be looking for trends, or attempting to find the root causes of problems or issues in our organization. Many database management systems provide tools that enable you to perform these ad hoc queries against your data and generate regular reports. Once we have the results from our queries, we probably want to present them visually, and this is where the data visualization stage happens. Data is usually represented in tables with rows and columns, which isn't easy to digest in that numeric format, so visualizing the data can be a useful way to examine it: you can generate charts like bar charts and line charts, plot the data onto a geographical map if it's geographical data, illustrate how data changes over time, and pick out the trends in your data. Microsoft offers a visualization tool to do this: the service is called Microsoft Power BI, and it provides a rich graphical representation of your data.
Data processing is simply the conversion of raw data into meaningful information through a process. Depending on how the data is ingested into your system, you could process each data item as it arrives, or you could buffer the raw data items and process them in groups. Processing data as it arrives is called streaming, and buffering and processing the data in groups is called batch processing.

In batch processing, newly arrived data elements are collected into a group, and the whole group is processed at some future time as a batch. Exactly when each group is processed can be determined in a number of ways: you can process the data on a scheduled time interval, say every hour, or you could use a trigger, for example when a certain number of data items have arrived. One advantage of batch processing is that you can process large volumes of data at a convenient time, scheduling the run for when your computers or systems might otherwise be idle, such as overnight or during off-peak hours. A disadvantage is that there is often a time delay between ingesting the data and getting the results. All of a batch job's input data must be ready before the batch can be processed, which means the data needs to be carefully checked: problems in the data can cause the batch job to crash, so the input must be thoroughly checked before the job can be executed, and even very minor data errors, like a typo in a date, can cause the batch job to not run. In stream processing, each new piece of data is processed as it arrives: as we ingest the data we process it in real time. Unlike batch processing, there's no waiting until the next batch interval, because we process each individual data item as it comes in. Stream processing is very useful for scenarios dealing with new, dynamic data that is generated on a continuous basis.

Lesson two: exploring roles and responsibilities in the world of data. There are three key roles that deal with data in most organizations: database administrator, data engineer, and data analyst. A database administrator's main role is to manage databases, assign permissions to users, store backup copies of data, and restore data in the case of any failures. A data engineer's role is to work with the data: apply data cleaning routines and transformations, build out data pipelines and processes, identify business rules, and turn data into useful information prepared for data analytics workloads. A data analyst's role is to explore and analyze the data by creating visualizations and charts that help an organization make better and more informed decisions.

Some of the common tools a database administrator might use include Azure Data Studio, SQL Server Management Studio, the Azure portal, and the Azure CLI. A database administrator uses these tools to configure the overall availability of their database servers and databases, and they're responsible for maintaining consistent performance and optimizing those servers and databases.
They need to work with stakeholders to ensure that policies are implemented, that tools and processes are set up, and that backup plans are available to recover any database servers or databases that might fail due to some error, a natural disaster, or human error. They're also responsible for the security of the database servers and databases, which means they're the ones who grant access rights and determine who gets to access which databases and who doesn't.

Some of the tools a data engineer might use include Azure Synapse Studio, SQL Server Management Studio, the Azure portal, and the Azure CLI. A data engineer's responsibilities include maintaining and managing data privacy in the cloud as well as on premises. They're also responsible for managing and monitoring data stores and data pipelines, making sure the data performs as expected, by doing a combination of cleaning and transformation to prepare the data for data analytics queries. Some of the common tools a data analyst might use include Power BI Desktop, the Power BI service, and Power BI Report Builder. Data analysts typically design and maintain scalable business models with data; they might also engage in cleaning and transformation activities to further their analytics, or build reports and visuals to help business decision makers digest the information and results in a more visually pleasing way. The main goal of a data analyst is to turn raw data into insights based on the business's requirements, and those insights can then be turned into actionable decisions.

Lesson three: describe concepts of relational data. We'll start with tables. A table is where data is stored in a relational database. Tables consist of rows and columns, all the rows have the same number of columns, and each column is defined by a data type. One of the main benefits of a computer database is that it makes it easy to store information so that it's quick and easy to find. For example, an e-commerce system might use a database to record information about the products the organization sells, the details of its customers, and the orders they've placed. So we have a Customers table here which stores the customer ID, the customer name, and the customer phone number. A relational database provides a model for storing this data and query capabilities that let us retrieve it very quickly: if I want to look up customer ID 106 in this table, I can see that it's Alex Peterson.

In a relational database you typically model collections of entities from the real world as tables. A customer is an entity, so we create a Customers table; an entity generally describes a thing about which we want to hold information in the database. In the e-commerce example we might also create tables for products and orders, because they're also entities in the real world. A table contains rows, and each row represents a single instance of an entity: each row in the Customers table contains the data for a single customer, each row in the Products table defines a single product, and each row in the Orders table represents an order made by a customer. The rows in a table can have more than one column defining the properties of the entity; in this case the properties are the customer ID, the customer name, and the customer phone number.
All the rows in the same table have the same number of columns, and some of those columns are used to maintain relationships between tables, which is where the relational model gets its name. Typically, when designing a relational data model, we normalize the data to reduce storage, avoid data duplication, and improve data quality. For example, instead of storing a replica of the customer's details in the Orders table, we split that data into two separate tables and reference the Customers table from the Orders table. The Orders table now contains both a customer ID and a product ID: the customer ID relates to the customer ID in the Customers table, and the product ID relates to the product ID in the Products table. So we typically design a relational database by creating a relational data model in which each table has a primary key: the customer ID in the Customers table, the order ID in the Orders table, and the product ID in the Products table are all examples of primary keys. A primary key is associated with a foreign key in another table: the customer ID and product ID in the Orders table are foreign keys, and a foreign key references a primary key and is used to establish the relationship between tables.

Another concept to know in relational data is indexes. An index helps you search for data in a table. You can think of an index over a table like the index at the end of a book: it helps you find the pages relating to certain topics, so when you want a reference to a particular subject you simply look it up in the index and go to the specified page. Imagine you didn't have that index: to find out about a particular topic you'd have to read through the entire book. Similarly, when we create an index in a database we specify a column from a table, and the index contains a copy of that column's data in sorted order, with pointers to the corresponding rows in the table. When a user runs a query that specifies that column, for example in the WHERE clause, the database management system can use the index to fetch the data much more quickly, instead of scanning through the entire table row by row. You can create many indexes on a table; for example, if we wanted to find all the customers for a specific region, we could create another index on the appropriate column of the Customers table. But indexes aren't free: an index consumes additional storage space, and each time you insert, update, or delete rows in a table, the indexes for that table need to be maintained. This additional work can slow down your database operations, so if you decide to use indexes you must strike a balance between having indexes to speed up your queries and the cost they add to your database operations. In a table that is read-only, or that contains data that is modified infrequently, more indexes will generally improve query performance. If the table is queried infrequently but is subject to a large number of inserts, updates, and deletes, then creating indexes on that table can actually slow your database system down.
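To make the primary key, foreign key, and index ideas concrete, here is a hedged T-SQL sketch of the Customers/Orders/Products design described above; the exact column names and the index are illustrative assumptions rather than the slides' definitions.

-- Customers, Products, and Orders, normalized as described: Orders references
-- the other tables through foreign keys instead of duplicating their details.
CREATE TABLE Customers (
    CustomerID    INT          NOT NULL PRIMARY KEY,
    CustomerName  VARCHAR(100) NOT NULL,
    CustomerPhone VARCHAR(20)  NULL
);

CREATE TABLE Products (
    ProductID   INT          NOT NULL PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    OrderID    INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL FOREIGN KEY REFERENCES Customers (CustomerID),
    ProductID  INT NOT NULL FOREIGN KEY REFERENCES Products (ProductID)
);

-- An index on a non-key column speeds up lookups on that column,
-- at the cost of extra storage and slower inserts, updates, and deletes.
CREATE INDEX IX_Customers_CustomerName ON Customers (CustomerName);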
Now, a view is a virtual table based on the result set of a query. In the simplest case you can think of a view as a window onto specified rows in an underlying table: you can query the view and filter the data in much the same way as you would with a table. A view can also join tables together, so if you regularly need to find, say, the details of customers and the products they've ordered, you could create a view to do so.

Lesson four: exploring concepts of non-relational data. In a non-relational system you typically store the information for entities in collections or containers rather than relational tables. Two entities in the same collection can have different sets of fields, rather than the regular set of columns found in a relational table. This lack of a fixed schema means each entity can be self-describing, which is often achieved by labeling each field with the name of the data it represents; for example, a non-relational collection of customer entities might look something like this. A key aspect of non-relational databases is that they let you store data in a very flexible manner: because they don't impose a schema on the data, they focus on the data itself rather than how to structure it. This approach means you can store information in a more natural format that mirrors the way your applications query, consume, and use it. In this example the fields are prefixed with a name, and some fields can have multiple subfields: for telephone we can have a home telephone, a business telephone, and a cell telephone, and these multiple subfields are denoted by enclosing them between square brackets. If we want to add a new customer to this non-relational data store, it's simply a matter of inserting an entity with its fields labeled in some meaningful way; the application that queries this data must then be prepared to parse the information in the entities it retrieves.

The data retrieval capabilities of non-relational databases can vary. Each entity should have a unique key value, so for example the ID needs to be unique, and the entities in the collection are usually stored in key-value order. Here the key is the ID, and the value holds the name, telephone numbers, and addresses: for example, the key is 1 and the value is Mark Hanson with his phone numbers and addresses. The simplest type of non-relational database enables an application to specify the unique key, or a range of keys, as the query criteria, so in this example the database would let an application query customers by ID only. Filtering data on other fields requires scanning through the entire collection of entities, parsing each entity in turn, and applying the query criteria to each one to find matches. A query that fetches a customer's details by ID can run quickly, but a query that attempts to find all the customers with a UK address, for example, would have to iterate through every entity and examine each field in turn. You can imagine that in a database containing millions of entity records this kind of query would take a considerable amount of time to run, so you would typically query a non-relational database using the unique key value for each entity.
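A minimal sketch of the kind of self-describing customer entity described above, written as JSON; apart from the ID and name mentioned in the narration, the field values are made up for illustration.

{
  "ID": "1",
  "Name": "Mark Hanson",
  "Telephone": [
    { "Home": "1-555-0100" },
    { "Business": "1-555-0120" },
    { "Cell": "1-555-0130" }
  ],
  "Address": [
    { "Home": "121 Main Street, Some City, TX" }
  ]
}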
Some of the more advanced non-relational systems support indexing, much like indexing in a relational database, so queries can use the index to identify and fetch data based on a non-key field. An example of a more advanced system that supports this is Azure Cosmos DB, a non-relational database management system available in Azure; with this indexing we can query non-key fields in a non-relational database.

Semi-structured data is data that contains fields, but the fields don't have to be the same in every entity: you only define the fields you need on a per-entity basis. The data must be formatted in such a way that an application can parse and process it, and one of the more common ways of storing and fetching the data for each entity is JSON. JSON stands for JavaScript Object Notation; it's a format typically used by JavaScript applications to store data in memory, but it can also be used to read and write documents to and from files. A JSON document is enclosed in curly braces; each field has a name followed by a colon and then the value of the field. Fields can contain simple values or subdocuments, each starting with its own curly braces, and fields that contain multiple values are held in arrays, denoted by square brackets. Inside these values you can define whatever you like; the important point is that as long as you follow the JSON grammar, you can put any type of data into the value. When an application reads a document it needs to use a JSON parser, which breaks the document up into its component fields and extracts the individual pieces of data.

Avro is a row-based format created by Apache. Each record contains a header that describes the structure of the data in the record; the header is stored as JSON, and the data is stored as binary information. An application typically uses the header to parse the binary data and extract the fields it contains. Avro is quite a good format for compressing data and minimizing storage and network bandwidth requirements. ORC stands for Optimized Row Columnar format, and it organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive; Apache Hive is a data warehouse system that supports fast data summarization and querying over very large data sets, and it also supports SQL-like queries over unstructured data. An ORC file contains stripes of data; each stripe holds the data for a column or set of columns, and contains an index into the rows in the stripe, the data for each row, and a footer with statistical information such as the sum, count, maximum, and minimum for that particular stripe of data. Parquet is another columnar data format, created by Cloudera and Twitter. Parquet files contain row groups: the data for each column is stored together in the same row group, and each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk, and an application can use this metadata to quickly locate the correct chunk for a given set of rows and retrieve the data in the specified columns for those rows.
Parquet was developed for storing and processing nested data types efficiently, and it supports very efficient compression and encoding schemes.

Unstructured data is data that doesn't naturally contain fields; examples include video files, audio files, and other types of media streams. Each data item is a blob of binary data, so you can't search on any specific elements inside this kind of data. You might, for example, use Azure Blob storage to store unstructured data: blob stands for binary large object, and Azure Blob storage supports block blobs, which are used for basic read and write operations and have no internal search capabilities. You could also consider files a form of unstructured data, although some files have metadata that describes what the file contains. Think of an image or photograph, a Word document, or an Excel spreadsheet: they're all files, and the metadata might include who the owner of the file is and when it was created - the properties of the file - but the main content of the file itself is unstructured.

You might have heard or seen the term NoSQL floating around when reading about non-relational databases. NoSQL is a rather loose, generic term that simply means non-relational. Some people debate whether NoSQL means "not SQL" or "not only SQL", but essentially it just means non-relational data. NoSQL, or non-relational, databases generally fall into four categories: key-value stores, document databases, column family databases, and graph databases.

A key-value store is the simplest and often quickest type of NoSQL database for inserting and querying data. Each data item in a key-value store has two elements, a key and a value: the key uniquely identifies the item, and the value holds the data for that item. The value is opaque to the database management system, which means the database management system only sees the value as an unstructured block of data; conversely, if the value were not opaque but transparent, the database management system would understand all the fields and how the data is organized - a relational table, for example, has a transparent structure. To query a key-value store we must specify the keys that identify the items to be retrieved: you can't search on the values, so an application that retrieves data from a key-value store is responsible for parsing the contents of the values returned. Write operations in a key-value store are typically restricted to inserts and deletes only, so if you need to update an item you must first retrieve it, modify it in memory, and then write it back to the database: retrieve, modify, and insert it back into the key-value store. The focus of a key-value store is the ability to read and write data very quickly, which is why search capabilities are secondary. Key-value stores are great for data ingestion when you're dealing with a large volume of data coming in continuously as a stream that needs to be stored very quickly. Azure Table storage is an example of a key-value store, and Azure Cosmos DB implements a key-value store using its Table API.
In a document database the data is stored as documents. Each document has a unique ID, and the fields in the documents are transparent to the database management system. Document databases typically store data in JSON format, but they can also use other types of encoding, such as XML, YAML, or binary JSON, and documents can even be stored as plain text. Because the fields in the documents are exposed to the storage management system, an application can query and filter data by using the values in those fields. Typically a document contains all the data for an entity; what constitutes an entity is application specific, so an entity could be the details of a customer, an order, or a combination of both, and a single document might contain information that would typically be spread across several tables in a relational database management system. A document store doesn't require all documents to have the same structure. This free-form approach gives a lot of flexibility: applications can store different data in documents as business requirements change. An application can retrieve a document by using its document key, which is the unique identifier for the document; some document databases create the document key automatically, while others require you to specify which attribute of the document to use as the key. The application can also query documents based on the values of one or more fields, because some document databases support indexing to facilitate fast lookup of documents based on fields instead of the key. Most document databases can ingest large volumes of data rapidly in comparison to a relational database, but they're not as optimal as a key-value store for that kind of processing; the main advantage of a document database is that it has search and query capabilities that a key-value store typically doesn't have. Azure Cosmos DB implements a document database using its core SQL API.

A column family database organizes data into rows and columns; examples of this kind of structure include ORC and Parquet files. A column family database can appear very similar to a relational database, at least conceptually, but its real power lies in its denormalized approach to structuring sparse data. You can think of a column family database as holding tabular data comprising rows and columns, but the columns are divided into groups known as column families, and each column family holds a set of columns that are logically related. For example, we could use a column family database to group data into two column families holding customer information and housing information, and then fetch both the customer information and the housing information at the same time with a single query.

An analogy I like to use to explain column family databases: imagine I held a virtual video treasure hunt on YouTube. I uploaded ten videos, each ten minutes long, and I tell you that I have hidden a secret letter inside each of these videos. If you find each letter and combine them to form a word, that secret word is the location of the prize. After watching a couple of the videos, you notice that I've hidden the secret letter at exactly the four-minute mark of every single video.
That means if you go to the four-minute mark of every single one of these videos, you'll find the secret letter and, in turn, the secret word. In this analogy, each video is a column, the group of ten videos is the column family, each second of video represents a row in each of the columns, and the four-minute mark is the key used to fetch the data for the rows in each of the columns. You should always design your column families to be optimal for the most common queries your applications need to perform. Each row in a column family database contains a key, and you can fetch the data for a row by using this key. In most column family databases the column families are stored separately, and the data for a single entity that spans multiple column families has the same row key in each column family. One of the most widely used column family database management systems is Apache Cassandra, and Azure Cosmos DB supports the column family approach through its Cassandra API.

A graph database stores two types of information: nodes and edges. You can think of a node as an instance of an entity, and edges specify the relationships between nodes. Both nodes and edges can have properties that describe them, and edges can also have a direction indicating the nature of the relationship. The purpose of a graph database is to enable an application to perform queries efficiently by traversing the network of nodes and edges and analyzing the relationships between entities. For example, imagine we represent all the employees in an organization as a graph: each node represents an employee, the edges represent the reporting line for each employee, and the arrows on the edges show the direction of the reporting line. With this kind of graph model we can ask questions like "find all the employees who report to the IT manager" or "find out who works in the same department as me". For large graphs with lots of entities and relationships, you can perform these complex queries quickly by traversing the network of relationships. In SQL, these kinds of queries would normally require a lot of recursive join operations and nested subqueries, which can become very expensive and slow. Azure Cosmos DB supports graph databases through the Gremlin API; the Gremlin API is a standard language for creating and querying graphs.
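As a hedged illustration of why this is awkward in SQL, a question like "find all the employees who report to the IT manager" over a conventional employees table needs a recursive join; the table and column names below are assumptions for illustration.

-- Hypothetical Employees table in which each row stores who the employee reports to.
-- Walking the whole reporting chain needs a recursive common table expression,
-- where a graph database answers the same question with a simple traversal.
WITH Reports AS (
    SELECT EmployeeID, EmployeeName, ManagerID
    FROM   Employees
    WHERE  EmployeeName = 'IT Manager'               -- start at the IT manager

    UNION ALL

    SELECT e.EmployeeID, e.EmployeeName, e.ManagerID
    FROM   Employees e
    JOIN   Reports r ON e.ManagerID = r.EmployeeID   -- follow reporting lines downward
)
SELECT EmployeeID, EmployeeName
FROM   Reports
WHERE  EmployeeName <> 'IT Manager';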
Lesson five: exploring concepts of data analytics. Data analytics is concerned with taking the data your organization produces and using it to establish a picture of how your organization is performing and what you can do to improve or maintain that performance. Data analytics helps you identify strengths and weaknesses in your organization and make the best business decisions based on the results. The data a company uses can come from many sources, and in a data analytics solution you typically combine all this data and construct a data warehouse, which you can then use to ask questions about your business operations. Building a data warehouse requires you to capture the data you need and transform it into an appropriate format for querying and data analytics workloads; you can then use visualization tools to examine the information and help identify trends, issues, problems, and causes in your organization.

Typically the data analytics process begins with data ingestion. Data ingestion is the process of obtaining and importing data for immediate use or for storage in a database. During ingestion, data could arrive as a continuous stream or come in batches, and the purpose of the ingestion process is to capture this data and either process it immediately or store it. During ingestion we might do some filtering, for example if we're ingesting data from a suspicious or corrupt source: suspicious data might come from an unexpected data source, and corrupt or duplicated data could be due to a device malfunctioning, transmission failures, or someone tampering with the data. It's also possible to do some transformation at this stage, converting data into a more standardized form for later processing; for example, we might reformat dates and times so that all ingested data uses the same representation, or convert measurement data so it all conforms to the same units. But if you're transforming data at this stage, keep it quick: don't run complex calculations or aggregations during data ingestion. If you need to do that, store the data first and process it later, which leads me to data processing.

The data processing stage generally occurs after the data has been ingested and collected. Data processing takes the data in its raw format, cleans it, and converts it into a more meaningful format that can be stored in tables, graphs, document databases, and so on; the results can then be used to run queries and generate visualizations for employees throughout the organization. The aim of data processing is to convert raw data into a business model: a business model describes the data in terms of meaningful business entities, and we might aggregate data together to summarize information. The data processing stage can also be used to generate predictive and analytical models from the data. Data processing can be quite complex: it might involve automating a lot of scripts, and you might use tools like Azure Databricks, Azure Functions, and Azure Cognitive Services to examine and reformat the data and generate those models; a data analyst could then use machine learning to help determine future trends based on the models created from the data.

The data processing mechanism can take two approaches to retrieving the ingested data, processing it to transform it into one of these business models, and saving the transformed data and models. These two approaches are known as ETL and ELT. ETL stands for extract, transform, and load: the raw data is retrieved and transformed before being saved. During the transform stage the data might be standardized, cleaned, transformed, and aggregated; missing values could be filled in and corrupt data thrown out. Other tasks could involve basic data cleaning, de-duplicating data, and reformatting some of the contents. Once the data is in a standardized format suitable for analytics, it gets loaded into a repository. The extract, transform, and load steps can be performed as a continuous pipeline of operations, so ETL is well suited to fairly simple models with little dependency between items.
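As a rough sketch of the kind of transform step just described, assuming a hypothetical staging table of raw ingested orders: dates are standardized, duplicates are removed, and obviously invalid rows are filtered out before loading.

-- Hypothetical ETL transform: clean raw staged rows before loading them
-- into the analytics table.
INSERT INTO Sales.CleanOrders (OrderID, CustomerID, OrderDate, Amount)
SELECT DISTINCT                        -- de-duplicate repeated rows
       OrderID,
       CustomerID,
       CONVERT(date, OrderDate, 103),  -- standardize dd/mm/yyyy text to a date
       Amount
FROM   Staging.RawOrders
WHERE  Amount IS NOT NULL              -- drop corrupt or incomplete records
  AND  Amount >= 0;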
The alternative approach to ETL is ELT, short for extract, load, and transform. This process is slightly different because the data is stored first, before it's transformed; the data processing engine can then take an iterative approach, retrieving and processing the data from storage before writing the transformed data and models back to storage. ELT is more suitable for constructing complex models that depend on multiple items in the database, often using periodic batch processing, and it's also a more scalable approach for cloud workloads because it can make use of the extensive compute power of the cloud. ETL is more stream-oriented than ELT: with ETL we can filter the data before it's stored, which is good for data privacy and compliance because we can remove sensitive data before we store it; with ELT we can't do that, because we load the data before we transform and process it.

Azure provides a couple of options for implementing both ETL and ELT. For example, if you're storing data in an Azure SQL database, you can use SQL Server Integration Services to build out your ETL and ELT pipelines; SQL Server Integration Services can extract and transform data from a whole range of sources, such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations. Another approach is to use Azure Data Factory, a cloud-based data integration service that allows you to create data-driven workflows and pipelines for orchestrating data movement and transforming data at scale. With Azure Data Factory you can create and schedule data-driven workflows, also called pipelines, that ingest data from many different data sources, and you can build quite complex ETL processes that transform data visually with data flows, or use other compute services like Azure HDInsight, Azure Databricks, and Azure SQL Database to do some of the transformation as part of the pipeline.

Once we've processed our data and gathered insights and results from our analytics and queries, we can visualize the information our business models present using data visualization tools. To start the visualization piece, we generally begin by creating reports on our findings and results. Reporting is the process of organizing data into informational summaries to monitor how different areas of an organization are performing. Reporting helps companies, for example, monitor how their online business is doing and find out what isn't performing well. Good reporting should raise questions about the business from its end users: reporting tells us what happened, and we then use analysis to explain why it happened and what we can do about it. Business intelligence, or BI, refers to the technologies, applications, and practices for collecting, integrating, analyzing, and presenting business information back to the organization.
The main goal of business intelligence is to support better decision making. Business intelligence systems can provide historical, current, and predictive views of business operations, often using data gathered in a data warehouse or from live operational data systems. They might present interactive slice-and-dice pivot table analysis, visualize the data, or do statistical data mining with it. As part of the business intelligence process we could also create applications that tackle particular areas of the organization, like sales, production, or finance, to track how the business is performing against the industry and the market.

Data visualization helps you focus on the meaning of data rather than looking at the data itself: a good data visualization enables you to quickly spot trends, anomalies, and potential issues. Some of the common forms of visualization include bar and column charts, line charts, matrices, key influencers, treemaps, scatter plots, bubble charts, and filled maps. Bar and column charts let you see how a variable changes across different categories. Line charts show the overall shape of values over a period of time. Matrices are a tabular structure that summarizes data; report designers often use matrices in reports and dashboards to let the user select a component and see its significance across the data. Key influencer charts display the major contributors to a particular result or value, so they're great for understanding which factors influence a specific key metric. Treemaps are charts of colored rectangles where the size of each rectangle represents the relative value of the item; they can be hierarchical, with rectangles nested within other rectangles. Scatter plots, or scatter charts, show the relationship between two numeric values, and a bubble chart is essentially a scatter chart that replaces the data points with bubbles, where the bubble size represents the relative size of the data point. And if you've got geographical data, you can use a filled map to display how a value differs in proportion across a geography or region, with shading that ranges from light, generally meaning less frequent or lower, to dark, meaning more frequent or higher.

Just to wrap up on data analytics: data analytics is about examining, transforming, and arranging data so that you can study it and extract useful information. You can categorize data analytics as descriptive, diagnostic, predictive, prescriptive, and cognitive. Descriptive analytics is about asking and answering questions like "what is happening in my business?": it helps you answer questions about what has happened based on historical data, using techniques that summarize large data sets to describe outcomes to stakeholders. An example of descriptive analytics is generating reports to provide a view of an organization's sales and financial data. Diagnostic analytics is used to answer questions like "why is this happening to my business?": we generally use it to find the root cause of something, taking the findings from descriptive analytics and diving a little deeper to find out why it happened.
Generally, diagnostic analytics takes a three-step approach: we identify the anomalies in the data, which could be unexpected changes in a metric or a particular market; we collect data related to these anomalies; and then we use statistical methods to discover relationships and trends that could explain them. Predictive analytics is about answering questions like "what is likely to happen in the future?" based on previous trends and patterns. It uses techniques that take historical data to identify trends and determine whether they're likely to recur, including statistical and machine learning techniques such as neural networks, decision trees, and regression. Prescriptive analytics is used to help answer questions about what actions should be taken to achieve a particular goal or target: by using insights from predictive analytics, data-driven decisions can be made, so prescriptive analytics allows a business to make informed decisions in the face of uncertainty. With prescriptive analytics we can rely on machine learning strategies to find patterns in large data sets, and we can analyze past decisions and events to figure out the likelihood of different outcomes. And finally, cognitive analytics: this type of analytics is inspired by how the human brain works - how we process information, draw conclusions, and codify instincts and experiences into learning and understanding. Cognitive analytics attempts to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add those findings back into the knowledge base for future inferences - a self-learning feedback loop. Cognitive analytics helps you learn what might happen if circumstances change and how you might handle those situations differently.

And that's it for module one: we've had a look at core data concepts, relational data, non-relational data, and data analytics concepts. Let's take a ten-minute break, and I'll see everyone back here in ten minutes to start module two.

Welcome back from the break. Let's continue with module two, where we're going to have a look at querying relational data in Azure. Lesson three: querying relational data in Azure. Let's have a look at what SQL is. SQL stands for Structured Query Language. SQL is used to communicate with a relational database, and it's the standard language for relational database management systems. SQL statements are used to perform tasks like updating data in a database or retrieving data from a database. Common relational database management systems such as Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle all use a variant of SQL. Just a bit of fun history: SQL was originally standardized by the American National Standards Institute (ANSI) in 1986, and by the International Organization for Standardization (ISO) in 1987.
Since then the standard has been extended several times as relational database vendors have added new features to their systems. Additionally, most database vendors have included their own proprietary extensions that are not part of the SQL standard, which has resulted in a variety of dialects of SQL.

When it comes to SQL statement types, there are three categories you should know about: DML, or data manipulation language; DDL, or data definition language; and DCL, or data control language. DML statements are used to query and manipulate data: SELECT, INSERT, UPDATE, and DELETE. DDL statements are used to define database objects: CREATE, ALTER, DROP, and RENAME. DCL statements are used to manage security permissions: GRANT, REVOKE, and DENY. You can use DML statements like SELECT, INSERT, UPDATE, and DELETE to accomplish most of what you want in a database: SELECT lets you read and retrieve particular data rows or items from a table, INSERT lets you add a data row to a table, UPDATE lets you modify existing data in a row, and DELETE lets you remove a particular row from a table. Although many of these SQL statements are part of the SQL standard, many database management systems also have their own proprietary extensions to handle the specifics of that system. These extensions are not covered by the SQL standard and include things like security management and programmability, covering features such as stored procedures, triggers, and managing user accounts; PostgreSQL and MySQL each have their own versions of these features. Some of the popular dialects of SQL include Transact-SQL, or T-SQL, the version of SQL used by Microsoft SQL Server and Azure SQL Database; pgSQL, the dialect with extensions implemented in PostgreSQL; and PL/SQL (Procedural Language/SQL), the dialect used by Oracle databases.

Now let's look at the elements of a typical SQL query using DML statements. We usually start with a SELECT statement: you use SELECT to retrieve data from tables and specify certain conditions, and after the SELECT keyword you specify the columns you wish to retrieve. The FROM clause decides which table to retrieve your data rows from. You can then optionally add a WHERE clause, a conditional statement that specifies which data to include or exclude depending on how the condition evaluates. If you're using aggregation or summarization functions in the SELECT statement, you can use the GROUP BY clause to group the aggregation or summary you're computing, and the ORDER BY clause, another optional element, can be used to rank the results. If we take a look at an example of a SELECT statement, in plain English this query is saying: find the employees who served customer 71 at least once, and rank them by employee number.
In SQL, I want to retrieve data from the Orders table in the Sales database, which is specified in the FROM clause, and the data I'm interested in is the employee ID and the order year, which is specified in the SELECT clause. I'm only interested in data where the customer ID is 71, which is specified in the WHERE clause. I also want the results grouped by employee ID and order year, which is specified in the GROUP BY clause, and I only want results where there is at least one record returned for each group, which is specified in the HAVING clause. Finally, I want to rank the results first by employee ID and then by order year, which is specified in the ORDER BY clause.

If we look at an example of an INSERT statement, in plain English this is simply saying: add a new order to the order details table. In SQL, we insert into the OrderDetails table, which is in the Sales database, the following values: for the order ID I want to insert 10255, for the product ID 39, for the unit price 18, the quantity is 2, and the discount is 0.05, which means 5%.

You use DDL statements to create, modify, and remove tables and other objects in a database, such as tables, stored procedures, and views. The most common DDL (data definition language) statements are CREATE, which allows you to create a new object in the database, such as a table or a view; ALTER, which allows you to modify the structure of an object, for instance altering a table by adding more columns to it; DROP, which allows you to remove an object from the database, for example dropping a table you no longer need; and RENAME, which allows you to rename an existing object, such as a table. If we look at an example of the CREATE statement, this example creates a new table called MyTable in the database, and the items between the parentheses specify the details of each column in the table: its name, its data type, whether the column must have a value (NOT NULL), and whether the data in the column is used to uniquely identify a row (the PRIMARY KEY part). Every table should have a primary key, although SQL doesn't actually enforce this rule; the primary key is used to uniquely identify each row in the table. Columns marked as NOT NULL are referred to as mandatory columns, which means they must contain a value; if you omit the NOT NULL clause, you can create rows that don't contain a value in that column, and an empty column in a row is said to have a NULL value. The data types available for columns in a table vary depending on the database management system, but most systems support numeric types like INT, which stands for integer, and string types like VARCHAR, which stands for variable-length character data and generally holds text.
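For reference, here are hedged reconstructions of the three statements just described; the column names follow the narration but are paraphrased rather than copied from the slides.

-- Find the employees who served customer 71 at least once,
-- ranked by employee number and order year.
SELECT   EmployeeID, YEAR(OrderDate) AS OrderYear
FROM     Sales.Orders
WHERE    CustomerID = 71
GROUP BY EmployeeID, YEAR(OrderDate)
HAVING   COUNT(*) >= 1
ORDER BY EmployeeID, OrderYear;

-- Add a new order line to the order details table.
INSERT INTO Sales.OrderDetails (OrderID, ProductID, UnitPrice, Quantity, Discount)
VALUES (10255, 39, 18.00, 2, 0.05);

-- Create a new table: ID is the mandatory primary key, Name is mandatory,
-- and Description may be left NULL.
CREATE TABLE MyTable (
    ID          INT          NOT NULL PRIMARY KEY,
    Name        VARCHAR(50)  NOT NULL,
    Description VARCHAR(200) NULL
);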
Now that I'm on my Azure portal, this is how I'm going to provision my Azure SQL Database. This is what the home page of my portal looks like; I am using the preview version, so yours might look slightly different, because preview means it's still in beta. To create any resource, including an Azure SQL Database, I can open the left-hand menu and create a resource from there. Alternatively, what I like to do is search for the resource I'm creating, so if I look up SQL database I can create an Azure SQL Database from here. I simply click Add and I'm taken to the configuration page, where I need to go through those settings I talked through earlier, starting with the basics. I need to choose an Azure subscription to provision this Azure SQL Database to, and I'm using my internal subscription. I also need to specify the resource group this database will go into; I've actually created one already, but you can also create a new resource group from here, so I'm going to create a new one, call it demo, and place this Azure SQL Database inside it. Here I need to specify the database name, so I'll call it mydemodatabase, and that checks out. Now I need to create the virtual SQL server for my Azure SQL databases, so I come in here and create a new SQL server. I need to give it a unique SQL server name, one that no one has used before; if I use demo, that's probably going to be invalid because someone has already created a SQL server with that name, so I'll call it demo followed by some numbers, and hopefully no one has used that one. I also need to give it some credentials to log into my SQL server. Again, we can't use names like admin, which would be rejected, so I'll give it a server admin login name that's relevant to my SQL server, for example demouser, and input a password; make sure the password is strong enough, and re-enter it. The location is the Azure region we want to provision this Azure SQL Database to; currently the default is East US, but I can come in here and select any Azure region around the world to provision my Azure SQL Database to.
I'm just going to leave the default and click OK, and that sets up my SQL server for me. Here I can specify whether I want to use a SQL elastic pool; it's simply a yes/no toggle, and because this is a demo I'll leave the default, no. Compute and storage is essentially where I pick the size of my Azure SQL Database, so it's where I configure how much storage and compute resource I'll need to run it. I go into configure database and get a couple of tiers to choose from, general purpose, hyperscale and business critical, and if we scroll down a little we can see the price of the selected tier. I'm actually going to scale my database right down, because this is a demo database and I don't need that many resources, so I'll select the basic tier and scale it all the way down, and I can see the estimated cost for the smallest possible Azure SQL Database is about $6.85 a month. I hit apply and go to the next tab, which is networking.

With network connectivity we configure how we connect to our Azure SQL Database; remember that by default all access to our SQL database is blocked. I can enable public access or use a private endpoint; I'm going to open this up for public access, and here I can specify some of the firewall rules: do I want to allow Azure services and resources to access the server (probably, if I want to use an Azure virtual machine to access it), and do I want to add the current IP address of this device to allow access, which I'll do as well. Under additional settings I can specify the data source, so I can create a blank brand-new database, restore an existing one from backup, or use some of the sample data that Azure SQL Database provides to populate the new database. I'm going to select the sample option, which populates data from the AdventureWorksLT database. Down here we can also set up Azure Defender for SQL, which is the advanced threat protection that provides vulnerability assessment and threat protection for our SQL server; I'm not going to enable it for now. Then we move to the tags tab. You'll generally encounter this tab when you provision any Azure product or service: tags let you add metadata to your Azure resources, which you can then use to filter and find your resources once they're provisioned, so for example I might say that this particular database is a demo database. After that I simply hit review and create, which runs a validation test to check that all the configuration details I provided are correct, and once it checks out I hit create, and that submits my deployment to Azure to go and set up that Azure SQL Database for me. Because we're limited by time, I've actually set up an Azure SQL Database already, so let's go and look at that one instead: if I look up SQL database again I can find the one I set up before this training and go into it.
On the overview tab I can see some of the details about this particular Azure SQL Database; for example, it's sitting in the dp900 resource group, it's under my internal subscription, and this is the name of the SQL server it's actually running on. If I scroll down a little we see the query editor, which we can use to query our Azure SQL Database. I need to authenticate myself before I can execute any queries against this database, and I can choose between SQL Server authentication and Azure Active Directory authentication; I'll pick SQL Server authentication and put in the password for this database. Now I'm inside the query editor for this Azure SQL Database, which means I can start executing SQL queries against it. Like the demo one we just set up, I've used the sample data to populate this database, so if I open up Tables I can see some of the sample tables and data that are already in there. I can minimize some of these panels on the left to make more room on my screen, and let's execute a very simple SQL statement, a SELECT against the customers table, which is just going to return all the rows from the customer table so I can look at all its data. I simply hit run, and the query took one second to execute. If I move down a little and minimize the screen a bit, I can see the results that were returned, and this is all the data inside the customer table.

Now let's look at something a little more interesting. I can see there's a sales order detail table in my database, so let's change the query to select from it and hit execute; now I can see all the sales order data in the sales order detail table. It contains columns like sales order ID and, if I scroll across, order quantity and product ID, which tells me this is actually a foreign key that references the product ID in the products table. So let's reference the products table from the order details table. If, for example, I wanted to find all the products that appear in both the order details table and the products table, I can do an aggregation. I need to give the tables I'm querying from some aliases, because I'll be referencing the same column name in both and the aliases tell SQL which one I mean, so I'll give the products table the alias p, and I'll specify a WHERE clause, that conditional statement, saying I'm only interested in the rows where the product ID from the sales order detail table is the same as the product ID from the product table. If I hit execute, I get a result of 542, which tells me the total number of products that have been ordered is 542.
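Here's a minimal sketch of running those same two counting queries from Python with pyodbc instead of the portal's query editor. The connection string values are placeholders, and the SalesLT schema and table names come from the AdventureWorksLT sample loaded above; the counts in the comments are the ones seen in the demo.

```python
import pyodbc

# Placeholder server and credentials for the sample database created earlier.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=tcp:<server>.database.windows.net,1433;"
    "DATABASE=<database>;UID=<login>;PWD=<password>;Encrypt=yes"
)
cursor = conn.cursor()

# Rows in SalesOrderDetail whose ProductID matches a row in Product (an implicit join).
cursor.execute("""
    SELECT COUNT(*)
    FROM SalesLT.SalesOrderDetail AS o, SalesLT.Product AS p
    WHERE o.ProductID = p.ProductID
""")
print(cursor.fetchone()[0])   # 542 in the demo

# Total number of products in the catalogue.
cursor.execute("SELECT COUNT(*) FROM SalesLT.Product")
print(cursor.fetchone()[0])   # 295 in the demo
```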
Now what's interesting is that if I want to find out how many products exist, I can run a pretty similar query: SELECT COUNT of all the rows from the product table. I can highlight just that statement, which means only that query is executed, and if I hit run I find there are 295 products in the products table, which tells me that at least one product has been ordered more than once, based on the results I got from the sales order detail table. These are very simple queries, and you can write far more complicated and complex SQL, but this gives you a taste of the power you get with Azure SQL Database.

You can use a number of tools to query data held inside an Azure SQL Database: the Azure portal, which we just saw; the sqlcmd utility, from the command line on your device or from Azure Cloud Shell; and SQL Server Management Studio, Azure Data Studio or SQL Server Data Tools. To use these tools you first need to establish a connection to the database, so you provide those credentials to log into the server, for example the admin username and password. SQL Server Management Studio is a tool you download and run on your desktop, and it has typically been used for connecting to and executing SQL queries against on-premises SQL Server databases. SQL Server Data Tools is a tool you use inside Visual Studio to connect to your Azure SQL databases. Azure Data Studio is a graphical utility for creating and running SQL queries from your desktop; it's much more lightweight than SQL Server Management Studio and it's a great tool for connecting to Azure SQL databases, because that's largely what it's built for. The sqlcmd command-line utility can be run from your command line and is also available from Azure Cloud Shell: you specify parameters to identify the server you want to connect to along with the credentials, and after that you can execute your SQL queries through the command line just as you would in any of the other tools.

PostgreSQL also provides many tools you can use to connect to a PostgreSQL database and run queries. These include pgAdmin, a graphical user interface tool, and the psql command-line utility, and there are also a large number of third-party utilities you can use as well. So let's have a look at how we can use the pgAdmin tool to connect to an Azure Database for PostgreSQL. I'm back on my Azure portal now. I'm not going to take you through configuring and setting up the PostgreSQL server on Azure, because it's a very similar process to the Azure SQL Database; what I'm going to show you is how to use pgAdmin to connect and execute queries against the Azure PostgreSQL database. I can see this is the Azure PostgreSQL database I've already set up, and now I'm going to switch over to my pgAdmin tool to connect to it and execute some queries against it. I've already entered the connection details and login credentials, so I can simply open up the connection, and under databases I can see I've got four databases. Some of these are set up by PostgreSQL itself, and I've gone ahead and created one called mydemodatabase.
So if I come in here and right-click, I can create a query script, which lets me execute SQL queries against this database. I'll erase the default text, and this database should be completely empty, so let's go and create a table. I'm going to execute a CREATE TABLE statement and call the table customers. Inside my customers table I need to specify a couple of columns, so let's say a customer id, which is going to be the primary key and not null, and a customer name, which is going to be a varchar of 50 characters and also not null. I simply hit the play button to execute the statement, and we can see down here that the query has been executed successfully. Now I'm going to insert some values into my newly created customers table: INSERT INTO customers, and I also need to define the columns I want to insert values into, so customer id and customer name, and let's say I give the id a value of 1 and just use my name. If I highlight this, I'm going to insert one record into my customers table, and that's been executed successfully. Now if I do a SELECT all from my customers table, highlight that statement and hit execute, I can see the newly created customers table with the new row of data I inserted returned back to me as a result.
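The same steps can be scripted from Python with psycopg2, which is one common driver for PostgreSQL. This is only a sketch: the server name, database, login and password are placeholders, and the user name format shown (user@servername) is an assumption based on how Azure Database for PostgreSQL logins are commonly written; check your server's connection strings blade for the exact values.

```python
import psycopg2

# Placeholder connection details for an Azure Database for PostgreSQL server.
conn = psycopg2.connect(
    host="<servername>.postgres.database.azure.com",
    dbname="mydemodatabase",
    user="<adminuser>@<servername>",   # assumed login format; copy yours from the portal
    password="<password>",
    sslmode="require",
)
cur = conn.cursor()

# Create the customers table described in the demo.
cur.execute("""
    CREATE TABLE customers (
        customer_id   INT NOT NULL PRIMARY KEY,
        customer_name VARCHAR(50) NOT NULL
    )
""")

# Insert one row, then read the table back.
cur.execute(
    "INSERT INTO customers (customer_id, customer_name) VALUES (%s, %s)",
    (1, "Demo Customer"),
)
conn.commit()

cur.execute("SELECT * FROM customers")
print(cur.fetchall())
```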
Just as with PostgreSQL, there are many tools available to connect to your MySQL database that enable you to create and run scripts containing SQL commands. You can use the mysql command-line utility, which is also available through Azure Cloud Shell, or graphical tools on the desktop such as MySQL Workbench. So let's take a look at how we can use MySQL Workbench to connect to our Azure MySQL database and execute some SQL queries through it. Again, I've already pre-provisioned an Azure MySQL server and database, and now we'll use MySQL Workbench to access that MySQL database running on Azure and execute some queries against it. I'm in MySQL Workbench and I've already configured the connection to my Azure MySQL database, so I simply open up the connection, and I need to provide the password to authenticate; you can see the admin username and the server connection string, so I just put in my password, hit OK, and that authenticates me into my Azure MySQL database. Once I'm in, I've pre-defined some queries that I was executing earlier before this training, so I'll just highlight them and run through the same process of creating a table, but this time we're going to create an inventory table in our Azure MySQL database. I've already set up a mydemodatabase here, so I'm going to use it and create an inventory table that has an id column, a name and a quantity. I hit the little lightning button, which runs it, and that executes successfully. Now I'll insert a couple of rows, these three rows of values, into my new inventory table, so I highlight them and hit execute, and I've inserted three rows of data into my newly created inventory table. Then I'll just select all the rows from my inventory table, and I can see all the new values I've inserted: banana, orange and apple. Finally, this DROP TABLE statement is going to delete my inventory table and remove it from the database, and that's been executed, so if I highlight the SELECT from the inventory table again it should throw me an error, and there you see it, because the inventory table no longer exists in mydemodatabase anymore.

This concludes the first part of this virtual training; I hope you've enjoyed it so far. We covered module one, where we had a look at core data concepts, and module two, where we had a look at relational data in Azure. Thank you for your attendance and participation so far, and I hope you'll join me for part two of this Microsoft Virtual Training Day for Data Fundamentals.

Welcome back to part two of the Azure Virtual Training Day for Data Fundamentals. Today we're going to take a look at module three, where we'll explore some of the non-relational data services in Azure, and module four, where we'll look at some of the data analytics solutions in Azure. In module three we're going to look at non-relational data services in Azure such as Azure Cosmos DB and Azure Storage accounts, how to provision and deploy those services, and how to query non-relational data in Azure. So let's get started with module three.
Lesson one: explore non-relational data offerings in Azure. We'll start with Azure Table storage. Azure Table storage implements the NoSQL key-value model: the data for an item is stored as a set of fields, and the item is identified by a unique key. Azure Table storage is a scalable key-value store held in Azure, and you create a table using an Azure Storage account. In an Azure Table storage table, items are referred to as rows and the fields are known as columns, but don't let this terminology confuse you into thinking it's the same as a table in a relational database, because Azure Table storage allows you to store semi-structured data: all rows in a table must have a key, but apart from that the columns in each row can vary. Unlike traditional relational databases, Azure Table storage tables have no concept of relationships, stored procedures, secondary indexes or foreign keys. Data will usually be denormalized, with each row holding the entire data for a logical entity. For example, a table holding customer information might store the first name, last name, one or more telephone numbers and one or more addresses for each customer, and the number of fields in each row can differ depending on how many telephone numbers and addresses that customer has. In a relational database this information would be split across multiple rows in several tables, but in an Azure Table storage table it isn't, so we get much faster access to the details of a customer because all the data for that customer is available in a single row; you don't have to perform any complex joins across tables or figure out relationships between them.

To help ensure even faster access, Azure Table storage splits each table into partitions. Partitioning is a mechanism used to group related rows based on a common partition key, so rows that share the same partition key are stored together. The key in an Azure Table storage table therefore has two elements you need to be aware of: the partition key, which identifies the partition that contains the row, and the row key, which uniquely identifies each row within the same partition; items in the same partition are stored in row key order. Azure Table storage is commonly used for storing terabytes of structured data capable of serving web-scale applications, for example product catalogs for an e-commerce website or application, or customer information, where the data can be quickly identified and ordered by a composite key. In the case of the product catalog, the partition key could be the product category, like footwear, and the row key, which identifies a specific product in that category, could be something like mountain boots. Storing data sets in this denormalized way, without those complex joins, foreign keys or stored procedures, gives us much faster access to our data, which also makes Azure Table storage well suited to things like IoT systems, where we need to capture data very quickly and store it immediately. With IoT device sensor data, each device could have its own partition, and the data could be ordered by the date and time the sensor readings were collected.
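Here's a minimal sketch of that product-catalog idea using the azure-data-tables Python package. The connection string is a placeholder for your storage account's connection string, and the table, partition key and row key values just echo the footwear / mountain boots example above.

```python
from azure.data.tables import TableServiceClient

# Placeholder connection string for the storage account that holds the table.
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists(table_name="ProductCatalog")

# PartitionKey groups related rows (the product category); RowKey identifies a
# row within that partition. Any other columns can vary from row to row.
table.create_entity({
    "PartitionKey": "footwear",
    "RowKey": "mountain-boots",
    "Price": 129.99,
    "StockLevel": 42,
})

# A point lookup uses both parts of the key, so it goes straight to one partition.
entity = table.get_entity(partition_key="footwear", row_key="mountain-boots")
print(entity["Price"])
```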
Azure Blob storage is a service that enables you to store massive amounts of unstructured data, or blobs, in Azure, and just like Azure Table storage, you create blob storage under an Azure Storage account. Azure currently supports three different types of blob: block blobs, page blobs and append blobs. A block blob is handled as a set of blocks, where each block can vary in size up to 100 MB; a block blob can contain up to 50,000 blocks, giving you a maximum size of around 4.7 TB. A block is the smallest amount of data that can be read or written as an individual unit, and block blobs are used to store discrete, large binary objects that change infrequently. A page blob is organized as a collection of fixed-size 512-byte pages and is optimized for random read and write operations, so you can fetch and store data for a single page if you need to. A page blob can hold up to 8 TB of data, and Azure actually uses page blobs to implement the virtual disk storage for Azure virtual machines. An append blob is essentially a block blob optimized to support append operations: you can add blocks to the end of an append blob, but updating or deleting existing blocks isn't supported, and each block can vary in size up to 4 MB, with a maximum append blob size of just over 195 GB.

Inside an Azure Storage account you typically create blobs inside containers. A container provides a way to group related blobs together, you can organize blobs in a hierarchy of folders similar to the way you organize files on your file system or disk, and you can control who gets to read and write the blobs inside a container, or the container itself. Blob storage also provides three access tiers to help you balance access latency and storage cost. The first is the hot tier, which is enabled by default: you use this tier for blobs that are accessed frequently, and the blob data is stored on high-performance media. The next is the cool tier, which has lower performance and incurs reduced storage charges compared to the hot tier; you use the cool tier for data that isn't accessed as frequently. It's common for newly created blobs to be accessed frequently at first but less so as time passes, so you can create the blob in the hot tier and migrate it to the cool tier later, which helps you save on storage costs. The third tier is the archive tier, which provides the lowest storage cost but very high latency, because it's intended for historical data that you want to keep and don't want to lose but aren't really ever going to access. Blobs in the archive tier are effectively stored in an offline state: typical read latency for the hot and cool tiers is a few milliseconds, but for the archive tier it can take hours for the data to become available. To retrieve a blob from the archive tier you must first rehydrate it, meaning you move that blob back to the hot or cool tier before you can actually access it.
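A minimal sketch with the azure-storage-blob Python package shows the tiering idea: upload a file (which creates a block blob by default) and then move it to a cheaper access tier once it's no longer read often. The connection string, container name, blob name and local file name are all placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.create_container("demo-container")

# Uploading creates a block blob by default.
blob = container.get_blob_client("monthly-report.pdf")
with open("monthly-report.pdf", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Later, move the blob from the hot tier to the cool tier to reduce storage cost.
blob.set_standard_blob_tier("Cool")
```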
Some of the common use cases for Azure Blob storage include storing images or documents that you serve directly to a browser, storing files for distributed access, storing video and audio files that you want to stream, storing data for backup and restore, disaster recovery or archiving, and storing data for analysis by an on-premises or cloud-hosted service. To ensure availability, Azure Blob storage also provides redundancy: your blobs are always replicated three times in the region in which you created the storage account, but you can also select geo-redundancy, which replicates your data to a secondary region, also three times, at an additional cost. Other features Azure Blob storage supports include versioning, so you can maintain and restore earlier versions of a blob; soft delete, which, if you turn it on, enables you to recover any blob that has been deleted or overwritten by accident or mistake; snapshots, where a snapshot is a read-only version of a blob at a particular point in time; and the change feed, which provides an ordered, read-only record of the updates made to a blob, so you can use the change feed to monitor changes and perform operations.

Now, many on-premises systems comprising a network of in-house computers make use of file shares. A file share enables you to store a file on one computer and grant access to that file to users and applications running on other computers. This strategy works quite well for computers on the same local area network, but it doesn't scale as well for users located at different sites or when the number of users increases. Azure File storage enables you to create file shares in the cloud, in Azure, and you can then access these file shares from virtually anywhere as long as you have an internet connection. Azure File storage exposes file shares using the SMB 3.0 protocol, the Server Message Block protocol, which is the same file-sharing protocol used by many existing on-premises file shares. That means you can migrate your file shares to Azure, to the cloud, without having to modify many of the applications that access them. You can also control access to these file shares through the authentication and authorization services available via Azure Active Directory Domain Services. Just like Blob storage, you create Azure File storage within an Azure Storage account, and Azure File storage enables you to share up to 100 TB of data in a single storage account. Once you've created your storage account, you can upload files to Azure File storage either through the Azure portal or using tools like the AzCopy utility, and you can also use the Azure File Sync service to synchronize your local file shares with Azure File storage. Azure File storage is a fully managed service: your shared data is replicated locally within the region, it can also be geo-replicated to a secondary region, all data is encrypted at rest, and you can enable encryption for data in transit between Azure File storage and the applications communicating with the file share.
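As a rough sketch of working with a file share programmatically (in addition to mounting it over SMB), the azure-storage-file-share Python package can create a share and upload a file. The connection string, share name, directory and file names are placeholders.

```python
from azure.storage.fileshare import ShareClient

# Placeholder connection string and share name for the storage account.
share = ShareClient.from_connection_string(
    "<storage-connection-string>", share_name="team-files"
)
share.create_share()

# Create a directory in the share, then upload a local file into it.
share.create_directory("reports")
file_client = share.get_file_client("reports/summary.txt")
with open("summary.txt", "rb") as data:
    file_client.upload_file(data)
```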
Relational databases store data in relational tables, but sometimes the structure imposed by this model can be a bit too rigid, and that often leads to poor performance unless you have a lot of time to really fine-tune the implementation. Other models, collectively known as NoSQL databases, exist to help overcome those issues. These models store data in other types of structures, such as documents, graphs, key-value stores and column family stores. Azure Cosmos DB is a multi-model NoSQL database management system. Azure Cosmos DB manages data as a partitioned set of documents, where a document is a collection of fields typically identified by a key. The fields in these documents can vary, and a field can contain other child documents. Many document databases use a format called JSON to represent the document structure: in this format the fields in a document are enclosed in curly braces, and each field is prefixed by a name. A document can hold up to 2 MB of data, including small binary objects; if you need to store larger blobs as part of a document, you can use Azure Blob storage and add a reference to that blob inside your document.

Documents in an Azure Cosmos DB database are organized into containers, and the documents in a container are grouped together into partitions. A partition holds the set of documents that share a common partition key: you designate one of the fields in your documents as the partition key, and you need to select this key quite carefully because it's what relates your documents together. This approach helps reduce the amount of I/O (input and output) operations that a query might need to perform to retrieve the set of documents for a particular entity. For example, in an e-commerce system where we're recording the details of customers and the orders they've placed, we could partition the data by customer ID and store the customer and order details for each customer in the same partition, which means that to find the information and orders for a single customer we only need to query that one partition. You might see a bit of similarity between an Azure Cosmos DB container and a table in Azure Table storage, in that the data is partitioned and items — documents in Cosmos DB, rows in Table storage — are identified by some unique ID within a partition, but that's where the similarity ends. Unlike Azure Table storage, documents in a Cosmos DB partition aren't sorted by ID; instead, Azure Cosmos DB maintains a separate index that contains not only the document IDs but also tracks the value of every other field in each document. This index is created and maintained automatically, so you can use it to run queries, such as filtering on a particular field inside the documents in your container, without having to scan through the entire partition or all the documents in that container to find the data you're looking for.
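Here's a minimal sketch, using the azure-cosmos Python package, of that e-commerce partitioning idea: upsert an order document into a container assumed to be partitioned on /customerid, then read it back with a point read. The account URI, key, database, container and field names are all placeholders.

```python
from azure.cosmos import CosmosClient

# Placeholder account endpoint and key; the container is assumed to be
# partitioned on /customerid.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("ecommerce").get_container_client("orders")

# A document is just a JSON object; fields can vary from document to document.
container.upsert_item({
    "id": "order-10255",
    "customerid": "customer-71",
    "items": [{"productid": 39, "quantity": 2}],
    "status": "shipped",
})

# Reading by id plus partition key goes straight to the right partition.
doc = container.read_item(item="order-10255", partition_key="customer-71")
print(doc["status"])
```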
Cosmos DB provides several APIs that allow you to access the documents, and through these APIs it also provides support for other NoSQL database management systems: the Table API, the MongoDB API, the Cassandra API and the Gremlin API. The Table API is essentially an interface that allows you to use the Azure Table storage API to store and retrieve documents; the purpose of this API is really to allow you to switch from Table storage to Cosmos DB without requiring you to modify any of your existing applications. The MongoDB API is used to support any MongoDB databases you might already be using for your applications. MongoDB is another well-known document database with its own programmatic interface, and many organizations run it on premises; you can use the MongoDB API for Cosmos DB to enable a MongoDB application to run against Azure Cosmos DB without having to make any changes, so you can migrate the data from your MongoDB database to a Cosmos DB database running in Azure and continue to use the same applications to access the data. The Cassandra API is used to support Cassandra, which is a column family database management system, again one that a lot of organizations run on premises. The Cassandra API provides a Cassandra-like programmatic interface for Cosmos DB, where Cassandra API requests are mapped to Cosmos DB document requests, and just like the MongoDB API, its primary purpose is to enable you to quickly migrate your existing Cassandra databases and applications to Cosmos DB without having to make many changes. The Gremlin API implements a graph database interface to Azure Cosmos DB: a graph is a collection of data objects with directed relationships, and data held in a set of documents in Azure Cosmos DB can be queried using the Gremlin API, which allows you to perform graph-like queries over the data in your Cosmos DB. The principal use of the Table, MongoDB and Cassandra APIs is really to support existing applications that already use those data stores as their back end, their data tier; if you're building new applications and databases, you should really use the core SQL API or the Gremlin API.

Azure Cosmos DB is very suitable for a number of scenarios. In retail and marketing, Microsoft actually uses Azure Cosmos DB for its own e-commerce platforms that run as part of the Windows Store and Xbox Live, and Cosmos DB is used quite extensively in the retail industry, for example to store catalog data or for event sourcing in order-processing pipelines. In the gaming industry, the database tier is a crucial component of gaming applications: modern games perform graphical processing on mobile and console clients but typically rely on cloud services to deliver customized and personalized content like in-game stats, social media integration and high-score leaderboards. Games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience, so a game database needs to be very fast and able to handle massive spikes in request rates, especially during new game launches and new feature launches and updates. Azure Cosmos DB is also able to support IoT systems and devices: in IoT systems a lot of data is typically ingested very frequently, because it's generally being generated on a second-by-second basis, and with Azure Cosmos DB we're able to ingest and store this data very quickly. That data can then be used by analytics services like Azure Machine Learning, Azure HDInsight or Power BI to run analytics workloads on top of it, and you can also process the data in real time by triggering an Azure Function to filter and process it as it comes into your Azure Cosmos DB database.
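To illustrate the point about the MongoDB API accepting unchanged driver code, here's a minimal sketch using pymongo. The connection string is a placeholder for the one a Cosmos DB account created with the API for MongoDB exposes in the portal, and the database, collection and field names are just illustrative.

```python
from pymongo import MongoClient

# Placeholder: copy the connection string from the Cosmos DB account's
# Connection String blade (API for MongoDB).
client = MongoClient("<cosmos-mongodb-connection-string>")
collection = client["gamedb"]["leaderboard"]

# Ordinary MongoDB driver code: insert a document and read back the top score.
collection.insert_one({"player": "player-1", "score": 12850})
top = collection.find_one(sort=[("score", -1)])
print(top["player"], top["score"])
```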
Lesson two: explore provisioning and deploying non-relational data services in Azure. There are a couple of ways we can provision Azure Cosmos DB. We can provision it through the Azure portal, which is the same approach we've already seen with the Azure SQL Database. We can also provision Cosmos DB through the Azure command-line interface, or Azure CLI: you typically execute CLI commands from your command prompt or through Azure Cloud Shell, and the CLI is suitable if you need to automate service creation, since you can store CLI commands in scripts and run those scripts programmatically; the CLI runs on Windows, macOS and Linux computers. Azure PowerShell is another way you can provision an Azure Cosmos DB account: if you're familiar with PowerShell, this is the alternative to the Azure CLI, and again you can use PowerShell commands to script and automate tasks and to create and manage Azure resources, and like the CLI it's supported on Windows, macOS and Linux. You can also use Azure Resource Manager templates, or ARM templates, to execute your deployments. An ARM template is essentially a declaration of your infrastructure and its configuration as code, formatted as JSON, and that code can then be used to deploy and provision your Azure resources, like an Azure Cosmos DB account.

So let's take a look at how we can use the Azure portal to provision an Azure Cosmos DB database. I'm on my Azure portal now, and I'm going to provision an Azure Cosmos DB account, then create a database inside that account, and inside that database spin up a container for my documents. Again, I'll look up Cosmos DB from the search bar, click into it, and click Add. I need to provide a couple of configuration details, just like the Azure SQL Database we set up earlier, starting with the basics. I choose the subscription this Cosmos DB account will be deployed to, and I need to put it inside a resource group, so I'll look up the demo one I set up earlier. I also need to give my Cosmos DB account a name, which needs to be unique, so I'll just say cosmos followed by some random numbers. The API section is where you choose what kind of API you want to use against your Cosmos DB databases: if I open up this drop-down we can see that the core SQL API is supported, along with the MongoDB API, Cassandra, the Table storage API and Gremlin, and I'm going to pick the core SQL API. I'll leave notebooks switched off; it's still in preview. For location, this is where I pick which Azure region I want to deploy my Cosmos DB databases to, so I'll go and find East US, because that's where I've been provisioning all my services so far. Then under capacity mode you get to pick between provisioned throughput and serverless. The difference between the two is that provisioned throughput gives you dedicated resources for your Cosmos DB to run on, whereas serverless provisions resources for you only when you need them, so they're not constantly dedicated to you. If you've got very light workloads or dev/test scenarios you can use the serverless option, but for production workloads you should really consider the provisioned throughput option, because it gives you dedicated resources for your Cosmos DB databases.
We can also apply a free-tier discount to our Azure Cosmos DB account. The free tier can only be enabled once per subscription, and you get 400 RUs per second and 5 GB of storage for free. RUs means request units: you can think of them as the compute and the whole bunch of other resources needed to actually run your Cosmos DB databases, and the number you need increases as the throughput requirements of your Cosmos DB databases increase. If we applied this discount we'd actually save around $24 a month, but I'm not going to apply it, because I've already created an Azure Cosmos DB account, so it wouldn't be applicable to me. I can also choose what kind of Cosmos DB account this is, production or non-production; I'll leave non-production because this is a demo. And here I can configure things like geo-redundancy, multi-region writes and availability zones. Geo-redundancy, if I enable it, means I'll replicate my Cosmos DB databases to other Azure regions across the world, so if my primary database somehow encounters errors or a natural disaster happens, I can quickly and easily fail over to one of my other databases running in a region that isn't affected by that disaster. Multi-region writes, if I enable it, means I can have several master copies of my databases located in other Azure regions around the world, so I can perform write operations against multiple different nodes or databases across regions; with multi-region writes I can write to multiple master databases. Availability zones are all about provisioning Cosmos DB for high availability: your Cosmos DB databases will be placed in different availability zones, so if something happens to one zone, you're guaranteed there will always be other running instances in the other zones. I'm going to disable and turn off all of those features for now, just because this is a demo Cosmos DB, and head over to the networking tab. Similar to the Azure SQL Database, this is where you configure network connectivity: we can choose all networks, which means any network around the world — the internet, Azure, anywhere on premises — can access the Cosmos DB account; or selected networks, where the public endpoint only allows access to the Cosmos DB from a couple of networks; or the private endpoint, which blocks off all public access to the account. I'm going to go back and just enable all networks, and head over to the backup policy tab. The backup policy is where you specify how frequently to make backups of your Cosmos DB databases and how long to retain those backups, and you specify that with the backup interval and the backup retention settings; I'll leave all the default values and head to the encryption tab. Azure Cosmos DB supports data encryption, and you can either use Microsoft's service-managed encryption keys or bring your own encryption key to encrypt the data stored inside your Azure Cosmos DB databases.
I'm just going to leave the service-managed keys, because I don't have a key to provide to my Cosmos DB account, and head over to the tags section. Again, in the tags section we can add some metadata to our Azure Cosmos DB account to provide more information about what this particular account is being used for; I'm not going to add any tags, and I'll just head over to review and create. That runs a small validation check to make sure the configuration details I've given are correct and valid, and once that's passed I can go and create my Azure Cosmos DB account, which submits it to Azure to go and deploy the account for me. Once it has been deployed we can create an Azure Cosmos DB database, and then inside that database we can create a container, which is where our documents will live. Because we're short on time, I've actually already set up a Cosmos DB account to look at, so while this one deploys, let's have a look at the one I spun up earlier, my dp900 Cosmos DB account. Let's go into it, and again on the overview page we can see some of the details about this particular Azure Cosmos DB account, for example the endpoint used to access it, the region it's hosted and deployed to, and a bunch of other details as well.

Now if I go into the Data Explorer tab — this is like the query editor we saw in the Azure SQL Database, but specifically for Cosmos DB — I can directly create Azure Cosmos DB databases and containers from the Data Explorer, and as you can see I've already created a couple of databases: one called demodatabase and one called volcano. If we drill down into demodatabase, I can see I've already got a container inside this database called demovolcanodata. I can also come up here and create a new container. I need to give my container an ID, so I'll call it mydemocontainer, and I need to give this container a partition key, so I'll say the partition key will be /id for the document data I'm going to be uploading to this container. The throughput is where we specify the number of RUs to provision for this particular container: we can see the lowest starts at 400 and the highest we can provision here is 100,000 RUs. You can choose manual, which means if I specify 400, then 400 RUs will be provisioned for this container, or autoscale, which means the RUs allocated for this container will automatically scale depending on our workload needs. I'm going to turn autoscale off and go back to the smallest value. I can also configure the analytical store, which I'll leave off for now, and then I click OK, and that goes and creates my container inside my Azure Cosmos DB database.
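The same database and container can be created from code with the azure-cosmos Python package. This is just a sketch: the account URI and key are placeholders, and the database and container names simply echo the ones used in the demo.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder account endpoint and key from the Cosmos DB account's Keys blade.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")

database = client.create_database_if_not_exists(id="demo-database")
container = database.create_container_if_not_exists(
    id="my-demo-container",
    partition_key=PartitionKey(path="/id"),
    offer_throughput=400,   # manual throughput at the 400 RU/s minimum
)
print(container.id)
```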
The Data Migration Tool can be used to import data into your Azure Cosmos DB account from a variety of sources. It's available to download from GitHub, and the tool guides you through the process of migrating data into your Cosmos DB databases: you're prompted for the source of the data as well as the destination, which is essentially the Azure Cosmos DB database and the container you want to migrate the data into. The tool can be used either to populate an existing container or to create a new one if the specified container doesn't already exist, and you can also use it to export data from an Azure Cosmos DB container to a JSON file, either hosted locally or on Azure Blob storage, for example. If you have a lot of data, the Data Migration Tool can also make use of multiple concurrent threads to batch your data into chunks and load those chunks in parallel.

So let's take a look at how we can use the Azure Cosmos DB Data Migration Tool to load some data into the Cosmos DB database we just set up. I've opened the tool — this is what it looks like, and I downloaded it from GitHub — and I need to provide a couple of pieces of information, the source and the target. For the source we can import from a number of places — JSON files, MongoDB, SQL, CSV and so on — and I'm going to add a JSON file from my desktop, so I go to add files, pick this particular volcano data JSON file, and open it. Now I need to provide the target information, which is essentially my Azure Cosmos DB database to connect to and the container to import into, so I need to grab the Cosmos DB connection string, which means going back to my Azure portal and finding the Cosmos DB account I set up. This is my Azure Cosmos DB account; I go into it, scroll down a little to Keys, scroll a bit more to find the primary connection string, and copy it. Back in the Data Migration Tool I paste that into the connection string field. I also need to specify the database to connect to: if I go back to my Cosmos DB account and head into the Data Explorer, I can see it's called demodatabase — that's the one I created earlier — so I enter demodatabase into the Data Migration Tool and hit verify, which checks whether or not I can connect to my Cosmos DB account. The connection is successful, so I close that, and now I need to specify the collection, the container, to import my JSON documents into. Earlier I created the demo container for this, so I specify that container, set id as the partition key for the data, and change the throughput to match the one I assigned to the container, so 400 RUs. If I go next, I could specify some advanced configuration as well, but I'm not going to; I just hit next and get a summary of the configuration details for this particular data migration. Once you've checked that everything is accurate, we simply hit import, and that now starts importing my JSON documents into that container in my Azure Cosmos DB database.
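If you'd rather not use the migration tool, a rough equivalent for a small JSON file is to read it in Python and upsert each document with the azure-cosmos package. This sketch assumes the file contains a JSON array and that each document already carries an "id" field matching the partition key chosen earlier; the account URI, key, file name and database/container names are placeholders.

```python
import json
from azure.cosmos import CosmosClient

# Placeholder endpoint, key and names for the account, database and container.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("demo-database").get_container_client("my-demo-container")

with open("volcano-data.json") as f:
    documents = json.load(f)   # expects a JSON array of documents

for doc in documents:
    # Assumes each document already has an "id" field (the partition key used earlier).
    container.upsert_item(doc)

print(f"imported {len(documents)} documents")
```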
Now let's have a look at how authentication works in Azure. When a user or application attempts to authenticate, they need to specify the domain to authenticate against, and Azure Active Directory will typically require the user or application to provide some credentials, often a password, to prove their identity. It's also possible to configure multi-factor authentication to reduce the possibility that a fraudulent user with a stolen password can authenticate against the service: for example, you could specify that not only does a user have to provide a password, they also need to respond to a text or phone message sent to their mobile device before authentication completes. If authentication is successful, Azure Active Directory creates an authentication token for the user or application to use, which means that when that user or application attempts to access a resource, the authentication token is passed as part of the request to the service, and the token is then used by the service to authorize access to that resource.

Configuring an Azure Storage account gives you access to many storage services in Azure, such as Blob storage, File storage, Table storage and Queue storage, and we can even set up Azure Data Lake Storage through an Azure Storage account. So let's take a look at how we can set up Azure Data Lake Storage through a storage account in the Azure portal. I'm back in my Azure portal, and I'll look up storage account, click into it, ignore the previous ones I've got, and click Add. Again, I need to provide some basic settings to create my Azure Storage account. I'll start with the subscription, which is my internal subscription, and I'll put it inside the same resource group as both my SQL database and my Cosmos DB database. I need to give my storage account a name, and this name needs to be unique, which means it needs to be a storage account name that no one has used across Azure before, so let's call it storageaccount plus some numbers, and hopefully that's unique — yes, that passes. I also need to pick the Azure region to deploy my storage account to, and I'll just leave East US. I can choose the performance level here — do I want to create a storage account in the standard tier or the premium tier — and I'll leave standard. Under account kind, this is where I choose what type of Azure Storage account I want to set up; general-purpose v2 is the newest and latest version, so I'll keep that as is. Under replication, this is where you choose the redundancy options for your storage account. Redundancy, or replication, starts with LRS, locally redundant storage, which means three copies of your data are kept within the same Azure region and the same Azure data center. Zone-redundant storage means three copies of your data are kept within the same Azure region, but distributed across the different availability zones in that region. Geo-redundant storage means three copies are kept in your primary region and another three copies are placed in a secondary Azure region. There are other redundancy and replication methods as well; you're more than welcome to read the Microsoft docs on exactly how each of them works.
I'm just going to go with the lowest one, the LRS option, and then go next to networking. Again, this is where we configure how we want to connect to our storage account: do we want all networks to be able to access it, only a selected few networks, or no public access at all, with the storage account reachable only via a private endpoint. I'm going to open this up to all networks. You can also define some network routing rules if you want; I'll leave the default for now and head into data protection. Under data protection we configure how we want to be able to recover the data stored inside our storage account: we can turn on point-in-time restore for containers, soft delete for blobs, soft delete for file shares if we're using Azure File storage, versioning for blobs, and the change feed for blobs. I'm not going to turn any of these on, because this is a demo storage account, and I'll head into advanced. In advanced we can configure some of our security settings: do I want secure transfer enabled — I'm going to enable it — and I can select the TLS version, with the options 1.0, 1.1 and 1.2; I'll leave the default of 1.2. Under blob storage, this is where you pick the access tier for your blob storage account: if we enable public access it means we can have public access to our blob storage account, and the access tier is where you choose whether the account will hold blobs in the hot tier or the cool tier by default; I'll just leave these default configurations. Now, the Data Lake Storage Gen2 section is where we specify whether or not we want to create this storage account as a data lake. Data Lake Storage is essentially an extension built on top of Azure Blob storage that gives you a hierarchical, file-system-like namespace in which to store your blobs, and this is how you typically set up a data lake in Azure, so I'm going to enable it because I want to set up that Data Lake storage. I could also come in here and set up my Azure File storage if I enable it, and tables and queue storage, but I'm not going to do that for now. I'll go next to tags, and again we can add some metadata to describe exactly what type of storage account this is, so for example I can say this storage is a data lake, and then I go into review and create. That runs the validation check to confirm I've provided all the right configuration, and once that validation test has checked out I simply hit create, and that goes and creates my Data Lake storage for me.
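As a minimal sketch of using that Data Lake storage from code — and of the Azure Active Directory token-based access described earlier — the azure-identity and azure-storage-file-datalake Python packages can create a file system, a directory, and a file. The account, file system, directory and file names are placeholders, and this assumes the signed-in identity has been granted a data-plane role on the account, such as Storage Blob Data Contributor.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with an Azure AD token rather than an account key.
credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net", credential=credential
)

# The hierarchical namespace lets us work with file systems, directories and files.
file_system = service.create_file_system(file_system="raw")
directory = file_system.create_directory("sales/2021")

file_client = directory.create_file("orders.csv")
with open("orders.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```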
application 2 reads the same document; but if application 2 retrieves the data from a different server in the cluster than the one application 1 wrote to, there's a small window during which application 2 might see old data. This issue doesn't only arise with multiple applications: inconsistencies can occur within a single application if a write is directed to one server in the cluster but a subsequent read operation is performed against a different server. In Azure Cosmos DB there are several options that let you specify how inconsistencies should be handled, and these options are eventual, consistent prefix, session, bounded staleness and strong. Eventual is the least consistent option and is based on the situation we've just described: changes won't be lost and they will appear eventually, but they might not appear immediately; additionally, if an application makes several changes, some might be immediately visible and others delayed, so changes could appear out of order. The consistent prefix option ensures that changes appear in order, although there may be a delay before they become visible, and in that period an application might still see old data. With the session option, if an application makes a number of changes they will all be visible to that application, and in order; other applications may see old data, although any changes will appear in order, as they do for consistent prefix. This form of consistency is sometimes known as "read your own writes". Bounded staleness allows for a lag between writing and then reading the updated data; you typically specify the staleness as a window of time, or a number of previous versions, by which reads may be out of date. Finally, with the strong option, writes are only visible to clients after the changes have been confirmed as successfully written to all replicas, which means that while a change is being applied some replicas may be temporarily unavailable for reads, but when your application does read the data it is always consistent. So eventual consistency provides the lowest latency and the least consistency, while strong consistency results in the highest latency but provides the greatest consistency, and you should choose a consistency level that balances the performance and consistency requirements of your application. Now, although Azure Cosmos DB is described as a NoSQL database management system, the Core SQL API enables you to run SQL-like queries against your Cosmos database. These queries use a syntax very similar to SQL, but there are some differences, because the data you're querying in Cosmos DB is structured as documents rather than tables. The Cosmos DB SQL API supports a dialect of SQL for querying documents using SELECT statements, much like the SELECT statements in a relational database, but it returns results in the form of JSON documents rather than the table rows you'd typically get from a relational database, and all queries are executed in the context of a single container.
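As a quick illustration of where the consistency choice from the paragraph above shows up in code, here's a minimal sketch with the azure-cosmos Python SDK; the endpoint and key are placeholders, and the client-level setting can only relax, never strengthen, the default configured on the account.

    from azure.cosmos import CosmosClient

    # Placeholder endpoint and key for a Cosmos DB account.
    client = CosmosClient(
        "https://<account>.documents.azure.com:443/",
        credential="<primary-key>",
        consistency_level="Session",  # or "Eventual", "ConsistentPrefix", "BoundedStaleness", "Strong"
    )

    # Database and container names follow the demo ("demo" / "democontainer" are assumptions).
    container = client.get_database_client("demo").get_container_client("democontainer")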
To understand a bit of the Core SQL API: there's a SELECT clause, which starts with the SELECT keyword followed by a comma-separated list of properties to return, where the star means all the properties in the document. The FROM clause starts with the keyword FROM followed by an identifier that represents the source of the records, and you can use an alias to refer to that source; in a relational database query the FROM clause would typically contain a table name, but in the SQL API for Cosmos DB all queries are limited to the scope of a container, so the identifier in the FROM clause represents the container. The WHERE clause is optional; it starts with the keyword WHERE followed by one or more logical conditions that a document must satisfy to be returned, so you use it to filter the results of your Cosmos DB queries. The ORDER BY clause is also optional; it starts with the phrase ORDER BY followed by one or more properties used to order the output result set. A query can also contain a JOIN clause. In a relational database management system like Azure SQL Database, the JOIN clause connects data from different tables, but in the Core SQL API you use JOIN to connect fields in a document with fields in sub-documents that are part of the same document; you can't perform join operations across different documents the way you would across tables in a relational database. So let's go back to our Azure Cosmos DB account in the Azure portal, where we've already imported some data, and look at how to query it using the Core SQL API. I'm back in my Cosmos DB account and I'm going to navigate to the database and container I created earlier by going to Data Explorer. This opens the databases I've created, so if I open the demo database, there's the demo container. I can hover over the three dots next to it and create a new SQL query. By default it gives me a "select all from container" statement, so if I execute this it will retrieve all the documents currently inside this container; I'll highlight it and hit Execute Selection, and as you can see all the documents in the container are returned. That's not a very interesting query, so let's do something with it. I know this is volcano data, because that's what I imported earlier, so I can drill into the results and look at some of the fields. Let's try something more interesting: I'll do a select all from the container and add a WHERE condition. Looking at the data, there are fields in my documents such as country, so if I specify that the country field should be, say, Japan and hit Execute, I now get back all the documents for volcanoes located in Japan. As you can see, that's still a lot of documents coming back from the container.
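If you'd rather run the same kind of query from code than from Data Explorer, a parameterized query through the Python SDK might look like this; it reuses the container client from the earlier sketch, and the Country and Elevation property names follow the demo's narration, so adjust the casing to match your own documents.

    # Reuses the `container` client from the earlier sketch.
    query = ("SELECT c.id, c.Country, c.Elevation FROM c "
             "WHERE c.Country = @country ORDER BY c.Elevation DESC")

    items = container.query_items(
        query=query,
        parameters=[{"name": "@country", "value": "Japan"}],
        enable_cross_partition_query=True,  # needed unless the query targets a single partition key
    )

    for item in items:
        print(item["Country"], item.get("Elevation"))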
Back in Data Explorer, let's filter it a little bit more. Another field is elevation, so let's query on country and elevation together: I'll add an AND condition saying the elevation needs to be greater than 1000 and hit Run, and now the result set has come down to 56 documents matching the query. If I scroll a little I can see these are all the volcanoes in Japan whose elevation is greater than 1000. Now let's talk a little more about the aggregate functions that can be used with the Core SQL API in Cosmos DB. You can use aggregate functions to summarize data in SELECT queries, and you place them in the SELECT clause. The Core SQL API query language supports a handful of aggregate functions: COUNT, SUM, AVG, MAX and MIN. As you'd imagine, COUNT returns the number of instances of a particular field that occur in the result set, SUM returns the sum of all the instances of the field in the result set, AVG returns the mean value of all the instances of the field (so the values need to be numeric), and MAX and MIN return the maximum or minimum value of the field in the result set. So let's go back to the Azure portal and use some of these aggregate functions to query our volcano data. I'm back in Data Explorer, and I'm going to use COUNT to count all the volcanoes in Japan. I'll change the star to a COUNT over everything in the container, keep the condition that country equals Japan and remove the elevation filter, then highlight the query and execute it, and we can see that 111 is returned. That doesn't read very nicely on its own, so I'll give it an alias along the lines of "total volcanoes in Japan" and run the query again, and now I get a result saying the total number of volcanoes in Japan is 111.
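Here's roughly what those aggregate queries look like when run through the SDK rather than Data Explorer; the property names are again assumptions based on the demo data, and VALUE is used so each query returns a bare number instead of a wrapped JSON document.

    # COUNT: how many volcano documents are in Japan?
    count_query = "SELECT VALUE COUNT(1) FROM c WHERE c.Country = 'Japan'"

    # MAX: the highest elevation among Japanese volcanoes.
    max_query = "SELECT VALUE MAX(c.Elevation) FROM c WHERE c.Country = 'Japan'"

    for q in (count_query, max_query):
        result = list(container.query_items(query=q, enable_cross_partition_query=True))
        print(q, "->", result[0])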
Now let's say we want to use a different aggregate function; for example, we want to find the highest volcano in Japan, and we can use the MAX function to do that. I'll change the COUNT to MAX, change the argument to c.elevation, and change the alias as well, so this one will return the highest volcano in Japan; the FROM and WHERE clauses can stay as they are. I'll highlight this and execute it, and we see that the highest volcano in Japan has an elevation of 3776. These are just some of the ways you can use the Core SQL API to execute SQL-like queries against your Cosmos DB database. Now, Azure Blob Storage supports three types of blobs, as we've already discussed: block blobs, page blobs and append blobs. You typically use page blobs to implement virtual disk storage for your Azure virtual machines, for example, because they're optimized for random read and write operations. Append blobs are more suitable for data that grows in chunks, such as log data or archive data. Block blobs are best for static data, so they're the most appropriate type for storing things like image or video content. In an Azure storage account, to store blobs you need to create a container, and the container groups related blobs together so that you can organize them in a hierarchy of folders inside that container. So let's go back to the Azure portal and see how we can create a container, upload a blob, and then download that blob as well. I'm back in the portal, and what I want to do is create a container in my blob storage and then upload a blob, so I'm going to go into Storage Accounts and find the storage account I created earlier. On the overview page of the storage account, if I scroll down I can see the other types of storage you can create under a storage account: Containers is blob storage, File Shares is Azure file storage, Tables is where you create Azure table storage and Queues is where you create Azure queue storage. I'm going to click into Containers. I don't have a container set up for my blobs, so I'll go Add Container and give it a name; since I want to upload some image data I'll call it "images". I can also define the access level for this container, and if I open the drop-down there are three levels I can grant: private access means no anonymous access at all, blob access means anonymous read access is granted for blobs only, and container level access means anonymous read access is allowed for both the container and the blobs. Let's stick with private and create the container. Now that the container has been created, I'll go into it and simply upload some image data: I go Upload, select a local file from a folder on my desktop, pick one of the images at random, open it and upload it to my blob storage container, and now it's been uploaded.
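The same container-and-upload flow can be done from code; here's a minimal sketch with the azure-storage-blob package, where the connection string and local file name are placeholders and, as in the portal, a new container is private (no anonymous access) unless you say otherwise.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string from the storage account's "Access keys" blade.
    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Containers default to private (no anonymous) access.
    container_client = service.create_container("images")

    with open("volcano.jpg", "rb") as data:   # placeholder local file
        container_client.upload_blob(name="volcano.jpg", data=data)

    print([b.name for b in container_client.list_blobs()])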
Back in the portal, I'm going to close out of the upload blade. Because I've set the container to private access only, there's no anonymous access at all, so to reach this blob I need some form of authenticated, authorized access. What I can do is grab a SAS token, which gives me a time window during which I get authenticated access to this particular blob, as long as I can present the token. If I select the blob and click the three dots next to it, I can generate a SAS token for it; in other words, I can create a shared access signature (SAS) that will grant me access to this blob. Scrolling down a little, I can assign permissions such as read, create, write and delete; I'll just grant read access for now. I can specify a start and expiry time for when the blob can be accessed, and I'll leave the default values, and I can also choose which protocols to allow. Then I hit Generate SAS token and URL, and there you go: I've got the SAS token, which means that if I copy the SAS URL, open a new browser tab, paste it and hit Enter, I'm able to fetch this particular blob from my storage account. Let me close out of this and go back to my container. If you want to delete the blob it's quite simple: select the blobs you want to delete from the container and hit Delete. Because I haven't enabled soft delete, deleting this blob is permanent and I can't recover it, so I'll go ahead and delete it, and there you go, the blob has been removed. You can also use Azure file storage to store shared files, which means a user who connects to the file share can read and write files, provided they have the appropriate privileges, in much the same way they would use a folder on their local machine. To create a file share we use Azure file storage under the Azure storage account, much like the blob storage we saw earlier, so let's jump back into the Azure portal to see how to create one. I'm back in the same storage account, and this time instead of Containers I'm going to go to File Shares to create my Azure file storage. I'll open that up, go Add File Share and give the file share a name; I'll call it "shared files". I also need to allocate a quota to this file share, so I'll say 1 gigabyte, and I can select the tier for the file share, so whether the files are kept in the cool, hot, transaction optimized or premium tier; I'll leave the default of transaction optimized and hit Create, and there you go, I've created my Azure file share. Now that we've created a file share in Azure file storage, we can use a utility to start copying files over, whether from our local desktop or from some other source, and the utility we'll use is a tool called AzCopy, so let's head back to the Azure portal to see how. I'm back in my Azure portal with the file share I created earlier.
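Going back to the blob SAS step for a moment, here's how you might generate the same kind of read-only, time-limited token in code; the account name, key and blob name are placeholders carried over from the earlier sketch.

    from datetime import datetime, timedelta
    from azure.storage.blob import generate_blob_sas, BlobSasPermissions

    # Placeholders: account name and key come from the storage account's "Access keys" blade.
    sas = generate_blob_sas(
        account_name="<account>",
        container_name="images",
        blob_name="volcano.jpg",
        account_key="<account-key>",
        permission=BlobSasPermissions(read=True),        # read-only, like the portal demo
        expiry=datetime.utcnow() + timedelta(hours=1),   # token stops working after an hour
    )

    url = f"https://<account>.blob.core.windows.net/images/volcano.jpg?{sas}"
    print(url)  # pasting this URL into a browser downloads the blob while the SAS is valid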
If I scroll down, here's the file share called "shared files" that I created earlier. What I want to do now is generate another shared access signature, this time one that grants access to the file service, so that I can start copying files over. Under Shared Access Signature I'm going to allow access to the file service only, turn the other services off, and select all of the resource types. I can choose the permissions, but I'll leave the defaults that are already selected, and I could configure the start and expiry date and time, but I'll leave those as they are too. Then I generate the SAS and connection string, and there's my file service SAS URL, which I copy. The way I'm going to copy files over is with the Azure Cloud Shell, which already includes the AzCopy utility. To open it I just launch the Cloud Shell from the top of the portal; it supports both Azure CLI (bash) and Azure PowerShell commands, and you can toggle between them, but I'm already in bash so I'll stay there. First I need to upload a file to my Cloud Shell so that I can then copy it over to the file share, so I'll use the upload button and pick another of those images from earlier, called windows1.0.webp, and open it. Now I just need to run the AzCopy utility to copy the file that's sitting in my Cloud Shell over to the file share: the command is azcopy copy followed by the source file name, windows1.0.webp, and then, in quotation marks, the destination, which is the file service SAS URL I copied earlier, because that's what gives us authenticated access to the file share. I do need to make one change to that URL: the question mark is where the SAS token starts, and just before it I need to insert the destination path, which is the file share name, shared files, followed by the name of the image file I want to copy over, windows1.0.webp. When I press Enter, AzCopy copies the file from Cloud Shell over to my file share and reports that the job has completed. So if I close the Cloud Shell, go back to my file share and open it up, there you go, the windows1.0 image file has been uploaded to my file share.
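If you don't want to shell out to AzCopy, the same upload can be done with the azure-storage-fileshare Python package; this sketch assumes the share already exists, and the connection string, share name and file name are placeholders.

    from azure.storage.fileshare import ShareClient

    # Placeholders: use the storage account's connection string and your actual share name.
    share = ShareClient.from_connection_string("<connection-string>", share_name="<share-name>")

    file_client = share.get_file_client("windows1.0.webp")
    with open("windows1.0.webp", "rb") as data:
        file_client.upload_file(data)   # equivalent of the azcopy copy step in the demo

    print([f["name"] for f in share.list_directories_and_files()])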
This concludes module 3; I hope you've learned a little more about non-relational data services in Azure, so let's take a ten-minute break, and after the break we'll start exploring module 4. Welcome back from the break. Now we're going to take a look at module 4, where we explore data analytics solutions in Azure. First we'll look at what a modern data warehouse looks like, then we'll take a look at some of the data ingestion services in Azure, we'll explore some of the data storage and processing capabilities in Azure, and finally we'll wrap up with Microsoft Power BI. So let's get started with module 4, lesson one: examine components of a modern data warehouse. Data from relational and non-relational services is retrieved and formatted using Azure Data Factory, a process known as ingestion, and the formatted data is then written to Azure Data Lake Storage. Azure Data Lake Storage lets you store large volumes of data quickly and easily prior to analyzing it. At this point the data can proceed down two paths: it can be converted into a normalized format suitable for analysis and stored in Azure Synapse Analytics, or alternatively you can use Azure Databricks to perform other forms of data preparation, for example additional transformations or cleaning to remove anomalies and generally tidy up the data set, and then store the cleaned data in Synapse Analytics if you need to. Azure Synapse Analytics then acts as a hub holding this cleaned business data, and you can perform detailed analysis of the cleaned and processed data using Azure Analysis Services, which enables you to dig deeper into the data and generate insights from the information it contains. Microsoft Power BI can then take this information and use it to generate graphs, charts and reports, and Power BI can also run ad-hoc queries against the data processed by Databricks. A modern data warehouse might contain a mixture of relational and non-relational data, including files, social media streams and Internet of Things (IoT) sensor data. Azure provides a collection of services you can use to build a data warehouse solution, including Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics and Azure Analysis Services, and you can use tools like Power BI to analyze and visualize the data, generating reports, charts and dashboards. A typical large-scale business requires a combination of up-to-the-second data and historical information. The up-to-the-second data might be used to monitor real-time, critical manufacturing processes where instant decisions are needed; other examples include streams of stock market data, where current prices might be required to make informed, split-second buy or sell decisions. Historical data is equally important because it gives the business a more stable view of trends and of how it's performing over time; for example, a manufacturing organization might need the volume of sales by product across a month, a quarter or a year to decide whether to continue producing certain items, or whether to increase or decrease production according to seasonal fluctuations, so historical data can be used to generate insights that help make those business decisions. Historical data is generally batch processed at regular intervals; it could be based on live sales data captured through a continuous stream, collected into batches and then submitted to
a batch job to be processed one batch at a time, or several batches at a time. Any modern data warehouse solution must also be able to provide access both to the streams of raw data and to the processed information derived from that raw data. Azure Data Factory is a data integration service, and its main purpose is to retrieve data from one or more data sources and convert it into a format you can process. The data sources might hold data in various formats and contain a lot of noise that you need to filter out first, so with Azure Data Factory you can extract and filter the data, keep what's relevant and interesting and throw out the rest. Even the interesting data might not be in a suitable format for processing by the other services in your data warehouse solution, so you typically need to transform it as well; for example, your data might contain dates and times formatted in different ways, and you can use Azure Data Factory to standardize them into a single uniform structure. Data Factory can then write the ingested data out to a data store for subsequent processing. You typically define the work performed by Azure Data Factory as a pipeline of operations, and a pipeline can run continuously or as data is received from a data source; you can create these pipelines using the graphical user interface in Azure Data Factory, or you can write your own code. A data lake is a repository for large quantities of raw data. Because the data is raw and unprocessed it's very fast to load and update, but it hasn't been put into a structure suitable for efficient analysis, so you can think of a data lake as a staging point for your ingested data before it's massaged and converted into a format suitable for analytics. The difference between a data warehouse and data lake storage is that a data warehouse stores large quantities of processed data, while data lake storage holds large quantities of raw data that still needs to be processed and converted into a format that supports efficient analysis. Azure Data Lake Storage combines the hierarchical directory structure and file system of a traditional file system with the security and scalability provided by Azure; it's essentially an extension of Azure Blob Storage organized into a near-infinite file system. Azure Databricks is an Apache Spark environment running on Azure that provides big data processing, streaming and machine learning capabilities. Apache Spark is a highly efficient data processing engine that can consume and process large amounts of data very quickly, and there is a significant set of Spark libraries you can use for tasks like SQL processing and aggregations, or to build and train your own machine learning models on your data through Azure Databricks. Azure Databricks provides a graphical user interface where you can define and test your processing step by step before submitting it as a set of batch tasks, and you can create Databricks scripts and query data using languages such as R and Scala. You can also write your Spark code using notebooks: a notebook contains cells, each cell contains a separate block of code, and when you run a notebook the code in each cell is passed to Spark for execution, cell by cell.
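To make the notebook idea concrete, here's a small PySpark sketch of the kind of thing individual notebook cells might contain; the file paths, column names and storage details are placeholders, not part of the course demo.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Cell 1: get (or reuse) the Spark session and read raw CSV data from the lake.
    spark = SparkSession.builder.appName("demo-cleanup").getOrCreate()
    raw = spark.read.option("header", True).csv("abfss://logs@<datalake>.dfs.core.windows.net/sales/")

    # Cell 2: clean and reshape - drop incomplete rows and standardize a date column.
    clean = (raw.dropna(subset=["order_id"])
                .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd")))

    # Cell 3: aggregate and write the result back out in an analysis-friendly format.
    summary = clean.groupBy("order_date").agg(F.count("order_id").alias("orders"))
    summary.write.mode("overwrite").parquet("abfss://logs@<datalake>.dfs.core.windows.net/curated/daily_orders/")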
Azure Databricks also supports structured stream processing, so Databricks can perform computations incrementally, continuously updating results as a continuous stream of data arrives. Azure Synapse Analytics is an analytics engine designed to process large amounts of data very quickly. Using Synapse Analytics you can ingest data from external sources, such as flat files, Azure Data Lake or other database management systems, and then transform and aggregate that data into a format suitable for analytics processing. You can then perform very complex queries over this data and generate reports, graphs and charts from the results. Reading and transforming data from an external source can consume a lot of resources, which is why Azure Synapse Analytics lets you store the data you've read and processed locally within the service; this approach means you can repeatedly query the same data without the overhead of fetching and converting it each time, and you can also use this data as input to further analytical processing using Azure Analysis Services. Azure Analysis Services enables you to build tabular models to support your online analytical processing (OLAP) queries. You can combine data from multiple sources, like Azure SQL Database, Azure Synapse Analytics, Azure Data Lake Storage, Azure Cosmos DB and many more, and use these sources to build models that incorporate your business knowledge. A model is essentially a set of queries and expressions that retrieve data from the various data sources and can then be used to generate results; the results can be cached in memory for later use, or calculated dynamically, directly from the underlying data sources. Analysis Services includes a graphical designer to help you connect data sources together and define queries that combine, filter and aggregate data, and you can explore this data from within Analysis Services or use tools like Microsoft Power BI to visualize the data presented by these models. Azure Analysis Services has some overlap with Azure Synapse Analytics, but it's better suited to processing data on a much smaller scale. Typically you use Azure Analysis Services for small volumes of data, a few terabytes for example, for correlating multiple sources, and for high concurrency, so that thousands of users can connect to it at the same time; it also supports detailed analysis, so you can really drill into your data or use functions from Power BI to get detailed insight, and you can build very rapid dashboards from your data. A lot of scenarios benefit from combining Synapse Analytics and Analysis Services. Say you've got a large amount of ingested data that requires pre-processing: you'd typically use Synapse Analytics to read that data and manipulate it into a model that contains the business information rather than the large amount of raw data, and then use Analysis Services to investigate, drill down and get insights from those business models. The scalability of Synapse Analytics gives it the ability to crunch terabytes of data down into smaller, succinct data sets that summarize and aggregate it; you can then use Azure Analysis Services to perform the detailed interrogation of that information, and visualize the results through Power BI.
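Before moving on, here's a quick sketch of the structured streaming support mentioned at the start of this section, showing an aggregate that is updated incrementally as new data arrives; the input folder, schema and query name are placeholders, and a file-based source plus an in-memory sink are used just to keep the example self-contained.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("demo-streaming").getOrCreate()

    schema = (StructType()
              .add("device_id", StringType())
              .add("reading", DoubleType())
              .add("event_time", TimestampType()))

    stream = spark.readStream.schema(schema).json("/mnt/landing/iot/")  # placeholder folder

    per_device = stream.groupBy("device_id").agg(F.avg("reading").alias("avg_reading"))

    query = (per_device.writeStream
             .outputMode("complete")   # rewrite the full aggregate on each update
             .format("memory")         # in-memory sink, handy for demos
             .queryName("device_averages")
             .start())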
Azure HDInsight is a big data processing service that provides a platform for technologies such as Spark in an Azure environment. HDInsight implements a cluster model that distributes processing across a set of computers; this model is similar to the one used by Synapse Analytics, except that the nodes run the Spark processing engine rather than Azure SQL Database, and you can use Azure HDInsight in conjunction with Azure Synapse Analytics. As well as Spark, HDInsight supports streaming technologies such as Apache Kafka, and the Apache Hadoop processing model. Hadoop is an open-source framework that breaks large data processing problems down into smaller chunks and distributes them across a cluster of servers, in a similar way to how Synapse Analytics operates. Azure HDInsight also supports Apache Hive, a SQL-like query facility that you can use within an HDInsight cluster to examine data held in different formats, and you can use it to create, load and query external tables in a similar way to how you would use PolyBase in Azure Synapse Analytics. Lesson two: explore data ingestion in Azure. Data ingestion is typically the first part of any data warehousing solution, and it's arguably the most important part, because if you lose any data at this point, the resulting information can be inaccurate and fail to represent the facts you're basing your business decisions on. In a big data system, data ingestion has to be fast enough to capture almost all the relevant data heading your way, and you also need enough compute power to process that data in a timely manner. We can ingest data using Azure Data Factory: as it ingests the data, Data Factory can clean, transform and restructure it before loading it into a repository such as a data warehouse, because once the data is in the warehouse we want to analyze it, so it needs to be ready for those analytics workloads before it's loaded. Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers. You can load static data, but you can also ingest streaming data; loading data from a stream supports real-time solutions for data that arrives quickly or changes rapidly, and using streaming you can have Azure Data Factory continuously update the information in the data warehouse with the latest data. We can also ingest data using PolyBase. PolyBase is a feature of SQL Server and Azure Synapse Analytics that lets you run Transact-SQL (T-SQL) queries that read data from external data sources, and it makes those external sources appear like tables in a SQL database. Using PolyBase we can read data managed by Hadoop, Spark and Azure Blob Storage, as well as other database systems such as Cosmos DB, Oracle, Teradata and MongoDB. PolyBase also enables you to transfer data from an external data source into a table in Azure Synapse Analytics or SQL Server, and to copy data from Synapse Analytics or SQL Server out to an external data source, and you can run queries that join tables in a SQL database with the external data, which enables you to perform analytics that span multiple data sources.
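To show what such a cross-source query might look like from Python, here's a hedged sketch; it assumes an external (PolyBase) table named dbo.StagedSales has already been defined over files in the data lake, and the server, database, credentials, table and column names are all placeholders.

    import pyodbc

    # Placeholder connection details for a dedicated SQL pool (or SQL Server instance).
    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:<server>.sql.azuresynapse.net,1433;"
        "Database=<database>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
    )

    # dbo.StagedSales is assumed to be an external table over files in the lake;
    # dbo.Products is an ordinary table in the database.
    sql = """
    SELECT p.ProductName, SUM(s.Amount) AS TotalSales
    FROM dbo.StagedSales AS s
    JOIN dbo.Products AS p ON p.ProductId = s.ProductId
    GROUP BY p.ProductName;
    """

    for row in conn.cursor().execute(sql):
        print(row.ProductName, row.TotalSales)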
We can also ingest data using SQL Server Integration Services. SQL Server Integration Services, or SSIS, is a platform for building enterprise-level data integration and data transformation solutions. You can use SSIS to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL database objects and data. SSIS is part of Microsoft SQL Server, and with it we can extract and transform data from a variety of sources, such as XML data files, flat files and relational data sources, and then load the data into one or more destinations. You can use the graphical tools provided by SSIS to create these solutions without writing a single line of code, or you can program against the extensible Integration Services object model to create packages programmatically and code custom tasks and other package objects. SSIS is typically used as an on-premises utility; however, you can also use Azure Data Factory to run your existing SSIS packages as part of a Data Factory pipeline in Azure, which lets you get started building your Azure Data Factory pipelines quickly using your existing SSIS packages without having to rewrite the transformation logic. Azure Data Factory uses a number of different resources: linked services, datasets and pipelines. Data Factory moves data from a data source to a destination, and a linked service provides the information Data Factory needs to connect to that source or destination; for example, you can use an Azure Blob Storage linked service to connect a storage account to Data Factory, or an Azure SQL Database linked service to connect to a SQL database. A pipeline is a logical grouping of activities that together perform a task, and the activities in the pipeline define the actions to perform on your data; for example, you might use a copy activity to copy data from a source dataset to a destination dataset, you could include activities that transform the data as it's transferred, or you might combine data from multiple sources together. Pipelines don't have to be linear: you can include logic activities that repeatedly perform a series of tasks while some condition is true, using a ForEach activity for example, or that follow different processing paths depending on the outcome of previous processing, using an If Condition activity. You can run a pipeline manually, or you can arrange for it to be run later using a trigger. A trigger lets you schedule a pipeline to run according to a planned schedule, say every Saturday night, or at repeated intervals such as every minute or every hour, or when an event occurs, for example when a file lands in Azure Data Lake Storage or a blob is deleted from Azure Blob Storage.
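As a hedged sketch of what a manual, on-demand run might look like outside the portal, here's the azure-mgmt-datafactory management SDK being used to start an existing pipeline; the subscription, resource group, factory and pipeline names are placeholders and the pipeline is assumed to exist already.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Start an on-demand run of an existing pipeline (schedule and event triggers are configured separately).
    run = adf.pipelines.create_run(
        resource_group_name="demo-rg",
        factory_name="<factory-name>",
        pipeline_name="<pipeline-name>",
    )
    print("Started pipeline run:", run.run_id)

    # Check on the run later.
    status = adf.pipeline_runs.get("demo-rg", "<factory-name>", run.run_id)
    print(status.status)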
A dataset in Azure Data Factory represents the data you want to ingest (the input) or store (the output). If your data has a structure, the dataset specifies how it's structured; not all datasets are structured, though, because blobs held in Azure Blob Storage, for example, are unstructured data. A dataset connects to an input or an output using a linked service. For example, if you're reading and processing data from Azure Blob Storage, you create an input dataset that uses a blob storage linked service to specify the details of the storage account, and the dataset itself specifies which blob to ingest and the format of the information in the blob, whether it's binary, JSON, delimited text and so on. If you're using Azure Data Factory to store data in a table in a SQL database, you'd define an output dataset that uses a SQL Database linked service to connect to that database and specifies which table to write the output data to. Lesson three: explore data storage and processing in Azure. Data lives in many locations throughout an organization, and when you design your cloud data solution you'll want to ingest your raw data into a data store for analysis. A common approach with Azure Synapse Analytics is to extract the data from wherever it's currently stored, load it into an analytical data store, and then transform and shape it so that it's ready for analysis. Azure Synapse Analytics is particularly suitable for this approach: using Apache Spark and automated pipelines, Synapse Analytics can run parallel processing tasks across a massive data set and perform big data analytics. Remember that the term big data refers to data that is too large or too complex for traditional database systems, so systems that process big data need to perform rapid data ingestion and processing, have the capacity to store the results, and have sufficient compute power to run analytics over those results. Azure Databricks, as I mentioned, is an analytics platform optimized for Azure; it was designed by some of the founders of Apache Spark, and it's integrated with Azure to provide a one-click setup that streamlines workflows and provides an interactive workspace for collaboration between data scientists, data engineers and business analysts. With Databricks we can process data held in many different types of storage, including Azure Blob Storage, Data Lake Storage, Hadoop storage, flat files, databases and data warehouses, and Databricks can also process streaming data; it uses an extensible, driver-based architecture to do so. The processing engine is provided by Apache Spark, a parallel processing engine that supports large-scale analytics. You typically write application code that consumes data from one or more sources, then merges, reformats, filters and remodels that data and stores the results. Spark distributes the work across a cluster of computers, each of which can process its data in parallel with the others, which reduces the time required to perform the work, and Spark is designed to handle massive quantities of data. You can write your Spark application code in several languages: Spark supports Python, R, Scala, Java and SQL, and it has a number of libraries for these languages, including modules for machine learning, statistical analysis, linear and non-linear modeling, predictive analytics and graphics, so it can support complex analytics tasks as well as optimizing your workload for a cluster environment. You typically write your Databricks applications using a notebook; again, a notebook contains a series of cells and each cell contains a block of code, so one cell might contain code that connects to a data source, the next cell might read the data from that source and convert it into an in-memory model, the cell after that might
plot a graph, and the final cell after that might save the data from the in-memory model out to a repository. Azure Synapse Analytics is made up of several components: the Synapse SQL pool, the Synapse Spark pool, Synapse pipelines, Azure Synapse Link and Synapse Studio. A Synapse SQL pool is a collection of servers running Transact-SQL (T-SQL); T-SQL, if you remember, is the dialect of SQL used by Azure SQL Database and Microsoft SQL Server, so you typically write your data processing logic in T-SQL and execute it on a Synapse SQL pool. A Synapse Spark pool is a cluster of servers running Apache Spark to process data; you can write your data processing logic using Python, Scala or C# (C# is supported through the .NET libraries for Apache Spark), and Spark pools also support Azure Machine Learning through integration with SparkML and the Azure ML packages. A Synapse pipeline is a logical grouping of activities that together perform a task, just as in Data Factory: the activities in the pipeline define the actions to perform on your data, so you might use a copy activity to copy data from a source dataset to a destination dataset, include activities that transform the data as it's transferred, or combine data from multiple sources as part of the pipeline. Azure Synapse Link is a component that lets you connect to some of the other Azure services, like Cosmos DB, so that you can perform near-real-time analytics over operational data stored in a Cosmos DB database. And Azure Synapse Studio is a web user interface that gives data engineers access to all the Synapse Analytics tools: you can use Synapse Studio to create your SQL or Spark pools, to define and run your Synapse pipelines, or to configure links to your external data sources. Now let's have a look at a demo where we load data into Azure Synapse Analytics. Imagine you're part of a team that's analyzing house price data; the data set you've received contains house price information for several regions, and your team needs to report on how house prices in each of those regions have varied over the last few months. To achieve this we can ingest the data into Azure Synapse Analytics, and we'll use Azure Data Factory to perform the task, so let me jump into the Azure portal to show you how. Back in the portal, the first thing I want to create is a storage account, so I'll look up storage accounts and create one: I'll put it into my demo resource group, give it a housing-related name that's fairly unique, deploy it to the East US Azure region, leave most of the defaults, change the replication to locally redundant storage and go straight to Review and Create, because what I really want in this storage account is an Azure file share; then I hit Create. While the storage account is being created, I'm also going to set up Azure Synapse Analytics, so I'll search for Synapse, select Azure Synapse Analytics (workspaces preview), go Add and again provide the basic details: I'll put it into the demo resource group, and now I need to give the Synapse workspace a name, so I'm just going to
call it "demo synapse analytics", and that checks out. I'll put it into the East US region again, and now I need a Data Lake Storage account, because I don't have one, so I'll create a new one; I'll try "demo data lake", but someone's taken that, so "data lake housing" it is. I also need to give it a file system name, so I'll create a new one called "logs", and I'll check the box to assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account. Under Security I need to provide SQL administrator credentials; I'd recommend changing the default admin username, but I'm just going to leave it and enter and confirm a password. I'll go Next, where you can allow connections from all IP addresses or disable that, skip Tags, go to the summary and hit Create. While the Synapse workspace is being created, let's check on the deployment of the storage account. The storage account has been created, so I'll go to that resource, scroll down the left-hand menu to File Shares under file storage, and create a new file share; I'll call it "housing", allocate some quota, select the hot tier and hit Create. Now that the file share has been created, I'll go into it and upload a file from my local PC: I go Upload Files, select the CSV file and upload it, and now it's there. It looks like the Synapse workspace is still being created, so let's use the time to create a data factory. I'll look up Data Factories and go Add: I'll use my internal subscription, put it into the demo resource group, pick the same East US region I used for the other two resources, and give the data factory a name; I'll try "demo adf", but that's not available, so let's call it "adf demo housing". You can select between two versions of Data Factory, and version 2 is the latest, so I'll leave that as is. On the Git configuration tab you can connect a repository in Azure DevOps or GitHub for version control; I'll do that setup later, so I'll tick the "configure Git later" box. Under Networking you can select your connectivity method; I'll leave the public endpoint for now, skip Tags and go to Review and Create. With that created, let's check on the Synapse workspace: if I go to the resource group called demo I can find everything I've just created, so filtering by name there's the storage account, the data lake storage account, the Synapse workspace and the data factory. Let me go into the Synapse workspace; it says the resource is still being provisioned, which is fine, because I actually set one up before this session so we don't have to wait. Let me go and find the one I set up earlier: it's under the dp900 resource group, and if I scroll a little there's that Synapse Analytics workspace. This one's called
"my dp900 synapse", and what I'm going to do is scroll down the overview page a little and open Synapse Studio, because once a workspace is provisioned this is what we want to open. From the left-hand side of Synapse Studio I can access my data, and in this workspace I've already set up one database in my SQL pool; it's called "my synapse sql", and inside it I have one table at the moment called housing data. That's what we want to set up in the Synapse workspace we're currently deploying, so let's check how that deployment is going: I'll go back into my resource group and find my Synapse workspace, and it looks like it has just finished provisioning. From here I'll add a new SQL pool; I need to give it a name, so "demo sql pool", and I'm going to scale the performance level right down because this is a demo, but you can increase it as you need. Under Additional Settings I can choose a data source to initialize the SQL pool from; I'll choose none, skip Tags, go straight to Review and Create, and create it. Again that takes a few moments to provision, and once it's done I'll open Synapse Studio and create that table called housing data so that I can eventually ingest the data stored in the file share into Synapse Analytics. Our SQL pool has just finished deploying, so let's go to the Synapse workspace, click Go To Resource, and open up Synapse Studio. Once Synapse Studio has opened, I'll go into the Data section, open Databases, and there's the demo sql pool we just created. What I want to do is create a table, so I'll go New SQL Script, empty script, and run a CREATE TABLE statement against this database; I'll paste in a predefined script I prepared earlier, which creates the housing data table with fields like the region name, the month, the average price and so on, and hit Run. We can see the query executed successfully in a few milliseconds, and the table is now in our SQL pool database. Now I'll go back to the Azure portal and search for the data factory I created earlier, adf demo housing; I'll go into it and click Author and Monitor, which opens up Azure Data Factory. From Data Factory I can create pipelines, create data flows, create a pipeline from a template, copy data, configure SSIS integration or set up a code repository. I'm going to choose Copy Data, and I need to specify a few things: I'll call this my demo copy pipeline, I can give it a description, and I'll select Run Once Now, then hit Next. Next I need to create a source data store, so I'll create a new connection, look up the file storage where I uploaded my housing data file, hit Continue, and call this connection "azure file storage", and give it a description.
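For reference, the table-creation script from the step above might look roughly like the following; the column names and types are assumptions based on the narration (region name, month, average price), not the actual course script, and the connection details are placeholders for the dedicated SQL pool created in the demo.

    import pyodbc

    # Placeholder connection to the dedicated SQL pool.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:<workspace>.sql.azuresynapse.net,1433;"
        "Database=demosqlpool;Uid=<sql-admin-user>;Pwd=<password>;Encrypt=yes;"
    )

    # Hypothetical shape of the housing data table; adjust the columns to match houseprices.csv.
    conn.cursor().execute("""
    CREATE TABLE dbo.HousingData
    (
        GeoName      NVARCHAR(100) NOT NULL,
        ReportMonth  DATE          NOT NULL,
        AveragePrice DECIMAL(18,2) NULL
    );
    """)
    conn.commit()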
Back in the Copy Data wizard, I'm going to use the default integration runtime, and the way I'll authenticate is with the connection string: from the Azure subscription drop-down I pick my internal subscription, find the housing storage account I created earlier, which loads the account key, pick the file share called housing, and hit Create, which creates a connection to my Azure file storage. I'll go Next and then browse to find the file in the file share, and there it is, houseprices.csv, so I'll choose that. There are other settings I could configure, but I'll leave them and go Next, and now I specify the file format settings: it's text format (other formats are supported too), and because it's a CSV file the column delimiter is a comma; I could configure other things here, but I'll just go Next, and we get a little preview of what the data looks like. Now I need to configure the destination data store, which is going to be our Synapse workspace, so I'll create a new connection, look up Azure Synapse Analytics, hit Continue, and call this connection "azure synapse analytics"; again I can give it a description. I need to connect to my Synapse instance, so I'll pick the demo synapse analytics workspace I created, and the database is that demo sql pool. I also need to authenticate to it, so I'll enter the SQL admin username and password, and I can use Test Connection to check that it can reach my Synapse Analytics successfully. I'll hit Create, and now I've created a connection to Azure Synapse Analytics, so I'll go Next. Now I need to map my CSV data into Synapse Analytics, so I'll use an existing table and select housing data; this is the table in Azure Synapse Analytics that I want to map the file from Azure file storage onto. Going Next I can see the column mappings: every column in the housing data file is read as a string type, while each column in the Synapse table has its own type, such as varchar, char and so on for the remaining columns, and I could also specify a pre-copy script and other settings, but I'll leave those and go Next. Here I can, for example, enable staging, configure fault tolerance, and choose the copy method, such as PolyBase, the COPY command or bulk insert; I'm going to select bulk insert and disable staging for now. I can also specify the degree of copy parallelism, and there's the data integration units setting, which is set to Auto so it's detected automatically, although you could pick a value manually; I'll leave it on Auto and go Next. Finally there's a summary of the copy process from Azure file storage to Azure Synapse Analytics, so once we've reviewed it we go Next, and that starts the copy: it validates, creates the datasets, creates the pipeline and then runs the pipeline, and it looks like it's just finished, so let's hit Finish and go back to our Synapse Studio workspace.
Now I just want to hit Refresh, and the data should have been copied into the housing data table in our demo sql pool database. I'll drill into the table and select the top 100 rows to check whether the data has been copied over successfully, and there you go, the data is now in our housing data table. Lesson four: get started building with Power BI. Microsoft Power BI is a collection of software services, apps and connectors that work together to turn your unrelated sources of data into coherent, visually immersive and interactive insights. Whether your data is a simple Excel spreadsheet or lives in a cloud-based or on-premises data warehouse, Power BI lets you easily connect to any of your data sources, visualize and discover what's in the data, and share that with anyone who wants it. Power BI can be a very simple and fast way to create quick insights from an Excel spreadsheet or a local database, or it can be used to create robust, enterprise-grade models that support real-time analytics. You can use Power BI to create personal reports and visualizations, but it can also serve as the analytics and decision engine behind group projects, divisions or an entire organization. Power BI consists of a few components: Power BI Desktop, the Power BI service and the Power BI mobile apps. Power BI Desktop is a Microsoft Windows desktop application, the Power BI service is an online software-as-a-service (SaaS) offering, and the Power BI mobile apps run on Windows, iOS and Android devices. Each of these three elements, Desktop, the service and the mobile apps, is designed to let people create, share and consume business insights in the way that serves them best. For example, you might be a user who views reports and dashboards in the Power BI service, or you might be the one creating the number-crunching business reports in Power BI Desktop and publishing them to the Power BI service so that other users can view and consume them; a colleague in sales might then use the Power BI mobile app to monitor progress against their sales quota and drill into new sales leads. Power BI generally follows a common flow of activities that looks something like this: you bring data into Power BI Desktop and create a report, then you publish it to the Power BI service, where you can create new visualizations or build dashboards, and then you share those dashboards with other people, who typically view and interact with the shared dashboards and reports through the Power BI mobile app. As a Power BI user you might spend all your time in the Power BI service viewing visuals and reports made by others, or you might spend all your time in Power BI Desktop creating and building reports, or you might only interact with the Power BI mobile app to get insights and drill down into the information presented in the reports others have created. Everything you do in Microsoft Power BI can be broken down into a few basic building blocks; these building
These building blocks are visualizations, datasets, reports, dashboards and tiles.

A visualization, also called a visual, is a visual representation of data, like a chart, a color-coded map, or any other interesting way of visually representing your data, and Power BI has all sorts of visualization types that you can use. Visualizations can be quite simple, such as a single number that represents something significant, or they can be visually complex, like a gradient-colored map that shows voter sentiment about a certain social issue or concern. The main goal of a visual is to present data in a way that provides context and insights, both of which can be quite difficult to gather from a raw table of numbers or text.

A dataset is a collection of data that Power BI uses to create its visualizations. You can have a simple dataset that's based on a single table from a Microsoft Excel spreadsheet, but datasets can also be a combination of many different sources, which you can filter and combine to provide a unique collection of data for use in Power BI. For example, you could create a dataset from three different sources, say a website table from a database, an Excel table, and the online results of an email marketing campaign; combining all of those data sources is still considered a single dataset in Power BI (we'll sketch this combining idea in code a little later in this lesson).

In Power BI you can also create a report, which is a collection of visualizations that appear together on the same page or pages. Just like any report you might have created for a sales pitch or a school assignment, a report in Power BI is simply a collection of items that relate to one another. Reports let you create many visualizations, on a single page or across multiple pages, and arrange those visualizations in whatever way best tells a story with the data, information and insights you've gathered. You might have a report about quarterly sales, a report about product growth in a particular area, or even a report on polar bear migration patterns if you wanted to; whatever subject you need to report on, Power BI lets you gather and organize all your visualizations onto one or more pages.

When you're ready to share a single page from a report, or a collection of visualizations, you can create a dashboard. Much like the dashboard in a car, a Power BI dashboard is a collection of visuals from a single page that you can share with others, and it's often a selected group of visuals that provides quick insight into the data or the story you're trying to present. A dashboard must fit on a single page, often called a canvas; the canvas is the blank backdrop in Power BI Desktop or the Power BI service where you place visualizations, so you can think of it like the canvas an artist or painter uses to create their art. The dashboard is the workspace where you combine and rework those interesting and compelling visuals into a single-page story that you can then share with others, so that they can view and interact with it through the Power BI service or the Power BI mobile app.
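Going back to the dataset building block for a moment: inside Power BI you would do this combining and filtering with Get Data and the Power Query editor, but as a purely conceptual sketch of the same idea, here is roughly what combining three sources into a single dataset looks like in Python with pandas. The file names, table name and join key are all hypothetical, and this illustrates the concept rather than anything the course demo actually builds.

import sqlite3
import pandas as pd

# Three hypothetical sources: a database table, an Excel table, and the
# exported results of an email marketing campaign.
website_visits = pd.read_sql("SELECT * FROM WebsiteVisits", sqlite3.connect("web.db"))
sales_targets  = pd.read_excel("sales_targets.xlsx")
email_campaign = pd.read_csv("email_campaign_results.csv")

# Filter and combine them into one collection of data, i.e. one "dataset",
# keyed on a shared customer_id column (also an assumption).
dataset = (website_visits
           .merge(sales_targets, on="customer_id", how="left")
           .merge(email_campaign, on="customer_id", how="left"))

print(dataset.head())

The point is simply that however the combination is done, Power BI treats the combined, filtered result as a single dataset, and visuals, reports and dashboards are then built on top of it.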
A tile, in Power BI, is a single visualization on a report or a dashboard; it's typically a rectangular box that holds an individual visual. When you're creating a report or a dashboard in Power BI, you can move and arrange the tiles however you want: you can make them bigger, change their height or width, or group them with other tiles. When you're viewing or consuming a dashboard or a report, meaning you're not the creator or owner but someone has created it and shared it with you, you can interact with it, but you typically can't change the formatting, so you can't resize or rearrange the tiles.

To sum it up, Power BI is a collection of services, apps and connectors that lets you connect to your data wherever it happens to reside, filter it if necessary, and bring it into Power BI to create compelling visualizations, dashboards and reports, a story with your data that you can then share with others. For some people, using a single Excel spreadsheet or table as a dataset and sharing a dashboard with their team is one way to use Power BI; for others, the value of Power BI lies in using real-time Azure SQL Database tables, combining them with other databases and real-time sources, and building a moment-by-moment dataset. Whichever group you fit into, the process is the same: you use Power BI to create datasets, build compelling visuals and share them with others. And the result is the same for both groups: you can harness the ever-expanding world of data and turn it into actionable insights. Whether your insights require straightforward or complex datasets, Power BI supports both, and it can help you get started quickly and expand as your needs, your scope and the complexity of your data grow. And because Power BI is a Microsoft product, you can count on it being robust, extensible, Microsoft-friendly and enterprise ready.

Wow, so this concludes the Microsoft Virtual Training Day for Data Fundamentals. Let's have a look at everything we've covered in this training: we started with module 1, where we covered core data concepts; then module 2, where we explored relational data in Azure; module 3, where we explored non-relational data in Azure; and module 4, where we looked at data analytics solutions in Azure. Thank you for taking part in both part one and part two of this Microsoft Azure Virtual Training Day for Data Fundamentals. Remember, you can continue your learning journey: we have a great variety of resources, such as the Microsoft Learn platform, where we offer self-paced learning paths to take your learning further. I wish you all the best for your exams, and happy studying.
Info
Channel: arogyamai
Views: 2,287
Id: MnWAWDeZ3Mk
Length: 254min 33sec (15273 seconds)
Published: Fri Sep 03 2021