Amazon Redshift Tutorial | AWS Redshift Training | Intellipaat

Captions
Hey guys, welcome to this session by Intellipaat. Amazon Redshift is a cloud data warehousing tool available in AWS, and AWS markets it as the fastest cloud data warehouse available. Companies like Lyft, SoundCloud, and Philips are a few among the thousands of companies that use Redshift for their data warehousing needs, and it is very easy to get started with. By the end of this session you will be able to start working with Redshift independently. Before moving on, please subscribe to our channel so that you don't miss our upcoming videos.

Let us take a quick glance at the agenda. We'll start off with a quick introduction to Redshift and its use cases. After that we'll look at Redshift's architecture and its components. Moving on, we'll look at the concepts of Redshift and its integration with other services, and finally, after learning all of these concepts, we'll do a hands-on on Redshift to understand it even better.

Now let us look at the data warehousing service provided by AWS, Amazon Redshift. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more, which lets you use your data to acquire new insights for your business and customers. It also has a distinctive architecture, which I'll show you, that returns query results quickly, so you can derive insights soon after the data is loaded.

The Amazon Redshift service manages all the work of setting up, operating, and scaling a data warehouse. These tasks include provisioning capacity, monitoring (you get metrics, as with almost every AWS service), backing up the cluster with automated snapshots, and applying patches and upgrades to the Amazon Redshift engine.

AWS claims Redshift is up to three times faster than other cloud data warehouses and that it gets faster every year. For performance-intensive workloads you can use the new RA3 instances to get up to 3x the performance of any cloud data warehouse; I believe this was announced at the recent AWS re:Invent event. AQUA (Advanced Query Accelerator) was announced there as well. AQUA is a new distributed and hardware-accelerated cache that allows Redshift to run up to 10x faster than any other cloud data warehouse. They are still building that system and it is expected to become available in 2020, not right now, and when it ships it will be compatible with existing Redshift versions too.

AWS also claims Redshift is 50% less expensive to operate than other cloud data warehouses. You can start small at $0.25 per hour and scale up to petabytes
for under $1,000 per terabyte per year. You pay only for what you use, which is exactly what every cloud service offers, and you know how much you'll spend with predictable monthly costs. AWS states that Amazon Redshift is at least 50 percent less expensive than all other cloud data warehouses; they say "at least", so in practice it may be even cheaper. You can also scale and pay for storage and compute separately and get the optimal amount of each for diverse workloads: instead of choosing one instance type that covers both storage and compute, Redshift lets you size them independently, so you choose your cluster based on your performance requirements and pay only for the storage you actually use. The new managed storage automatically scales your data warehouse storage capacity without you having to add and pay for additional compute instances; this happens automatically, so we don't need to provision anything.

These are the two main use cases. The first is business intelligence: Redshift makes it simple and cost effective to run high-performance queries on petabytes of structured data, so you can build powerful reports and dashboards and connect your existing BI tools to get insights. The second is operational analytics on business events: you bring together structured data from your data warehouse and semi-structured data, such as application logs from your S3 data lake, to get real-time operational insights on your applications and systems. In other words, you can combine structured and semi-structured data from two different services, the data warehouse and the data lake, and gather operational insights about your applications.

These are a few companies that use Redshift today. Amazon itself uses it; they have said that Amazon Redshift was a key component of their successful migration off Oracle and enabled them to significantly reduce the cost of their analytics while scaling to support one of the largest analytical workloads in the world. We all know the volume of transactions that flows through Amazon, so it is a very large workload, and Redshift is working well for them. McDonald's says Amazon Redshift enables faster business insights and growth and provides an easy-to-manage infrastructure to support their data workloads.

So that was the introduction: Redshift is the data warehousing tool provided by Amazon, those are some of the companies using it, and the main use cases are analytics, operational insights, and serving as the data source for your BI tool.
Now let us look at the concepts and architecture of Amazon Redshift. This is the data warehouse architecture AWS shows for Redshift: at the top are the client applications, and inside the cluster there is a leader node and one or more compute nodes, with each compute node further divided into node slices. I'll explain all of these components in a moment. Redshift gives you two connectivity options, JDBC and ODBC (ODBC is the open database connectivity standard and JDBC is the Java database connectivity standard), and you use one of these drivers to connect your client applications to the cluster.

Client applications: Amazon Redshift integrates with various data loading and ETL tools as well as BI, reporting, data mining, and analytics tools. It is based on industry-standard PostgreSQL, so the entire Redshift engine is built on a PostgreSQL foundation, and most existing SQL client applications will work with only minimal changes; you should be able to use clients such as SQL Workbench or other PostgreSQL-compatible tools.

Connections: Amazon Redshift communicates with client applications using industry-standard JDBC and ODBC drivers. They provide both, and you can use either of them.
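Since Redshift is built on PostgreSQL and speaks its wire protocol, a quick illustrative check once a JDBC or ODBC session is open is to ask the engine for its version; this is only a sketch, and the exact string returned will differ from cluster to cluster.

  -- Run from any SQL client or the Redshift query editor once connected.
  -- Redshift typically reports a PostgreSQL 8.0.x base plus its own Redshift version.
  SELECT version();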
Clusters: the cluster is the core component. A cluster is composed of one or more compute nodes, and if a cluster is provisioned with two or more compute nodes, an additional leader node is added automatically to coordinate the compute nodes and handle external communication. If you create a cluster with just a single compute node, there is no leader node, because there is nothing to distribute work across; all the work goes to that one node. With two or more compute nodes, every incoming query or transaction is handled by the leader node, which divides it into smaller tasks and hands those to the compute nodes. Your client application interacts directly only with the leader node; the compute nodes are transparent to external applications, so all requests go first to the leader node and only then to the compute nodes. You don't pay anything for the leader node; you pay only for the compute nodes you create.

Leader node: the leader node manages communication with client programs and all communication with the compute nodes; it sits in the middle like a middleman. It parses incoming statements and develops execution plans, which are the series of steps necessary to obtain results for complex queries. If you submit a complex query, it first reaches the leader node, the leader node splits it into multiple parts and distributes them to the compute nodes, the compute nodes do their part and return their results, and the leader node aggregates everything into a single result and sends it back to the client application. Based on the execution plan, the leader node compiles code for the individual elements of the plan, distributes that compiled code to the compute nodes, and assigns a portion of the data to each compute node.

Compute nodes: the compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type; there are multiple node types, which we'll see while creating the cluster, much like the instance classes we saw earlier. Amazon Redshift provides two families of node types, dense storage nodes and dense compute nodes. Dense compute nodes give you more computing power, while dense storage nodes are meant for scaling up your storage. Storage and compute are handled separately: managed storage can scale automatically, AWS takes care of that for us, but to get more compute you have to change the node type yourself.
A compute node is itself partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, and it processes a portion of the workload assigned to the node. When a compute node receives its part of a query, it splits that work across its node slices if one slice cannot handle it alone; the slices execute their operations against the data in their storage area, and the results go back up to the leader node for final aggregation. So node slices are smaller parts of a compute node, each with its own share of memory and disk space, executing the portion of the workload handed to them.

Those are the basic concepts of the Redshift architecture: connections, the cluster, and within the cluster the leader node, the compute nodes, and the node slices. The leader node is the single point of contact for client applications; if you have two or more compute nodes, a leader node is created automatically, and it is responsible for splitting large workloads into smaller ones and distributing them to the compute nodes. The compute nodes have their own subdivisions, the node slices, which have their own memory and storage; the workload assigned to a compute node is split again across its slices, each slice executes its part, the compute node returns the results to the leader node for final aggregation, and that final collected result is sent back to the client application.

At this point someone asked: is the leader node a single point of failure, or can you have multiple leader nodes and load-balance across them? My understanding is that the leader node is more of a logical component; there isn't a separate head node you provision yourself, and when you create a single-compute-node cluster there is no leader node at all. We'll come back to the failure question shortly when we look at the Redshift FAQ. There is also workload management, which lets you direct particular types of workloads to particular queues, and we'll touch on that later. To recap once more: the client application sits completely outside the Redshift cluster; inside the cluster, if you have multiple compute nodes you have a leader node, if you have a single compute node you have no leader node, and inside the compute nodes are the node slices with their own memory and storage.
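To make the leader node's role a bit more concrete: it is the component that parses a statement and builds the execution plan before anything is distributed to the compute nodes. A minimal way to peek at that plan from the query editor or any SQL client is EXPLAIN; the table and column names below are hypothetical, purely for illustration.

  -- Ask the leader node for the execution plan it would distribute,
  -- without actually running the query.
  EXPLAIN
  SELECT city, COUNT(*) AS user_count
  FROM users
  GROUP BY city;
  -- The output lists the plan steps (scans, aggregations, and any data
  -- redistribution between nodes) along with estimated costs and row counts.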
Now, coming to the performance of Redshift. There are six factors AWS points to as the reasons Redshift is fast, which is how they back the claim that it is up to three times faster than any other cloud data warehouse: massively parallel processing, columnar data storage, data compression, the query optimizer, result caching, and compiled code. Let us discuss each of these.

First, massively parallel processing (MPP). MPP enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the data. This is the same behaviour we discussed for the leader node: the workload is split up, handed to multiple compute nodes, processed in parallel, and the partial results are combined at the end. That is what massively parallel processing means.

Second, columnar storage. In a traditional row-based database, data blocks are typically 2 KB to 32 KB, whereas in Amazon Redshift a block can be as large as 1 MB, which is more efficient and further reduces the number of I/O requests needed for data loading and query execution. Any data coming into Redshift is automatically stored in columnar form: in a row-based database a block holds the data of a particular row, but in Redshift a block holds the data of an entire column, so a table with six columns is laid out as blocks that each contain one column's data. For an analytics engine like a data warehouse this is an important factor in query performance, because it drastically reduces the overall I/O and the amount of data that has to be loaded from disk. Analytics is about scanning large amounts of data; say you want to calculate the average age across 100,000 records. With row-based storage the engine has to go into every block, pull out just the age value, and discard the rest, whereas with columnar storage it reads only the blocks that hold the age column, which is far less data.
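As a tiny illustration of that point, a scan like the one below only needs to touch the blocks of the single column it references when the table is stored column-wise; the table and column names are hypothetical.

  -- With columnar storage, only the blocks holding the age column are read;
  -- the other columns of the users table are never touched.
  SELECT AVG(age) AS average_age
  FROM users;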
Because of columnar storage, AWS says Redshift is about three times faster than a row-based database for analytics operations; put the other way around, such a scan needs roughly a third of the I/O, memory, and compute it would need against row-based storage. That is where columnar data storage comes in, and that is the basic example they provide.

Then data compression. Compression reduces storage requirements, which in turn reduces disk I/O and improves query performance. When you execute a query, the compressed data is read into memory and then uncompressed during query execution; loading less data into memory allows Amazon Redshift to allocate more memory to analyzing the data. So compressed data takes up less room in memory when it is loaded, leaving Redshift more memory for the actual analysis, which is why data compression is used. Like the other factors here, this is default behaviour that Redshift applies automatically; we don't need to enable anything.

Then the query optimizer. The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar data storage. As we just saw, Redshift has both massively parallel processing and columnar storage, and its query execution engine is specifically designed to exploit them, so there is a dedicated query optimizer inside Redshift working for us.

Then result caching. To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. For example, if multiple people keep analyzing the same data by sending the same query to the same cluster again and again, Redshift stores that query's result in the cache on the leader node, and from then on it can be returned much faster than going back to the compute nodes, re-running the query, and aggregating the result again. When a user submits a query, Amazon Redshift checks the result cache for a valid cached copy of the result; if a match is found, it uses the cached result and does not execute the query at all. So the second time you run the same query, no query execution happens and no workload is assigned to the compute nodes, which improves performance; the result simply comes out of the cache.
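A small sketch of how you might observe this from the query editor: the system view SVL_QLOG records recent queries, and its source_query column points back at the original query when a result was served from the leader node's cache. The table name is hypothetical, and disabling the session cache is shown only so you can compare timings; treat the exact columns as something to verify against your cluster's documentation.

  -- Run the same query twice; the second run can be answered from the
  -- leader node's result cache.
  SELECT COUNT(*) FROM users;
  SELECT COUNT(*) FROM users;

  -- Recent queries: rows whose source_query is not null were served
  -- from the result cache instead of being executed again.
  SELECT query, source_query, elapsed, substring
  FROM svl_qlog
  ORDER BY query DESC
  LIMIT 10;

  -- To benchmark without the cache for the current session only:
  SET enable_result_cache_for_session TO off;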
Then compiled code. The leader node distributes fully optimized compiled code across all of the nodes of the cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases execution speed, especially for complex queries. In other words, when a query or a batch of queries comes in, the leader node compiles it, checking and translating it up front rather than interpreting it at run time, and then divides the work and sends the compiled code out to the compute nodes. Those are the six major factors behind Redshift's performance, and they are already built in; there is nothing you need to turn on.

Next, workload management, or WLM. WLM lets you flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries. In the example configuration shown here you can see queues defined with a concurrency level, a user group or query group, and a share of memory, for instance 35 percent of memory for one queue and 40 percent for another, and a query is routed to the queue that matches it. You can also create manual queues yourself: for example, you could give a queue 5 percent of memory for small, fast queries that don't need much, and those queries would be handled by that queue while heavier queries go elsewhere. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for the various types of queues, including internal system queues and user-accessible queues, and you can define multiple queues of your own. So there are two flavours: an automatic queuing system and a manual queuing system that you set up yourself, and a small routing sketch follows this paragraph.
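As a minimal sketch of manual WLM routing, assuming a manual queue whose query group is named 'fast_queue' has been defined in the cluster's WLM configuration (that queue name is my assumption, not from the video), a session can tag its queries so they land in that queue:

  -- Route the following statements to the manual WLM queue that matches
  -- the query group 'fast_queue' (a hypothetical queue name).
  SET query_group TO 'fast_queue';

  SELECT COUNT(*) FROM users;   -- runs in the matched queue

  -- Return to default queue routing for this session.
  RESET query_group;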
Next on the agenda is launching a Redshift cluster. Here is one I launched earlier, so you can see how it will look. The cluster name was cluster-one, and the console shows the endpoint, which is what you would use to connect from a SQL client. It was a multi-node cluster because I chose two nodes of node type dc2.large, I gave no availability zone preference so it went to us-east-1f, and you can see the creation time, maintenance track, VPC, and the cluster status: available, healthy, and not in maintenance mode. We'll go through all of these options in the dashboard shortly.

Then, querying the database with the query editor. I will not be using a SQL client; I will use the query editor provided by Redshift. I also have some sample data to load and some sample queries to execute, so we'll do that. One query editor consideration: it only supports certain node types, namely dc1.8xlarge (a previous-generation type), dc2.large, dc2.8xlarge, ds2.8xlarge, and the ra3 type; ds2.xlarge, for instance, is not supported, so a couple of node types cannot be used with the query editor at all. We will be using dc2.large, so we'll have no trouble accessing the query editor.

Now let's go over to the console. This is the Redshift dashboard. One small bug to be aware of: if you are in the middle of a process, for example creating a cluster, and you have typed something, don't click the X to close the panel, because your selections can get lost. The two main actions here are Launch cluster and Query editor: one for launching a cluster and one for querying the cluster you created. There is also a sizing tool, "find the best cluster configuration for your needs". If you want a petabyte-scale warehouse you can choose the number of petabytes, say 20 PB, and it suggests a cluster configuration along with pricing: the hourly rate if you reserve for three years, the rate for a one-year reservation, and the on-demand rate, which for that configuration is around 870 dollars per hour; that size is obviously not for us. For a small 20 GB workload it recommends dc2.large, the node type we'll use today, and again shows the on-demand hourly rate alongside the lower one-year and three-year reservation rates per node, so reserving is quite economical if you know you'll run the cluster for years, and you can launch a cluster directly from this tool after choosing a configuration. As you move the size slider the recommended node types change, for example to dc2.8xlarge or ds2.8xlarge, along with the price and the maximum capacity; at the top end you can go to 128 nodes with 2,048 TB of compressed storage, that is 16 TB per node across 128 nodes, so roughly two petabytes. There is one even larger instance type, ra3, which isn't shown in this tool but which I'll point out while launching the cluster. Then there are the other sections of the console: cluster snapshots, parameter groups (one default parameter group already exists), events and event subscriptions, clusters, the query editor, and saved queries.
If you run a query and want to keep it, there is a save option and it will appear under saved queries; I have one query saved there already. I have not taken any snapshots, and I won't go through the security section. Under workload management, AWS has newly introduced Auto WLM: Auto WLM uses machine learning algorithms to actively manage your cluster resources by calculating optimal query concurrency and memory allocation in real time, which gives maximum performance for your cluster's workload; to start with Auto WLM you switch the WLM mode in the WLM configuration. The workload manager optimizes query performance by allocating cluster resources across query queues, and as I mentioned you can create multiple queues and route incoming work between them; I'm not going to go further into that right now. Then reserved nodes: as I was showing, the console displays on-demand pricing as well as one-year and three-year reservations, and you can purchase reserved nodes here. The advisor is a tool rather like the recommendation engine for Aurora clusters: for an Aurora cluster you get a recommendations option, and similarly here AWS automatically provides recommendations that you can choose to implement or ignore. Then events: Redshift has its own event types, and the ones shown here are informational with no severity; they record things like the cluster being created and deleted, and if something had failed there would presumably be an event with a higher severity. Then connect client: to connect a client you can try the partner tools they suggest, or you can simply download SQL Workbench, pick the JDBC or the ODBC driver, and download it right from here; you choose your cluster and it gives you the connection configuration. I will not be connecting over JDBC or ODBC, though; I'll use the query editor.

Now let us launch a cluster. There are two ways: quick launch or the full launch cluster flow. Quick launch is a single page: you choose the number of nodes, one, two, three, four, and so on, and the available storage scales accordingly, for example 0.32 TB for two dc2.large nodes and 0.64 TB for four, because it is 0.16 TB per node. You are also shown the node types we discussed, dc2.large, dc2.8xlarge, and the rest, along with their pricing, which is actually nicer than the Aurora console, where you don't get pricing for the instance classes. The storage per node is listed too: 0.16 TB per node for dc2.large, 2.56 TB per node for dc2.8xlarge, and 2 TB or 16 TB per node for the ds2 types.
Those ds2 types are dense storage nodes and the dc2 types are dense compute nodes: the dense compute types have less storage but correspondingly more compute and memory, while the dense storage types give you very high storage per node, and you can pick whichever suits you from here. After that you provide a cluster name, a database name, the port number (the default is filled in automatically), and a master username and password. If you are going to use another service, for example loading data from S3 as I'll show, you also have to attach a role; I'll show what kind of role to create in a moment. First let us create the cluster.

So either you use quick launch or you go through the full launch cluster flow with advanced settings; again, if you have typed something and then close the panel, your input is lost. I'll go with the advanced settings flow, where you configure things step by step: first the cluster details, then the node configuration, then some additional configuration such as security groups. For the cluster identifier I'll provide cluster-1, since I don't have a cluster yet. The database name is optional, there is no asterisk on it; a default database named dev is created for every cluster, and you can optionally specify a custom database here to create an additional one, so I'll give test1; after the cluster is created you can create any number of further databases inside it. The port number is 5439. For the master user I'll go with admin, and the password has to be a combination with at least one uppercase letter and one number, so admin123 won't do and I'll make it Admin123. Continue.

Now you choose the node type. Here you can see node types that were not offered in quick launch: besides dc2.large, dc2.8xlarge, and the ds2 types, there are the previous-generation dc1 types and the newest type, ra3. I'm going to go with dc2.large. For this type each node gives you 7 EC2 compute units (2 virtual cores), 15.25 GiB of memory, 160 GB of SSD storage, and moderate I/O performance. Note that the 7 compute units are EC2 compute units, not the number of node slices; once the cluster is up we can check the actual slice layout from a system table, as in the sketch after this paragraph. You then choose single node or multi node; the maximum number of nodes for the dc2.large node type is 32, and I'll just go with two, since I don't need a huge cluster right now.
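As a small aside, once the cluster is available you can see how many slices each compute node actually has by querying the STV_SLICES system view from the query editor; this is just an illustrative check, and the counts you see depend on the node type you chose.

  -- One row per slice; the node column tells you which compute node
  -- each slice belongs to.
  SELECT node, slice
  FROM stv_slices
  ORDER BY node, slice;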
You can also read the note on this page: the compute nodes store your data and execute your queries, and in addition to your compute nodes a leader node is added to your cluster free of charge; the leader node is the access point for ODBC and JDBC and generates the query plans executed on the compute nodes. That also relates to the earlier question about whether the leader node is a single point of failure, which we'll look into shortly. Continue.

For the parameter group, there is one default Redshift parameter group already and it is selected automatically; as I told you, a parameter group is essentially a set of flags for which you provide values, and you can create or edit your own if you need to. For database encryption you can choose none, or have it encrypted with KMS using either the default AWS-managed key for your account or your own key; I don't need it right now, so I'll go with none. The default VPC is taken automatically; for the availability zone you can pick one or leave it as no preference and one is chosen for you; the cluster subnet group defaults too, because the default VPC has a default subnet group. Publicly accessible should be yes, because only then will I be able to connect with a SQL client from outside. For the public IP address they are really asking about an Elastic IP: if you click yes you can pick an Elastic IP you already have in your account, or choose no and one is created and attached to the cluster automatically. Enhanced VPC routing means all data going in and out of the cluster is forced through one particular VPC; that requires more configuration, for example the VPC should not be publicly accessible, so I'll leave it off. For the list of VPC security groups to associate with the cluster, you can't really select anything here; because I chose the default VPC, its default security group is picked up automatically, and if you chose another VPC its associated security group would be taken instead.

You can also create a CloudWatch alarm directly from here: you pick a disk usage threshold (the console offers values from 70 up to 95 percent) and an SNS topic, either an existing topic or one created on the spot; you all know what a topic is, it's basically what's used to send the email notifications. When disk usage across all nodes breaches that percentage, an email is sent to the topic. I don't need an alarm right now. Finally the maintenance track, where the options are current, trailing, and private beta.
If you choose private beta you can preview upcoming features, but I'd suggest choosing either current or trailing. Current applies the latest certified maintenance release, which includes the newest features and bug fixes; trailing applies the previous certified maintenance release, the one before the current release. I'll go with current to get the latest. If you want to attach an IAM role right now you can do it here; I already have one I could attach, but I'll attach it later. So let us launch the cluster; those configurations are more than enough.

One note: unless you are eligible for the free trial, you start incurring charges as soon as your cluster is active. Let me show you the free trial terms. You can try Amazon Redshift free for two months if your organization, meaning your AWS account, has never created a Redshift cluster. You get 750 hours per month, enough to continuously run one dc2.large node with 160 GB of storage. That 750 hours per month is for one node, so if you run two nodes it is split between them, three nodes split it three ways, and so on up to 32 nodes; with 32 dc2.large nodes the 750 hours would be used up very quickly, 750 divided by 32. Who is eligible? You are eligible if your account has not created an Amazon Redshift cluster since July 1st, 2014, so I think most of you will qualify. If you don't have the free trial, the on-demand hourly rate for this cluster is $0.25 per node, and there are two nodes here plus a free leader node; you pay nothing for the leader node. If you have purchased reserved nodes in this region for this node type, your costs are discounted accordingly.

So I'll just launch the cluster and then view all the clusters. It's still creating; once it's created we'll be able to look inside it and see the endpoint and the other details, so let's wait. And now it's available: three events were recorded, and the cluster state is available, although the DB health is still unknown; the console will check the DB health and also report whether the cluster is in maintenance mode or not.
Coming down, an endpoint has now been provided, and this endpoint is what you use to connect with your SQL client. Under cluster properties you have the name, multi node with two nodes of the chosen node type, created in us-east-1f with the creation time, and the cluster version, which is up to date with the latest release; that's what I mentioned about the maintenance track being current. Then the VPC, cluster subnet group, parameter group, no IAM role attached yet, cluster state available, DB health healthy, and not in maintenance mode. Under database properties the port number is 5439, the cluster is publicly accessible, the database we created is test1, the master username is admin, and it is not encrypted. Connection strings are provided for both drivers: the JDBC URL has the form jdbc:redshift://<endpoint>:5439/<database>, and the ODBC connection string carries the server name, database, UID (admin), password, and port number. Then there are backup, audit logging, and maintenance settings; allow version upgrade is yes. The capacity details repeat what we chose: dc2.large, 7 EC2 compute units (2 virtual CPUs) per node, 15.25 GB of memory per node, 160 GB SSD storage per node, moderate I/O. There is also the public key of the cluster, which I believe can be used for SSH access, and the node IP addresses: the leader node's public and private IPs, and the public and private IPs of compute-0 and compute-1; the compute nodes are numbered from zero, so with 32 nodes you would go up to compute-31.

Now, as one of you asked, what happens if the leader node fails? Let's check the Amazon Redshift FAQ. It says that in case of node failures, Amazon Redshift automatically provisions new nodes and begins restoring data from other drives within the cluster or from Amazon S3, and it prioritizes restoring your most frequently queried data so that your most frequently executed queries become performant again quickly. Under availability and durability there is the question of what happens to cluster availability and data durability if a drive or node fails. For an individual node failure, Amazon Redshift will automatically detect and replace the failed node in your data warehouse cluster; until the replacement node is provisioned the cluster is unavailable for queries and updates, and once the replacement is added Redshift makes it available immediately and
loads your most frequently accessed data from S3 first, with the rest of the data following later. Single-node clusters do not support data replication, so in the event of a drive failure you would need to restore the cluster from a snapshot on S3; AWS recommends using at least two nodes for production. So with a single node, an individual node failure takes your cluster down, whereas with two or more nodes the failed node is detected and replaced automatically. The FAQ also covers the case where the entire availability zone has an outage: if the availability zone your cluster lives in is completely unavailable, you will not be able to use the cluster until power and network access are restored to that zone, but your data warehouse cluster's data is preserved, so you can start using it again as soon as the availability zone comes back online. In addition you can choose to restore any existing snapshot to bring up another cluster, and Amazon Redshift will restore your most frequently accessed data first so you can resume queries as quickly as possible.

A few related FAQ points: you are not charged for the leader node, only the compute nodes incur charges; and to the question of whether you can access the compute nodes directly, the answer is no, your Redshift compute nodes are in a private network space and can only be accessed from the cluster's leader node, which provides an additional layer of security. So there is no way around the leader node; if it fails, a replacement is provisioned for you, and there is no multi-leader or load-balanced-leader setup. At most you could keep a second cluster in another region and use it while this one is unavailable. If the cluster endpoint itself is unreachable, you simply have to wait until it comes back; if a compute node goes down in a multi-node cluster the failed node is replaced, but if your only node goes down then nothing can run and any incoming queries will be cancelled until it is restored. So that is the situation for a leader node failure: you wait for AWS to bring it back.

Coming back to the console: we have created the cluster, and all you had to do was fill in that information. We also have a database named test1.
Now I'll show how to connect to this database from the query editor. When you click on Query editor it first shows the connection details for the previous cluster, so I'll close that; cluster-1 is the cluster I have right now. The supported node types for the query editor are listed again: dc1.8xlarge, dc2.large, dc2.8xlarge, and ds2.8xlarge; ds2.xlarge cannot be used with the query editor, and the ra3 node type is also supported even though it isn't shown here, as the documentation (and the slide earlier) mentions. To log in, enter the database name, either dev or test1; dev is created by default and test1 is the database we created while launching the cluster. I'll enter test1, the username admin, and the password, and connect. Now it is connected. The console shows the cluster name and the database name, and if you click on the database name you are shown the available databases, dev, test1, and one more, and you can switch between them by selecting one and hitting connect.

Here you can type in a query and run it, so let's create a database: create database dev2, and run the query. Query completed in 0.861 seconds, statement completed successfully. The new database didn't show up in the list immediately, but after checking again, dev2 is there. So that's how simple it is to use the query editor.
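If you don't want to wait for the console's database list to refresh, one way to check from the editor itself is to read the catalog directly; this is just an illustrative query against the standard PostgreSQL catalog that Redshift exposes.

  -- List the databases on the cluster, including the dev2 we just created.
  SELECT datname
  FROM pg_database
  ORDER BY datname;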
So now guys, let us do a simple demo: I'll create a role and attach it to this Redshift cluster, and what that role does is provide the permissions for this particular Redshift cluster to access S3. Once I attach it, from this Redshift cluster we can query data that is stored in S3. I've already stored some data in S3, and I'll also show from where to get that sample data; there is a tutorial I followed, from AWS again, since I just follow the official AWS white papers and documentation, so from there you can get the sample data and upload it to an S3 bucket. Then you provide the IAM role to this particular Redshift cluster, and after providing it you can create tables here, load data into this database, and after loading the data you can query it. So let us start with that. First I'll open IAM, because as I told you, I first have to create a role. So, create role, and the service I have to give the permissions to is Redshift, so I'll choose Redshift over here. I'll be choosing Redshift - Customizable; there is another option called scheduler, but that is not for clusters, while this one is, so I'll choose this. Then you have to provide the permissions for Redshift to access S3. Redshift does not need any other access here, it just needs the access to read objects inside S3, so let's see, yeah, here is the S3 read policy, that is AmazonS3ReadOnlyAccess, and this will allow the Redshift cluster to only read data from S3. I'll just click here, next, tags; if you want you can add a tag, let's say name: redshift, then review, provide a name, allow-redshift-to-s3, and yeah, this allows Redshift clusters to call AWS services on your behalf, and then create role. So the role is created, and I'll just copy this ARN because we will need it later in this demo, so I'll just open another tab and save it there. Now that we've created this, we are going to attach this role to our Redshift cluster, so I'll go to my cluster; I can manage it here itself, Manage IAM roles, choose a role, this is the role I created just now, that is allow-redshift-to-s3, click on it and then apply changes. So now it's modifying, and we have attached an IAM role to our cluster, so that part is done. The next part: because we are going to be loading data from S3, the data should be available in S3, so I'll open the S3 console. I've already uploaded the files inside S3, and I'll show which bucket; it is the bucket we created long back, and I've created a folder inside it called data, and inside that all the text files are available. I have the zip file of those text files, and I'll also show which documentation to go through to download that zip file and also from where you can get the queries. So right now we have attached the role to the cluster and we also have the data inside our S3 bucket, it is over here; I created a folder and uploaded the data into it, so my S3 side is set and the data is ready. Now I just have to run the queries: first create the tables, then load the data from S3 into those tables, that will be the second part, and after loading the data into the tables I'll start querying. So let's just do that. First I'll be creating these tables, and after creating these tables I'll be copying the data, so these are some more queries. You might have wondered why I copied the ARN: basically I have to paste the ARN here, because the query editor by itself does not have access to the bucket, so we have to provide this IAM role so that S3 knows it can serve the data to this Redshift cluster; when we provide this ARN, Redshift will automatically be allowed to read the data. So I'll also paste it in at the end, and then there are some sample queries which were available. Done, so now let us start with this process. First I'll go to my query editor, and here you can see the query results pane. Now, first we'll have to create tables, so I'll create this table, run query, completed in around 0.8 seconds, done, I've created one table, and now I'll load the data for that particular table. I'll just copy this, and I can either paste it into this same query tab and hit run or create a new one and hit run, so I'll just run it here. What is this? Oh sorry, I did not change the role in the copy users command; these are the commands
for copying the data into the tables, users and so on. So now, I forgot to change my ARN, that is the Amazon Resource Name of my IAM role which I just attached to my Redshift cluster, so I will just paste it into all my queries. Yeah, that's done, okay, so now I think it'll work, and you can see the query has run successfully. Now we can either keep using this one query tab for all the operations or open another one; as a best practice, the documentation says to run one complete query per query page. I don't know exactly why, but that is what they have said; sometimes when you run multiple queries together some errors occur, it was showing some kind of errors here, so run one query at a time, though you can run a query in the same query tab multiple times. So I'll just load the content for this one, done, then category, just wait, I'll finish these queries, paste and run, then date, event and venue, done, then there are two more, listing and sales, and every time you run one it shows the results, it just takes some time to show. Okay, so we've loaded all the content inside our tables, but one more thing: you will not be able to browse that content here, that is the only limitation, the query editor is just for querying, you will not be able to browse the data you stored. For that you'll have to use an SQL client, but you will still be able to run queries and see the result data, that you can do; there is just no option for viewing all the tables like that, you will have to query them and look at the results here. If it is a client or a workbench or something like that, you have a full GUI which you can use to see the available tables and databases and everything, but in the query editor you can only query, that's the only option. So now we have loaded the content, and I'll also show what was in the content: the dataset is called TICKIT, and these are the files, they're all text files, so here you can see, it's just loading up, the content is available, you can see an id, first name, last name, then the email, a phone number, and some columns with true/false values, okay. So you can see there is a lot of data available, that many rows. Basically the data will not be stored row by row; if it were handled as single rows every time, queries would take a lot of time, so what Redshift does is store each column's values together, column by column rather than row by row, and it does this transparently, which is what Redshift promises: it uses columnar data storage. I just wanted to show this, and I'll also give the link for downloading this sample data which you can use for practicing. So now we have loaded all the content from S3 into Redshift, and now we'll just do some querying and practice it. So, sample queries: this one is to get the table definition, the table definition of the table named sales, and you can see the table definition in the query results over here; there is some problem with viewing it here, but you can also download the results as a CSV
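To give a concrete picture of the create-and-load step just described, here is a hedged sketch of one create-table-and-copy pair in Redshift SQL. The bucket path, region, and the account ID in the role ARN are placeholders, and the column list is a simplified version of the users table from the AWS TICKIT sample, so it may not match the demo exactly:

-- one of the sample tables (simplified column list, assumed from the TICKIT tutorial)
create table users(
  userid    integer not null distkey sortkey,
  username  char(8),
  firstname varchar(30),
  lastname  varchar(30),
  city      varchar(30),
  email     varchar(100));

-- load the pipe-delimited text file from S3 using the IAM role attached to the cluster
copy users
from 's3://<your-bucket>/data/allusers_pipe.txt'
iam_role 'arn:aws:iam::<account-id>:role/allow-redshift-to-s3'
delimiter '|'
region 'us-east-1';

-- the table-definition query mentioned at the end of this step
select * from pg_table_def where tablename = 'sales';

The iam_role parameter is exactly where the ARN copied earlier goes; without it the COPY command has no permission to read the bucket, which is why the role had to be created and attached first.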
And then another query: select the sum of qtysold from sales and date where sales.dateid equals date.dateid, so they are joining on the date table and, for a given calendar date, calculating the sum of the quantity sold. When I run this query I should get a single sum, and yes, we've got that number; that is the only result we get because that single value is all we needed. So this is one query, and then there is this one, where they've provided a limit of ten, so we'll get 10 rows of output, showing rows 1 to 10, and you can see first name, last name and total quantity, so they've selected the first name, last name and the total quantity: select buyerid, sum of qtysold as total_quantity from sales, group by buyerid, order by total_quantity, limit ten. We'll just change this limit to 200; the editor shows a hundred rows here and then another hundred rows on the next page, so I think 100 rows is the limit for one result page, and you can also download this entire result as a CSV file and look at it for yourself. Okay, so the results are provided here completely. This is one query, I'm just running some sample queries to show how it works, and then another query, same thing. Here you can see that the query results come back fast; there is not a lot of data here, but even when there is really a large amount of data and you want to query it, results come back quickly, because as I told you, that's how Redshift is architected: there is one leader node and then the compute nodes, the leader node distributes the workload, complex queries are divided into simpler parts which are executed in parallel on the compute nodes, the results are combined at the leader node, and the final aggregation is sent back to the client tools. So we've covered the query editor as well. One more thing, there was a question about IAM database authentication. I tried it; basically you will not be able to use roles from this setup, but IAM users can authenticate with their access key and secret access key, so for those IAM users you can provide programmatic access and they can use their access key and secret access key to enable IAM database authentication. However, you can only use that for RDS databases, because only there are you able to enable it; after that you can just use that endpoint and use the security credentials to log in. So you will not be able to use IAM database authentication here in the same way, but you will be able to use your access key, secret access key and a token and do it that way, that is the only option. Okay, so coming back to the query editor, I forgot to show one thing: saved queries. For example, if you think you will run a query often, you can just save it here; provide a query name, and since I don't even remember what query this is, I'll just give it the name query1, add a description, and done, the query is saved. Now go to saved queries, and there it is; you can download it as SQL text, or you can copy the entire SQL text and just paste it here and you'll get it back. So it's pretty easy, you can save your queries over here and get them back whenever you need them, and it's also easy to manage, you can just search for the query you need and get it.
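Just to put the sample queries from this part of the demo in one place, here is roughly what they look like. They follow the AWS TICKIT sample schema, and the particular calendar date and the limit value are only illustrative placeholders, not necessarily the exact ones run in the video:

-- total quantity sold on a given calendar date
select sum(qtysold)
from sales, date
where sales.dateid = date.dateid
  and caldate = '2008-01-05';

-- top buyers by total quantity sold, with their names attached
select firstname, lastname, total_quantity
from (select buyerid, sum(qtysold) as total_quantity
      from sales
      group by buyerid
      order by total_quantity desc
      limit 10) q, users
where q.buyerid = userid
order by total_quantity desc;

In the second query the inner subquery does the per-buyer aggregation and the outer join to users just attaches the names, which matches the first name, last name and total quantity columns seen in the query editor output.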
So that's pretty much saved queries and the hands-on, and that's it. Okay guys, we've come to the end of this session. I hope this session on Amazon Redshift was helpful and informative for you. If you have any queries regarding this session, please leave a comment below and we'll be glad to help you out. Thank you.
Info
Channel: Intellipaat
Views: 21,397
Rating: 4.5537848 out of 5
Keywords: amazon redshift tutorial, redshift tutorial, aws redshift, aws redshift tutorial, aws redshift architecture, aws redshift tutorial for beginners, amazon redshift demo, redshift tutorial for beginners, aws redshift demo, amazon redshift tutorial for beginners, what is amazon redshift, amazon redshift, amazon redshift overview, amazon web services, amazon redshift configuration, redshift, azon redshift architecture, amazon aws redshift tut, Intellipaat, amazon redshift spectrum
Id: CyMA7bJdVfI
Length: 79min 4sec (4744 seconds)
Published: Fri Mar 13 2020