Amazon Redshift Tutorial | AWS Redshift Training | Amazon Redshift Data Warehouse | Intellipaat

Captions
Hey guys, welcome to this session by Intellipaat. Amazon Redshift is a cloud data warehousing tool available in AWS, and it is billed as the fastest cloud data warehouse available. Companies like Lyft, SoundCloud and Philips are a few among the thousands of companies that use Redshift for their data warehousing needs, and it is very easy to get started with. By the end of this session you will be able to start with Redshift independently. Also, before moving on, please subscribe to our channel so that you don't miss our upcoming videos.

Let us take a quick glance at the agenda. We'll start with a quick introduction to Redshift and its use cases, then look at Redshift's architecture and its components, move on to the concepts of Redshift and its integration with other services, and finally do a hands-on on Redshift to understand it even better.

So what is Amazon Redshift? It is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more, which lets you use your data to acquire new insights for your business and customers soon after the data is loaded. The Redshift service manages all the work of setting up, operating and scaling a data warehouse: provisioning capacity, monitoring (we get metrics for almost every service), backing up the cluster with snapshots and automated backups, and applying patches and upgrades to the Redshift engine.

AWS claims Redshift is up to three times faster than other cloud data warehouses and gets faster every year. For performance-intensive workloads you can use the new RA3 instances to get up to 3x the performance of any cloud data warehouse; I think this was announced at the recent AWS re:Invent event. AQUA (Advanced Query Accelerator) was also announced there: a new distributed, hardware-accelerated cache that allows Redshift to run up to 10x faster than any other cloud data warehouse. It is expected to become available in 2020, not right now, and when it is released it will be compatible with existing Redshift clusters as well.

They also claim Redshift is fifty percent less expensive to operate than all other cloud data warehouses: you can start small at $0.25 per hour and scale up to petabytes for under $1,000 per terabyte per year, paying only for what you use, with predictable monthly costs. You can also scale and pay for storage and compute separately and get the optimal amount of each for diverse workloads: instead of choosing one instance type that covers both storage and compute, you provision them according to your needs, choose the size of your Redshift cluster based on performance requirements, and only pay for the storage you use. The new managed storage automatically scales your data warehouse storage capacity without you having to add and pay for additional compute instances.

Now the use cases. The first is business intelligence: Redshift makes it simple and cost-effective to run high-performance queries on petabytes of structured data, so you can build powerful reports and dashboards using your existing BI tools. The second is operational analytics on business events: you bring together structured data from your data warehouse and semi-structured data, such as application logs from your S3 data lake, to get real-time operational insights on your applications and systems. So you can combine structured and semi-structured data from two different services, the data warehouse and the data lake, and gather operational insights on your applications.

A few companies that use Redshift: Amazon itself says that Redshift has been a key component of their successful migration off Oracle and has enabled them to significantly reduce the cost of their analytics while scaling to support one of the largest analytical workloads in the world. We all know the scale at which Amazon operates, with an enormous number of transactions, so that is a very large workload, and Redshift is working well for them. McDonald's says Redshift enables faster business insights and growth and provides an easy-to-manage infrastructure to support their data workloads.

So that was the introduction: Redshift is a data warehousing tool provided by Amazon, those are a few companies using it, and those are the use cases. You can use it for analytics, for operational insights, and as the data provider for your BI tool.
Now let us look at the concepts of Amazon Redshift, starting with the architecture diagram AWS provides. At the top are the client applications; inside the Redshift data warehouse there is a leader node and one or more compute nodes, and each compute node is further divided into node slices. I'll explain each of these right now. Redshift gives you two connectivity options, JDBC (Java connectivity) and ODBC (open connectivity), which you use to connect your client applications to the cluster.

Client applications: Amazon Redshift integrates with various data loading and ETL tools as well as BI, reporting, data mining and analytics tools. It is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes; you can use SQL Workbench or other SQL clients, and since it is built on PostgreSQL you can also use PostgreSQL clients.

Connections: Amazon Redshift communicates with client applications using industry-standard JDBC and ODBC drivers; both are provided and you can use either of them.

Clusters: the cluster is the core component. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. If you create a cluster with just one compute node, there is no need for a leader node, because there is nothing to distribute work to; all the work is assigned to that single node. If you have two or more compute nodes, a leader node is added automatically, and all incoming queries and transactions are taken care of by the leader node: it divides them into smaller tasks and hands them to the compute nodes. You don't pay anything for the leader node, only for the compute nodes you create. Your client application interacts directly only with the leader node; the compute nodes are transparent to external applications. So with two or more compute nodes, your client application's only point of connectivity is the leader node, and every request goes to the leader node first before it reaches the compute nodes.
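Because Redshift speaks the PostgreSQL wire protocol, a plain PostgreSQL driver is enough to talk to the leader node endpoint. Here is a minimal connection sketch in Python with psycopg2; the endpoint, database name and credentials are placeholders for whatever your own cluster uses (the same endpoint also works for JDBC and ODBC clients).

```python
# Minimal sketch: Redshift is PostgreSQL-compatible, so a standard PostgreSQL
# driver such as psycopg2 can connect to the leader node endpoint.
# The endpoint, database, user and password below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="cluster-1.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # leader node endpoint
    port=5439,                # default Redshift port
    dbname="test1",
    user="admin",
    password="Admin123",
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())

conn.close()
```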
Now the leader node and the compute nodes in more detail. The leader node manages communication with the client programs and all communication with the compute nodes; it stands in the middle like a middleman. It parses incoming queries and develops the execution plans to carry out database operations, especially the series of steps needed to obtain the results for complex queries. When you submit a complex query, it first goes to the leader node, which is the first point of contact for any client application. The leader node splits the work into multiple parts and hands them to the compute nodes, which do their part and return results; the leader node aggregates them into a single result and sends it back to the client application. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU, memory and attached disk storage, which are determined by the node type; there are multiple node types, which we'll see while creating the Redshift cluster. Amazon Redshift provides two families of node types, dense storage nodes and dense compute nodes, similar to the instance classes we saw earlier. Dense compute nodes have a lot of computing power but less storage, while dense storage nodes are used when you need to scale up your storage. Storage and compute are handled separately: the managed storage can be scaled automatically by AWS, but to change compute you have to change the node type accordingly.

A compute node is in turn partitioned into slices. Each slice is allocated a part of the node's memory and disk space, where it processes a portion of the workload assigned to the node. When a compute node receives its piece of a query, it splits that work again across its node slices if one slice cannot handle it; the slices execute their operations against the data in their storage area, and the compute node sends the results back to the leader node for final aggregation. So node slices are smaller parts of a compute node, each with its own share of CPU, memory and disk space, processing a portion of the workload assigned to that compute node.

These are the basic pieces of the Redshift architecture: the connection, and the cluster, which contains a leader node and compute nodes. The leader node is the single point of contact for client applications; you can have a number of compute nodes (we'll see how many); if you have two or more compute nodes a leader node is automatically created, and it is responsible for splitting large workloads into smaller ones and distributing them to the compute nodes. The compute nodes have their own subdivisions, the node slices, each with its own memory and storage, and the workload assigned to a compute node is split again among its slices. They execute their part, give it back to the compute node, the compute node gives it back to the leader node for final aggregation, and the final collected result is sent back to the client application.

Now, a question came in: is the leader node a single point of failure, or could you have multiple leader nodes behind a load balancer? I think the leader node is more of a logical component; there is no separate head node that you provision, and when you create a single-compute-node cluster there is no leader node at all. There is also workload management, which we'll look at, where you can route particular types of workloads to particular queues; we'll come back to the single-point-of-failure question when we look at the FAQ. To recap: the client application sits completely outside the Redshift cluster; inside the cluster, if you have multiple compute nodes you have a leader node, if you have a single compute node you have no leader node, and inside the compute nodes you have node slices with their own memory and storage.
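Before moving on to performance, one way to actually see the execution plan the leader node builds is to run EXPLAIN on a query. A small sketch, reusing the psycopg2 connection from earlier; the table and query here are just an illustration, not part of this demo's data set yet.

```python
# Illustrative sketch: print the plan the leader node builds before distributing work.
# Assumes `conn` is an open psycopg2 connection (see the earlier sketch) and that a
# table named `sales` exists; swap in any table of your own.
query = "SELECT eventid, SUM(pricepaid) FROM sales GROUP BY eventid ORDER BY 2 DESC LIMIT 10;"

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + query)
    for (plan_line,) in cur.fetchall():
        # Plan steps (scans, aggregates, sorts) show how the work is broken up
        # before the compiled segments are pushed down to the compute nodes.
        print(plan_line)
```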
Now, performance. There are six factors that are claimed to make Redshift's performance faster, since AWS promises it is three times faster than any other cloud data warehouse: massively parallel processing, columnar data storage, data compression, the query optimizer, result caching, and compiled code. Let us discuss each of these.

First, massively parallel processing (MPP). MPP enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. This is the same thing I explained about the leader node: it splits the workload up, distributes it to multiple compute nodes, the processing happens in parallel, and the results are combined and returned. That is massively parallel processing.

Second, columnar storage. In a traditional row-based database, the database blocks are typically 2 KB to 32 KB; in Amazon Redshift a block can be 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform database loading and the other operations that are part of query execution. Any data coming into Redshift is automatically stored by column. In a row-based database, a block holds the data of particular rows; in Redshift, a block holds data from a single column, so if a table has six columns the data is laid out as six runs of blocks, one per column, each holding only that column's values. For an analytics engine like a data warehouse, columnar storage gives better performance, because it drastically reduces the overall I/O requirements and the amount of data you need to load from disk. For example, say you want to calculate the average age over 100,000 records. With row-based storage, the engine has to go through every block, pull the age value out of each row, and then do the calculation. With columnar storage it just reads the blocks of that one age column and uses those. That reduces the I/O and makes it faster, which is why Redshift uses columnar storage; for analytics operations it is claimed to be around three times faster than a row-based database, which roughly means a third of the memory and compute that row-based storage would need for the same analytics.

Third, data compression. Compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory and then uncompressed during query execution. Because less data is loaded into memory, Amazon Redshift can allocate more memory to analyzing the data. All of these are things Redshift applies automatically; we don't need to enable them.

Fourth, the query optimizer. The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar data storage. Since Redshift has both MPP and columnar storage, there is a query execution engine specially designed to exploit them, running for us inside Redshift.

Fifth, result caching. To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. For example, if multiple people keep analyzing the same data and run the same query against the same cluster again and again, Redshift stores that result in the cache on the leader node, and from then on it can be returned much faster than going back to the compute nodes, running the query again and aggregating the result. When a user submits a query, Amazon Redshift checks the result cache for a valid cached copy of the query result; if a match is found, it uses the cached result and does not execute the query.
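You can watch result caching happen by running the same query twice and looking at the system view SVL_QLOG, whose source_query column points at the earlier query whose cached result was reused. A rough sketch, again assuming an open psycopg2 connection and an existing table:

```python
# Rough sketch: run the same query twice and inspect SVL_QLOG to see whether the
# second run was answered from the leader node's result cache.
# Assumes `conn` is an open psycopg2 connection (see the earlier sketch).
sql = "SELECT COUNT(*) FROM sales;"   # any repeatable read-only query

with conn.cursor() as cur:
    cur.execute(sql); cur.fetchone()   # first run populates the cache
    cur.execute(sql); cur.fetchone()   # second run should hit the cache

    # source_query is NULL when the query was actually executed, and holds the id
    # of the original query when a cached result was returned instead.
    cur.execute("""
        SELECT query, substring, source_query
        FROM svl_qlog
        ORDER BY starttime DESC
        LIMIT 2;
    """)
    for row in cur.fetchall():
        print(row)
```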
So when a cached result is used, the query is not executed again and no workload is assigned to the compute nodes, which improves performance.

Sixth, compiled code. The leader node distributes fully optimized compiled code across all of the nodes of the cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases execution speed, especially for complex queries. So whenever a query comes in, the leader node compiles it first, then divides the work and distributes the compiled code to the compute nodes. These are the six major factors behind Redshift's performance, and they are already implemented in Redshift; there is no need to turn anything on.

Then there is workload management, or WLM. WLM lets you flexibly manage priorities within workloads so that short, fast-running queries don't get stuck in queues behind long-running queries. In the example you can see a queue for a user group with a concurrency of 3 and 35 percent of the memory, and another queue with 40 percent; queries are routed to the matching queue. You can also create manual queues: for example, give a queue 5 percent of memory, and the small, fast-running queries that don't need a lot of memory will be handled by that queue, while bigger queries go to the other queues. You can create multiple queues like this. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for the various types of queues, including internal system queues and user-accessible queues. There are two flavours: an automatic queuing system and manual queuing, which you have to set up yourself.
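Manual WLM queues live in the cluster's parameter group as a JSON document (the wlm_json_configuration parameter). Here is a hedged sketch of what that could look like with boto3; the parameter group name, user group and percentages are examples of mine, not values from this demo.

```python
# Hedged sketch: define manual WLM queues in a custom parameter group.
# The parameter group name, group names and numbers are illustrative only;
# the default parameter group cannot be modified.
import json
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

wlm_config = [
    {   # queue for a BI user group: 3 concurrent slots, 35% of memory
        "user_group": ["bi_users"],
        "query_concurrency": 3,
        "memory_percent_to_use": 35,
    },
    {   # short, fast queries: more slots, little memory
        "query_group": ["short"],
        "query_concurrency": 10,
        "memory_percent_to_use": 5,
    },
    {   # last entry acts as the default queue for everything else
        "query_concurrency": 5,
    },
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-redshift-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```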
Now, launching a Redshift cluster. I launched one earlier, so let me show how it looks (there are also a couple of small bugs in the console that I'll point out). That cluster was named cluster-1; the console shows the endpoint they provide, which you would use to connect from your SQL client. It was multi-node because I chose two nodes of node type dc2.large; I didn't set any availability zone preference, so it picked one in us-east-1 automatically; and you can see the creation time, the maintenance track, the VPC, the cluster state (available), the database health (healthy) and that it was not in maintenance mode. We'll go through all of these options in the dashboard and understand them completely.

Then, querying a database using the query editor: I will not be using a SQL client, I'll use the query editor provided by Redshift. I have some sample data to load and some sample queries to execute, so I'll be doing that. There are some query editor considerations we'll come back to after the demo; the important one is that the query editor only supports certain node types: dc1.8xlarge (previous generation), dc2.large, dc2.8xlarge and ds2.8xlarge are allowed, ds2.xlarge is not, and the newer RA3 node type is also supported. So a couple of node types cannot be used with the query editor; we'll be using dc2.large, so no worries, we will be able to access it.

Now let's go over to the console. This is the dashboard. One quirk: if you are in the middle of some process, for example creating a cluster, and you have typed something, don't click the X to close the dialog, because what you entered gets lost. Launch cluster and Query editor are the two main actions here: one launches a cluster, the other queries the cluster you created. There is also a tool, "Find the best cluster configuration for your needs": you choose how much data you have, from gigabytes up to petabytes, and it suggests a node type and shows the estimated cost if you reserve for three years, for one year, or run on demand per hour. For 20 GB it suggests dc2.large, the node type we'll use today, and the effective hourly rate drops a lot with a one-year reservation and even more with three years, so reserving is very economical if you know you will run it that long. For large data volumes it recommends dc2.8xlarge or ds2.8xlarge and shows the maximum capacity: up to 128 nodes at 16 TB per node, around 2,048 TB of compressed storage. As you move the slider, the recommended option changes. The largest instance type isn't shown here; that is the RA3 family, which I'll show while launching the cluster. You can launch a cluster directly from this tool after choosing an option.

Then you have cluster snapshots and the other basic sections: I have one parameter group, and there are events and event subscriptions. Under the query editor there are saved queries: if you run a query and want to keep it, hit Save and it appears here (I have saved one). I haven't taken any snapshots and haven't gone into security. Under workload management they have newly introduced Auto WLM: it uses machine learning algorithms to actively manage your cluster resources by calculating optimal query concurrency and memory allocation in real time, giving maximum performance for your cluster workload; to start with Auto WLM you switch the WLM mode in the configuration. The workload manager optimizes query performance by allocating cluster memory to queries, and as I mentioned you can create multiple queues; I'm not going to go deeper into that right now. Reserved nodes is where you purchase one-year or three-year reservations instead of paying the on-demand price. Advisor is like the recommendation engine we saw for Aurora clusters: AWS automatically provides recommendations which you can implement or ignore. Under events you can see entries such as REDSHIFT-EVENT-2001 and 2003 for cluster creation and deletion; these are informational with no severity, but a failure would create an event with a higher severity like low or high. Finally, Connect client: you can try the partner tools they suggest, or just download SQL Workbench and the JDBC or ODBC driver from here; you choose your cluster and it gives you the connection configuration. I will not be connecting with JDBC or ODBC here; I will be using the query editor.
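If you'd rather script your queries than download a driver or use the console query editor, there is also the Redshift Data API. A hedged sketch follows, assuming the Data API is available in your region; the cluster identifier, database and user mirror this demo but are placeholders.

```python
# Hedged sketch: run a query without any JDBC/ODBC driver via the Redshift Data API.
# Cluster identifier, database and user mirror the demo but are placeholders.
import time
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

resp = data_api.execute_statement(
    ClusterIdentifier="cluster-1",
    Database="test1",
    DbUser="admin",          # temporary credentials are generated for this user
    Sql="SELECT current_database();",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = data_api.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = data_api.get_statement_result(Id=resp["Id"])
    print(result["Records"])
else:
    print("Statement ended with status:", desc["Status"], desc.get("Error"))
```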
Okay, so now let us start with clusters and launch one. There are two ways: Quick launch or Launch cluster. Quick launch is just one page: you choose the number of nodes (1, 2, 3, 4 and so on); with two nodes you get 0.32 TB of storage available, with four it's 0.64 TB, because it is 0.16 TB per node, and it increases accordingly. You are also shown the node types and their pricing per node, which is actually nicer than Aurora, where they don't show pricing for the instance classes. dc2.large gives 0.16 TB per node, dc2.8xlarge 2.56 TB per node, and the ds2 types 2 TB and 16 TB per node. The ds2 types are dense storage and the dc2 types are dense compute: dense compute nodes have less storage but optimized compute and memory, while dense storage nodes have very high storage. After choosing a node type you provide a name, a database name and the port number (the default is filled in automatically), then the master username and password, and if you are going to use another service, for example loading from S3, you choose one of your IAM roles; I'll show what kind of role to create later. First let us create the cluster.

So, either the quick launch or the advanced settings (again, if you've typed something and close the dialog, the page resets). I'll go with the advanced settings, where you configure things one by one: first the cluster details, then the node configuration, then some additional configuration like security groups. Cluster identifier: I'll provide cluster-1; I don't have a cluster yet. Database name: this is optional (there is no star next to it); a default database named dev is created for the cluster, and you can optionally specify a custom database, for example mydb, to create an additional database. So a database is already created when you create a cluster; if you want an additional one before launching you give the name here, and after creating the cluster you can create any number of databases inside it. I'll give test1, since I don't know whether a test already exists. Port number is 5439, and for master user I'll go with admin and password admin123. Actually the password has to contain at least one uppercase letter and one number, as it says here, so I'll just make the A capital: Admin123. Done, continue.

Now you choose the node type. Here you see node types that quick launch didn't show: alongside dc2.large, dc2.8xlarge and the ds2 types there are the previous-generation types, and the newest one is RA3. I'm going to go with dc2.large: 7 EC2 compute units (2 virtual cores) per node, 15.25 GiB of memory per node, 160 GB of SSD storage per node, and moderate I/O performance; each of these nodes will have its node slices, as we discussed. Then it can be a single node or multi-node; the maximum number of nodes for the dc2.large node type is 32. I'll go with two, since I don't want a huge cluster right now. You can also read the note here: compute nodes store your data and execute your queries; in addition to your compute nodes, a leader node is added to your cluster free of charge; the leader node is the access point for ODBC/JDBC and generates the query plans executed on the compute nodes. That also relates to the single-point-of-failure question one of you asked, which we'll look into shortly. Continue.

A default Redshift parameter group already exists and is automatically selected; as I told you, a parameter group is a set of flags where you just provide the values, and you can create or edit your own if you want. Database encryption: you can go with none, or with KMS, where it is encrypted with the default key available in your AWS account without you doing anything; I'll go with none right now. VPC: the default is taken automatically; availability zone: you can pick one or choose no preference and it picks one for you; cluster subnet group: default again, because I chose the default VPC and it has a default subnet group. Publicly accessible: yes, because only then will I be able to connect with a SQL client. "Choose a public IP address" refers to an Elastic IP: if you already have an Elastic IP in your account you can choose it; if not, choose no and one is created and assigned to the Redshift cluster automatically. Enhanced VPC routing means all data going in and out of the cluster flows only through your VPC; you need extra configuration for that, such as a VPC that is not publicly accessible, so I'll leave it off. For the VPC security groups you can't really select anything here: because I chose the default VPC it automatically takes my default security group; if you use another VPC that has its own security group, that one is taken instead. Finally you can create a CloudWatch alarm for disk usage directly from here, with thresholds from 70 to 95 percent, and choose an existing SNS topic (topics are basically used to send email notifications) or create one; when disk usage across all nodes breaches, say, 80 percent, the alarm sends an email to that topic. I don't need an alarm right now.
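Everything configured on these console pages maps onto a single API call; here is a hedged boto3 sketch of roughly the same cluster (the values mirror the demo, and anything not set falls back to the service defaults such as the default VPC and subnet group).

```python
# Hedged sketch: create roughly the same cluster with boto3 instead of the console.
# Values mirror the demo (cluster-1, test1, admin/Admin123, 2 x dc2.large);
# anything not set here falls back to the service defaults.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="cluster-1",
    NodeType="dc2.large",
    ClusterType="multi-node",      # "single-node" would mean no leader node
    NumberOfNodes=2,
    DBName="test1",                # optional; a "dev" database is created either way
    Port=5439,
    MasterUsername="admin",
    MasterUserPassword="Admin123", # must contain an uppercase letter and a number
    PubliclyAccessible=True,       # needed to reach it from an external SQL client
    Encrypted=False,
)
```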
Then the maintenance track: current, trailing, or the private beta. If you choose the beta you can preview upcoming features; otherwise choose current or trailing. Current applies the latest certified maintenance release, which includes the latest features and bug fixes; trailing applies the previous certified maintenance release, the one before current. I'll go with current for the latest. If you want to attach an IAM role right now you can; I already have one, but I'll attach it later. So we have provided all the configuration; let us launch the cluster.

One note: unless you are eligible for the free trial, you start accruing charges as soon as your cluster is active. Let me show you the free trial: you can try Amazon Redshift free for two months if your organization (meaning your AWS account) has never created an Amazon Redshift cluster. You get 750 hours per month of the dc2.large node with 160 GB of storage, enough to continuously run one node. That 750 hours per month is for one node, so if you run two nodes the hours are effectively split between them, three nodes between those three, and so on up to 32. Who is eligible? You are eligible if your organization has not created an Amazon Redshift cluster since July 1st, 2014. I think most of you will not have created one, so you should be eligible. If you don't have the free trial, the on-demand hourly rate for this cluster is $0.25 per node; there are two nodes here plus a free leader node, and you pay nothing for the leader node. If you have purchased reserved nodes in this region for this node type and they are active, your costs are discounted accordingly.

So I'll launch the cluster, and now let us view all the clusters. It is still creating; once it is created we'll be able to look inside it and see the endpoint and so on, so let's wait. And now the cluster is available.
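If you are scripting this instead of watching the console, boto3 has a waiter that blocks until the cluster is available, after which you can read the endpoint from describe_clusters. A small sketch, continuing from the create_cluster call above:

```python
# Small sketch: wait for the cluster to become available, then grab the endpoint
# that SQL clients (or the earlier psycopg2 example) would connect to.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.get_waiter("cluster_available").wait(ClusterIdentifier="cluster-1")

cluster = redshift.describe_clusters(ClusterIdentifier="cluster-1")["Clusters"][0]
endpoint = cluster["Endpoint"]
print("connect to:", endpoint["Address"], "port:", endpoint["Port"])
print("status:", cluster["ClusterStatus"], "nodes:", cluster["NumberOfNodes"])
```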
Back in the console, the cluster shows a few events for the creation. The cluster state is available; the database health shows as unknown at first, and once the health check runs it tells us whether it is healthy and whether it is in maintenance mode. They have now provided an endpoint, and that endpoint is what you use to connect from your SQL client. Under cluster properties you have the name, multi-node with node type dc2.large and two nodes, the availability zone (us-east-1f) and creation time, and the cluster version, which is up to date with the latest release because we chose the current maintenance track; then the VPC, cluster subnet group and parameter group, and no IAM role attached yet. Cluster state is available, database health is healthy, and it is not in maintenance mode.

Under database properties, the port is 5439, it is publicly accessible, we created a database called test1 with master username admin, and it is not encrypted. They give you both connection strings: for JDBC it is jdbc:redshift:// followed by the endpoint and the database name, and for ODBC you get the server name, database, UID (admin), password and port. Then there are the backup, audit logging and maintenance/version-upgrade sections. The capacity details repeat what we chose: dc2.large with 7 EC2 compute units (2 virtual CPUs) per node, 15.25 GiB of memory per node, 160 GB SSD storage per node and moderate I/O performance, plus the public key of the cluster, which I think can be used for SSH access. They also list the node IP addresses: the leader node's public and private IPs, then compute-0 and compute-1; if you scale out, more compute nodes appear the same way.

Now, as one of you asked, what happens if the leader node fails? Let's check the Amazon Redshift FAQ. In case of node failures, Amazon Redshift automatically provisions new nodes and begins restoring data from other drives within the cluster or from Amazon S3; it prioritizes restoring your most frequently queried data, so your most frequently executed queries become performant again quickly. On availability and durability, for an individual node failure: Amazon Redshift automatically detects and replaces a failed node in your data warehouse cluster, and the cluster is unavailable for queries and updates until a replacement node is provisioned and added. Amazon makes the replacement node available immediately and loads your most frequently accessed data from S3 first, so you can resume queries as quickly as possible, with the rest of the data loaded afterwards. Single-node clusters do not support data replication, so in the event of a drive failure you will need to restore the cluster from a snapshot on S3; they recommend using at least two nodes for protection, so that if one node fails the other keeps working while the replacement is provisioned. If the entire availability zone has an outage, you will not be able to use your cluster until power and network access are restored to that availability zone, but your data warehouse cluster's data is preserved, so you can start using it again as soon as the availability zone comes back online. In addition, you can choose to restore any existing snapshot to a new cluster, and Amazon Redshift restores your most frequently accessed data first so you can resume queries as quickly as possible. And can you access compute nodes directly? No: Redshift compute nodes are in a private network space and can only be accessed from the cluster's leader node, which provides an additional layer of security. So there is no way around the leader node. If the cluster endpoint itself is unavailable, you basically wait for it to come back, or keep a snapshot so you can stand up another cluster elsewhere and use that in the meantime. If a compute node goes down and you have two, the other keeps working until the replacement arrives; with multiple nodes you are better off, but with a single node your whole cluster is down, nothing runs, and any queries in flight are cancelled.
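Since snapshots are what you fall back on for a single-node cluster or an availability zone outage, here is a hedged boto3 sketch of taking a manual snapshot and restoring it into a new cluster; the snapshot and new-cluster identifiers are made up for illustration.

```python
# Hedged sketch: take a manual snapshot of the demo cluster and restore it into a
# brand-new cluster. The snapshot and new-cluster identifiers are illustrative.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Manual snapshot of the running cluster (automated snapshots also exist).
redshift.create_cluster_snapshot(
    SnapshotIdentifier="cluster-1-manual-snap",
    ClusterIdentifier="cluster-1",
)
redshift.get_waiter("snapshot_available").wait(
    SnapshotIdentifier="cluster-1-manual-snap",
)

# Restore the snapshot into a new cluster (for example, after a failure).
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="cluster-1-restored",
    SnapshotIdentifier="cluster-1-manual-snap",
)
```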
Okay, coming back here: we have created a cluster, guys, and that's all there is to it — you just provide the information. Now we have a database named test1, and I'll show how to connect to it from the query editor. When you click on Query editor it still shows the previous cluster's connection attempt, so I'll close that; cluster-1 is the cluster I have right now. The supported node types for the query editor are dc1.8xlarge, dc2.large, dc2.8xlarge and ds2.8xlarge; ds2.xlarge cannot be used; and you can also use the RA3 node type — it isn't listed here, but the documentation (and the slide) says it is supported. Logging in to the database is pretty simple: enter the database name, either dev or test1 (dev was created by default, test1 is the one we created while creating the cluster); I'll use test1, the username admin, the password, and connect. Now it is connected. You can see the cluster name, the database name, and if you click the database name you are shown the available databases — dev, test1 and another one I'm not sure about — and you can switch by selecting one and hitting connect.

Here you can type in a query and run it, so let's create a database: create database dev2, run the query — query completed in 0.861 seconds, statement completed successfully. I had to wait a moment for it to show up, but now dev2 is in the list, so that's how simple the query editor is to use.

Now let us do a simple demo: I'll create an IAM role and attach it to this Redshift cluster. What that role does is give this Redshift cluster permission to access S3, so that from this cluster we can query data stored in S3. I've already stored some data in S3, and I'll show you where to get that sample data: there is an AWS tutorial I followed (I just follow the official AWS documentation and white papers), from which you can download the sample data, upload it to an S3 bucket, attach the IAM role to the cluster, create tables, load the data into the database, and then query it. So let us start with that. First I'll open IAM and create a role. The service that needs the permissions is Redshift, so I choose Redshift; there is also an entry called Redshift scheduler, but that is not for clusters, so I choose Redshift - Customizable, which is the one for clusters.
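The same role can be set up with a couple of boto3 calls; a hedged sketch follows. The role name matches the one I use in the console below, and the trust policy is the standard one that lets Redshift assume the role.

```python
# Hedged sketch: create an IAM role that Redshift clusters can assume, and give it
# read-only access to S3. The role name mirrors the one created in the console.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},  # Redshift assumes the role
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="redshift-s3",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Allows Redshift clusters to read data from S3",
)

iam.attach_role_policy(
    RoleName="redshift-s3",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

print("Role ARN:", role["Role"]["Arn"])   # needed later in the COPY command
```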
First I'll open IAM, because as I said I have to create a role. Go to Roles, Create role, and the service that needs the permissions is Redshift, so I'll choose Redshift; there are two use cases, Redshift - Scheduler and Redshift - Customizable, and the scheduler one is not for clusters, so I'll choose Redshift - Customizable. Then you have to provide the permissions for Redshift to access S3. Redshift does not need any other access here; it just needs to read objects inside S3, so I'll search for S3 and pick the AmazonS3ReadOnlyAccess policy, which allows the Redshift cluster to only read data from S3. Click Next; add a tag if you want, say Name = redshift; then review and provide a name, allow-redshift-to-s3. The description says this allows Redshift clusters to call AWS services on your behalf. Then create the role. The role is created, and I'll copy its ARN because we will need it later in this demo, so I'll open a notepad and save it there. Now we are going to attach this role to our Redshift cluster: go to the cluster, choose Manage IAM roles, pick the role we just created, allow-redshift-to-s3, and apply the changes. The cluster shows as modifying, so we have attached an IAM role to our cluster, and that part is done. The next part is the data, because we are going to load data from S3, so the data should already be available there. I'll open the S3 console; I have already uploaded the files into a bucket we created long back, inside a folder called data, and all the text files are in there. I have the zip file of those text files, and I will also point you to the documentation you can go through to download that zip file and to get the queries we'll be using. So right now the role is attached to the cluster and the data is sitting in S3; everything is set.
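For reference, the same role creation and attachment can be scripted; this is only a sketch with boto3, and the role name, cluster identifier and region are assumptions taken from this walkthrough:

```python
import json
import boto3

iam = boto3.client("iam")
redshift = boto3.client("redshift", region_name="us-east-1")

# Trust policy that lets Redshift assume the role on your behalf.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="allow-redshift-to-s3",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
role_arn = role["Role"]["Arn"]

# Read-only access to S3 is all the load step needs.
iam.attach_role_policy(
    RoleName="allow-redshift-to-s3",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Attach the role to the cluster (same as "Manage IAM roles" in the console).
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="cluster-1",
    AddIamRoles=[role_arn],
)
print("attached", role_arn)
```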
Now I just have to run the queries: first to create the tables, then to load the data from S3 into those tables, and then I can start querying. You might have wondered why I copied the ARN: in the COPY commands I have to paste that ARN, because the query editor itself does not have access to the bucket; providing the IAM role ARN in the COPY command is what lets the Redshift cluster read the data from S3. So I pasted the ARN into each of the load queries, and at the end there are also some sample queries that were provided. Let's start the process: I'll go to my query editor, where you can see the query results pane. First we have to create the tables, so I'll run the first CREATE TABLE statement; it completed in about 0.8 seconds, so one table is created. Now I'll load the data for that table: I can paste the COPY command into the same query box and hit run, or create a new one and run it there. Oh, sorry, there's an error; I did not change the role. These COPY commands are per table, copy users and so on, one for each table name, and I forgot to change the ARN, the Amazon Resource Name of the IAM role I just attached to the cluster. I'll paste the correct ARN into all of my queries, and now it works; the query has run successfully. You can keep using this one query box for all the operations or open another one. As a best practice, the documentation says to run one complete query per query page; I'm not sure exactly why, but when you run multiple statements together it sometimes throws errors, something like indentation errors, which is what was happening here. So run one query at a time, although you can run a query in the same query tab multiple times. I'll load the rest of the content the same way: the date table, the event table, then listing and sales, and the category table, each with its own COPY command, waiting for each to finish. So we've uploaded all the content into our tables. One thing to note is that you will not be able to browse the stored data here; in the query editor you can only run queries and look at the results, so there is no option to view all the available tables or explore the data. For that you'd use a SQL client or workbench, which gives you its own GUI to see the available tables and databases; in the query editor, querying is the only option. I'll also show what was in the content: this is the TICKIT sample dataset from the AWS documentation (the name shows up as ticket here), and these are text files, so you can see an ID, a code, first name, last name, email, phone number, and some true/false flags, and there are a lot of rows. Those rows are not loaded one by one; if they were loaded as single rows every time it would take a long time, so Redshift uses columnar data storage and handles that transparently, which is one of the things Redshift promises. I'll also give the link for downloading this sample data so you can use it for practicing.
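To make the loading step concrete, a COPY command of the kind used above looks roughly like this; the bucket name, file path and connection details are placeholders, and running it through psycopg2 here is just one way of executing the same SQL you would type in the query editor:

```python
import psycopg2

ROLE_ARN = "arn:aws:iam::123456789012:role/allow-redshift-to-s3"  # placeholder

# One complete COPY statement per execution, as the documentation suggests.
copy_users = f"""
    COPY users
    FROM 's3://my-tickit-bucket/data/allusers_pipe.txt'
    IAM_ROLE '{ROLE_ARN}'
    DELIMITER '|'
    REGION 'us-east-1';
"""

conn = psycopg2.connect(
    host="cluster-1.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="test1", user="admin", password="YourPassword1",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_users)
conn.close()
```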
Now that we have loaded all the content from S3 into Redshift, let's do some querying and practice. The first sample query fetches the table definition of the SALES table; you can see the result in the query results pane, and you can also download it as a CSV file and view it there, or view the results right here. Then another query: select the sum of the quantity sold from sales joined with date, where sales.dateid equals date.dateid, so for a given date it calculates the total quantity sold. When I run this query I get back a single number, because that is the only result we asked for. Then there is a query with a limit of 10, so we get ten rows of output showing first name, last name and total quantity. And another one: select buyerid and the sum of quantity sold as total quantity from sales, grouped by buyerid and ordered by total quantity, with a limit; I'll change the limit to 200. The editor shows a hundred rows on one page and the next hundred on the next, so I think 100 rows is the limit per results page, and you can always download the entire CSV file and look at the results for yourself. I ran a couple more sample queries the same way just to show how it works.
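Here is one of those sample aggregation queries written out; the table and column names (sales, date, qtysold, dateid, caldate) come from the TICKIT sample schema, the specific date is just an example, and the connection details are placeholders as before:

```python
import psycopg2

# Total quantity sold on one calendar date from the TICKIT sample data.
query = """
    SELECT sum(qtysold) AS total_quantity
    FROM sales, date
    WHERE sales.dateid = date.dateid
      AND date.caldate = '2008-01-05';
"""

conn = psycopg2.connect(
    host="cluster-1.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="test1", user="admin", password="YourPassword1",
)
with conn.cursor() as cur:
    cur.execute(query)
    print(cur.fetchone())  # a single aggregated row comes back
conn.close()
```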
You can see the query results come back quickly. There isn't a lot of data here, but even with a really large amount of data, queries run fast, because of how Redshift is architected: there is one leader node and multiple compute nodes, the workload is distributed accordingly, complex queries are divided into simpler ones and executed in parallel, and the results are combined on the leader node, which sends the final aggregation back to the client tools. There was also a question from the last session about IAM database authentication. I tried it out: you will not be able to assign roles directly to your own setup, but for IAM users you can provide programmatic access, and those users can use their access key, secret access key and session token to authenticate. The IAM database authentication feature itself you can only enable for RDS instances, RDS databases, because only there is the option available; after enabling it you can use the endpoint and those security credentials to log in. So on this side the option is to authenticate with the access key, secret access key and token; that is the only option. Coming back to the query editor, I forgot to show one thing: saved queries. If you think you'll run a particular query often, you can save it here: provide a query name, let's say query-1, add a description, and save it, and the query is saved. Now go to Saved queries: you can download the query as SQL text, or copy the entire SQL text and paste it wherever you need it. So it's pretty easy to just save your queries here and get them back whenever you need them, and it's easy to manage too, because you can search for the query you need and pull it up. So that is pretty much saved queries.
You can also change a saved query's attributes and delete saved queries whenever you need. And you can open a saved query directly in the query editor and run it from there; you don't need to copy it and paste it back in, which is another handy option. Next, snapshots. You can see here that an automated snapshot of 52 MB has been taken for cluster-1. At this point, if I want to restore my cluster, I can use this snapshot and restore the cluster directly from it as of that point in time, and it will give me the same cluster architecture: you can see the cluster details, two nodes, the Availability Zone (I'm not sure whether it restores into the same Availability Zone, and if that zone is unavailable, which may be the very reason you're restoring, you can choose the Availability Zone while creating the cluster), and you can choose the node type, the number of nodes and everything else. The snapshot shows progress 100% completed, and it took barely a second. So now we can simply restore from a snapshot.
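For completeness, the same snapshot-and-restore flow can be driven from boto3; this is only a sketch, and the snapshot and cluster identifiers are placeholders for this walkthrough:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Take a manual snapshot of the running cluster.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="cluster-1-manual-1",
    ClusterIdentifier="cluster-1",
)

# Once the snapshot shows as "available", restore it as a new cluster.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="cluster-1-copy",
    SnapshotIdentifier="cluster-1-manual-1",
)
```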
If you click Restore from snapshot you get these options: you have to provide a cluster identifier; if you already have CloudWatch alarms associated with your existing Redshift cluster you can choose Yes and it will duplicate them, and if you don't, just choose No so it won't check for any CloudWatch alarms, which saves some time. Then you can choose the node type and the number of nodes (only a few options are offered), the snapshot size is 52 MB, you can choose the region, and the cluster subnet group, VPC and security group are all taken as the current defaults, which is fine. That is more than enough to restore your cluster, so just click Restore, and it's done. One more thing here: you can also add a snapshot schedule. Provide a name, let's say snapshot-1, and a description, and then take a snapshot every 8 hours, or every few minutes, or whatever interval you want; I'll set it to every 24 hours. You can also configure a custom automated snapshot schedule: take a snapshot every weekday from 8 a.m. to 8 p.m., every weekend at every hour from 8 a.m. to 8 p.m., every weekday at 2 a.m., every weekday at 1 a.m., 3 p.m. and 7 p.m., or every weekend at 6 a.m. and 6 p.m. You can also add a custom rule: choose the days on which to create snapshots, and then either take one snapshot at a specific time on those days, or take a snapshot every so many hours between a start time and an end time. The times are in UTC, so you'll have to convert your local time to UTC and specify it accordingly. So now, for example, every Sunday and Monday it will take a snapshot every 2 hours between 16:14 and 18:14; I'm honestly not sure whether that produces two snapshots or three in that window, so that is something to test out, but the basic point is that you can create a schedule and custom rules for automated snapshots.
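The schedule itself can also be defined through the API; a minimal sketch is below, and the schedule identifier and the rate expression are just examples:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Define a schedule: one snapshot every 12 hours.
redshift.create_snapshot_schedule(
    ScheduleIdentifier="every-12-hours",
    ScheduleDefinitions=["rate(12 hours)"],
    ScheduleDescription="Automated snapshots twice a day",
)

# Attach the schedule to the cluster.
redshift.modify_cluster_snapshot_schedule(
    ClusterIdentifier="cluster-1",
    ScheduleIdentifier="every-12-hours",
)
```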
Coming back to Clusters, you can see the cluster copy is being created and restored. A quick question from the audience: is a snapshot basically a backup? Yes, it's a backup, and every time it runs it takes a snapshot of the data at that current time. Is it a full copy of all the data, or a space-optimized, OS-level style snapshot? I believe it includes all the data, because this one was 52 MB, and it is stored in S3. Is there a retention setting to specify how many snapshots to keep? Yes: the manual snapshot retention period is the number of days a manual snapshot is retained, and setting that parameter to -1 retains the snapshot indefinitely; so if you make a copy of a snapshot with -1 you can keep it for as long as you want and delete the original. Backups are stored on S3, so you have to check how much data is stored; S3 gives you 5 GB of free tier storage every month, and beyond that you're charged according to S3's standard storage costs. All the snapshots are stored in S3, so you won't be charged for the first 5 GB each month, and after that you're charged at S3 rates. Let's check S3 pricing, since that is where all the snapshots are stored. For the S3 Standard storage class, which is for frequently accessed data, the first 50 TB per month is about $0.023 per GB, the next 450 TB is $0.022 per GB, and the next 500 TB is $0.021 per GB; S3 Intelligent-Tiering is also around $0.023, and Infrequent Access is about $0.0125 per GB, so if your data is only accessed once a month or so you can store it there. For long-term backups there is S3 Glacier, which is very cheap at about $0.004 per GB per month, with retrieval options ranging from one minute to 12 hours; so if you're fine retrieving your archived data within, say, twelve hours, Glacier is the storage class for long-term backups and archives. Those are the storage charges; there are also charges for requests and data retrieval (for example around $0.05 per thousand retrieval requests), for PUT and GET requests, and for data transfer in and out, and those apply across the S3 storage classes, but the main pricing to watch is the per-GB storage price. So that's snapshots and snapshot schedules. Going back, the restored cluster is available now, but I don't need it, so I'll delete it without taking a final snapshot; it was just the restored copy of our cluster, and now it's deleted.
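As a back-of-the-envelope illustration of those numbers, estimating a monthly snapshot storage bill could look like this; the prices are the ones quoted above and are only illustrative, since actual S3 pricing varies by region and changes over time:

```python
# Illustrative S3 pricing from the discussion above (USD per GB-month).
STANDARD_PER_GB = 0.023   # first 50 TB in S3 Standard
GLACIER_PER_GB = 0.004    # S3 Glacier for long-term backups
FREE_TIER_GB = 5          # assumed free-tier allowance per month

def monthly_cost(snapshot_gb: float, per_gb: float) -> float:
    """Rough monthly storage cost after the free-tier allowance."""
    billable = max(snapshot_gb - FREE_TIER_GB, 0)
    return billable * per_gb

print(monthly_cost(0.052, STANDARD_PER_GB))  # a 52 MB snapshot -> $0.00 (within free tier)
print(monthly_cost(1024, STANDARD_PER_GB))   # ~1 TB of snapshots -> about $23.44
print(monthly_cost(1024, GLACIER_PER_GB))    # same data archived in Glacier -> about $4.08
```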
Okay, so now we'll see events, and then we'll go ahead and do the cross-region replication for Aurora. Events work the same way for pretty much every AWS service. Here you can see the event list: for the cluster copy-1 there are events like restart began, restart completed, cluster created, and so on. You can also create an event subscription: choose the categories you want, either everything or just particular ones. For the pending category there are four event types you'll get mail for; for configuration there are info and error events; for security there are events like security group created, updated or deleted; and for monitoring there are a lot of events, cluster rebooted, node replaced, creation or deletion failures, alarms, so for example whenever a node is replaced you get an event. (Sorry, it sounds like my audio dropped for a moment there; I'll just continue.) For management there are events like cluster created, cluster deleted, restore from snapshot cancelled, cluster restore completed. Basically you can subscribe to any of these from here, and you can also pick the severity: if you only want error mails and no informational ones, just select errors occurring on your Redshift cluster. Then choose the source type, whether you want events for the cluster, the snapshot or the parameter group, or choose all; if you choose cluster you can pick a specific cluster or add any number of them, so I'll just go with cluster-1. Next, provide a name for the subscription, let's say "one"; it is enabled by default, but if you want to create the subscription and enable it later, click No here. Then either choose an existing SNS topic, provide the ARN of a topic directly, or create one here; I'm picking a topic I created before, and create. Done; the event subscription is active now, so for any matching event I'll be getting mails. Event subscriptions and events work almost exactly the same, in fact the same, for any other AWS service. So we've covered today's concepts: Redshift clusters, the query editor and how to run queries there, saved queries, how to restore from snapshots, and events.
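The same error-only subscription can be created through boto3; this is just a sketch, and the subscription name and SNS topic ARN are placeholders:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Subscribe an existing SNS topic (placeholder ARN) to error events on cluster-1.
redshift.create_event_subscription(
    SubscriptionName="cluster1-errors",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:redshift-alerts",
    SourceType="cluster",
    SourceIds=["cluster-1"],
    Severity="ERROR",   # only error-level events, no informational mails
    Enabled=True,
)
```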
We'll also look at reserved nodes. If you're going to keep using this particular Redshift cluster for a year or two, you can purchase reserved nodes from here: select the node type, select the contract term, and select how you want to pay. With all upfront you make a single one-time payment, for example around $1,300, and the hourly rate is $0.00; with partial upfront you pay, say, $750 up front plus a reduced hourly rate; and with no upfront you pay nothing up front and are charged an hourly rate of around $0.20, but the node is reserved for you for the whole year. The total one-time payment is shown here; anyway, I'm not going to buy one, so I'll just go back. Before moving on, there was a question about node slices and the distribution of data across those slices. We don't really get a provision to see that; it is taken care of automatically by Redshift, and we don't have the authority to check where exactly each piece runs. But the idea is that if you have two compute nodes, each node has multiple node slices, each slice has its own portion of memory and storage, the workload coming into a compute node is distributed across its slices and processed in parallel, and the compute node then gathers the results, sends them to the leader node, and the leader node sends the final result back to the client application. The follow-up was about a deeper dive into the database design side of Redshift, tables already loaded, how to build sort keys and distribution keys, and how to avoid hot spots where a query hits a particular slice, so that data is distributed evenly. Those are warehouse design topics, so in the next session I'll cover data sources, designing the warehouse, maintaining it, and analyzing query performance, including how node slices work and how data is aggregated across them. Today I just wanted to cover the basics: launching Redshift and showing how exactly the dashboard works.
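Coming back to the reserved node purchase for a moment: if you wanted to script it rather than click through the console, a sketch with boto3 might look like this; the offering shown is chosen arbitrarily, and the purchase call is left commented out because it creates a real billing commitment:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# List the reserved node offerings (node type, term, payment option, price).
offerings = redshift.describe_reserved_node_offerings()["ReservedNodeOfferings"]
for o in offerings[:5]:
    print(o["NodeType"], o["Duration"], o["OfferingType"], o["FixedPrice"])

# Purchasing is a single call once you've picked an offering id:
# redshift.purchase_reserved_node_offering(
#     ReservedNodeOfferingId=offerings[0]["ReservedNodeOfferingId"],
#     NodeCount=2,
# )
```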
First let's look at cross-region replication for Redshift itself. There is no single button for it the way there is in Aurora, where you can just open Actions, hit cross-region replication and it does all the work for you. In Redshift it works through snapshots: go to Backup, then Configure cross-region snapshots. Enable copy snapshots, choose the region the snapshots should be copied to, for example us-east-2, and choose the retention period; they've given 35 days here, and for the manual snapshot retention period you can provide -1, which retains the snapshot indefinitely. The 35-day limit applies to automated snapshots, that is the maximum retention period for them, so if you want to keep one longer you make a copy of it so it becomes a manual snapshot and then set the retention to -1, which basically means the retention period is indefinite. I'll save that, so now the copy configuration is in place for us-east-2. Once a snapshot gets copied into that region, you can choose that snapshot there and restore your cluster in that region as well; that is how you do cross-region replication with Amazon Redshift clusters. The snapshot data, the backed-up data, will be the same on both sides, but there is no interconnection between the two clusters. So this is how you do it in Redshift, whereas in Aurora you have a complete option, a separate button, for it; we'll get to that shortly.
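The same configuration can be enabled through the API; a minimal sketch, with the cluster identifier and regions taken from this walkthrough as assumptions:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Copy snapshots of cluster-1 into us-east-2 as they are created.
redshift.enable_snapshot_copy(
    ClusterIdentifier="cluster-1",
    DestinationRegion="us-east-2",
    RetentionPeriod=35,                # days to keep copied automated snapshots
    ManualSnapshotRetentionPeriod=-1,  # keep copied manual snapshots indefinitely
)
```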
Before moving on, there was a question: if I create a cross-region snapshot on the other side, in another region, and periodically want to test it, how long does a restore take based on the size, and with the retrieval options shown earlier for Glacier, am I charged only at the point when I run the restore test, and does the charging stop when the test finishes and I shut it down? Yes: when you are retrieving data, only that retrieval period is charged, and once the retrieval from Glacier stops you are not charged for it any more. The follow-up was about SLAs: if I have, say, one terabyte and based on the SLA I need to bring the entire warehouse up within one or two hours, say one hour, do I pick the retrieval speed and price accordingly? For Glacier there is a separate setup for that: you create a vault, store the data in the vault, and for that vault you configure how quickly the data should be retrievable, within one hour, within a few hours, and so on, and you are charged only for the time you actually spend retrieving. I haven't worked on that part hands-on; on the Glacier pricing page, under data retrieval requests, there are standard, bulk and provisioned options, and there is some provisioned capacity you can buy, but I don't know the exact charges off-hand, so I'll check exactly how to configure, for example, a one-hour restore and what it costs, and answer that before the next session. Now, back to the cross-region snapshot copy: did it copy over? The configuration says it will automatically copy your automated and manual snapshots, and you incur data transfer costs when snapshots are copied to the destination region. Let me create a manual snapshot right now to test it; I think once this gets created it will also show up as a cross-region copy. It's about 1 MB, and you can see the type is manual. The snapshot that already existed when I configured the copy won't be copied; only the ones created after the configuration will be, so let this one finish and we'll check at the end whether it appears in the destination region. Next, we have to launch an Aurora cluster. If you create an Aurora database and then do cross-region replication, the entire database architecture is replicated there: the data will be there and the instances will be there. I'll just choose db.t3.small, and the database is now getting created; once it's ready we'll be able to do the cross-region replication. In the Actions menu there is a cross-region read replica option, and there is also a cloning option, create clone; the database hasn't been created yet, so let's wait until it is. (No, you didn't lose my voice; I was just checking when the cross-region replication option becomes available. I've configured us-east-2 as the destination.) While we wait, another question: what is the concept of cross-region replication, is the database in the remote region read-only? Yes: for one primary database you can create read replicas in up to five other regions, and within a particular region you can have up to 15 read replicas for that primary database. The primary database is just one, and when it fails, a replacement is chosen according to the failover priority you've set.
For example, you will already have a replica, and that replica becomes your new primary database. But you cannot create a primary database directly in another region using cross-region replication; you have to take a snapshot, copy the snapshot to the other region, and then launch a new primary cluster there. Cross-region replication only creates read replicas, and from that one cross-region read replica you can create up to 15 more read replicas in its region, so a cross-region read replica cluster can itself have up to 15 replicas. For the primary it's basically the same as Redshift: you restore from a snapshot. For read replicas it's simple and the options are the same. Also, the connectivity between those read replicas and the primary database instance will have a lag; that is stated in the documentation, and it's expected, because the networks between the source database cluster and the replicas in other regions are completely different. So right now let's create a cross-region replica: click on the database cluster you want to replicate, choose create cross-region read replica, and choose the region. I first went to Oregon, but the Amazon VPC doesn't exist in that region for me, and they only offer certain regions, so I went with Ohio. It's not going to be Multi-AZ; you can select the DB instance class here too, and it doesn't need to be the same class as your primary, so if you're going to use that region infrequently you don't need a powerful instance type, you can choose a smaller one. Then provide the identifier, cluster-1, and the rest isn't required; I don't want enhanced monitoring or performance insights. When I create it, there's an error: database-1 doesn't have binlogs enabled, so we'll have to enable binary logging for it. For that I went to the cluster parameter group for this cluster, the custom aurora1 parameter group, and set binlog_format to MIXED; it saved, and once the cluster shows the change as applied we can try again.
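For reference, that parameter change can be made through boto3 as well; the parameter group and instance names below are placeholders for the ones used in this walkthrough:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# binlog_format lives in the *cluster* parameter group; the group name here
# is a placeholder for the custom group attached to the Aurora cluster.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="custom-aurora1",
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "MIXED",        # ROW or STATEMENT also work; OFF does not
        "ApplyMethod": "pending-reboot",  # static parameter, needs a reboot
    }],
)

# Reboot the writer instance so the change takes effect.
rds.reboot_db_instance(DBInstanceIdentifier="database-1-instance-1")
```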
So let's try once again: Ohio as the region, public access set to no, identifier db1-cluster-1. It still complains, because the cluster parameter group is not yet in sync on the writer; it's still modifying. It turned out the cluster was still using the default parameter group rather than the custom-aurora1 group I edited, so I selected the custom group on the cluster as well; now we'll try once again. The point to remember is that for cross-region replication, binary logging has to be on for the databases you are replicating. While that's applying, did the Redshift snapshot copy appear in the destination region? Yes, the copy has been created over there, so the cross-region snapshot replication we enabled works, and I could now restore directly from that snapshot in that region; I'll just delete these snapshots since I don't need them. Coming back here, the Aurora cluster is still modifying. Even in the last session we had trouble with binlogging in RDS; with a local MySQL database we have the authority to change the configuration files ourselves and set whatever we need, but here the only thing we can do to enable binlogging is set binlog_format to ROW, STATEMENT or MIXED, anything other than OFF; if that's set, binary logging is available. It's still taking some time, and it says the change will be applied after a reboot, so I'll reboot the instance; once the reboot is done the changes are made, and then we should be able to do the cross-region replication. The process itself is simple: choose the database cluster and hit create cross-region read replica. If you want one more replica in the same region, use Add reader; if you want to clone the entire database cluster, use Create clone; there is also restore to a point in time, and an option to add replica auto scaling. The reboot has happened, the configuration shows in sync, so let's try the cross-region replication once again. (I did the same thing in the last session and it still said binlog was not enabled, so there seem to be some quirks here.) Now the read replica creation has been initiated, so we go to the Ohio region and check whether it is getting created; it's slow, I don't know why. Meanwhile I'll delete my snapshots, since I'm not on the free tier for those; I think they'll be removed once the cluster is gone. Still loading... there, it came up. So it created two instances; I thought I unchecked Multi-AZ, but I don't remember exactly, and in any case it created two readers, both readers, as you can see here. So it really is a click of a button to create a cross-region replica for an existing database cluster.
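That click-of-a-button replica can also be created through the API; this is only a sketch under the assumption of an unencrypted Aurora MySQL cluster named database-1 in us-east-1, and all identifiers and the instance class are placeholders:

```python
import boto3

# Client in the *destination* region (Ohio in this walkthrough).
rds_ohio = boto3.client("rds", region_name="us-east-2")

source_arn = "arn:aws:rds:us-east-1:123456789012:cluster:database-1"  # placeholder

# Create a replica cluster that streams changes from the source cluster's binlog.
rds_ohio.create_db_cluster(
    DBClusterIdentifier="database-1-replica",
    Engine="aurora-mysql",
    ReplicationSourceIdentifier=source_arn,
)

# A cluster has no compute until you add an instance, so create the reader.
rds_ohio.create_db_instance(
    DBInstanceIdentifier="database-1-replica-instance-1",
    DBInstanceClass="db.t3.small",
    Engine="aurora-mysql",
    DBClusterIdentifier="database-1-replica",
)
```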
But you cannot replicate your primary database instance directly: you have to take a snapshot of it, copy the snapshot to the particular region where you want the new database, and use that snapshot to create a database cluster out of it; that's how it has to be done. Right now it has just created reader instances, not nodes, sorry, reader instances: db-1 in us-east-2a and us-east-2b, so it did a Multi-AZ deployment automatically for us. It will take some time to create them first and then copy the data across, so it will take a while to load, but that is how you do it. Okay guys, we've come to the end of this session. I hope this session on Amazon Redshift was helpful and informative for you. If you have any queries regarding this session, please leave a comment below and we'll be glad to help you out. Thank you.
Info
Channel: Intellipaat
Views: 14,171
Keywords: amazon redshift tutorial, amazon redshift tutorial for beginners, amazon redshift architecture, amazon redshift spectrum, amazon redshift spectrum tutorial, amazon redshift demo, amazon redshift sql workbench, amazon redshift sql, amazon redshift sql tutorial, amazon aws redshift tutorial, Amazon Redshift Cluster, amazon redshift configuration, amazon redshift connection, Amazon Redshift Data Warehouse, AWS training, amazon redshift intellipaat
Id: ojE4RUYZl1E
Length: 153min 22sec (9202 seconds)
Published: Sun Apr 19 2020