Data Warehousing on AWS with Redshift - with a demo!

Captions
Hi everyone, my name is Alpana Gulab, and welcome to today's session. Today I want to dive into some concepts, strategies, and best practices for designing a state-of-the-art cloud data warehousing solution. A state-of-the-art data warehouse needs to handle thousands of concurrent users, provide the best performance, and do so at the cheapest, most predictable cost. As most of us already know, Amazon Redshift is the petabyte-scale data warehouse on AWS, and that's the service I'm going to be diving into today. Before I get started, I want to run this session a little differently, a bit more interactively, and to help with that I'm joined by my good friend and fellow solutions architect, Peter, who's going to help me get my message across. Hi Peter!

Hi Alpana.

Okay, so Peter has been working with our customers for some time on things like cloud migrations and the re-architecting and modernization of customer workloads. So Peter, I'm hoping you can get us started with the customer perspective: what are we seeing in the data analytics and data warehousing space, and what are the new realities customers are facing?

Cool. There is in fact a huge amount of change happening in the world of data analytics. For one, there is exponential growth of event data, and data is now viewed as a strategic asset: no longer something cumbersome to store and expensive to maintain, but instead part of companies' business models. We also see demand for end-to-end insights from analyzing data: customers want real-time analysis and don't want to wait for data to be ingested. Traditional forms of data management are challenging to catalog and hard to analyze, on-premises data is often in silos, and building ETL can be complex and expensive. There's also a trend of large-scale migration to the cloud, with many benefits and, of course, without compromising security, and some data warehouses, like Netezza, are announcing end of support. The question I often get asked, and I'll put it to you, is: how do customers take steps to become data-driven companies?

Okay, so becoming a data-driven company speaks to getting the most value from your data, and the solution here is modernizing your data warehouse to enable you to turn data into insights: breaking free from legacy databases, moving to managed databases, and building data-driven apps to gain insights. For data warehousing in particular, we start by modernizing the data warehouse, so let's go over what data warehouse modernization actually means. The term data warehouse is not a new concept, but there are a lot of improvements customers desire. Take scaling: in a classic data warehouse it can happen that we can't run a specific report because there isn't enough storage or enough processing power, and these shouldn't be limiting factors. Then of course we have to talk about speed: today we want and need our data warehouse to be fast for multiple users. The data warehouse is no longer accessible to just one team or one analyst, and we should be able to scale both storage and performance. If we consider this to be the era of web services, I would expect that when a user or customer submits a request on a website, it can run a query on the data warehouse and give me results.
Think of the scale of a school, a large organization, or even a country's population running queries on a single data warehouse, with added features and functionality. At the end of the day it needs to be easy to use. I mentioned that the data warehouse of today should ideally be accessible to more teams and more customers, so it needs to be more intuitive for business users: I shouldn't have to talk to a lot of people or understand complex terminology to benefit from the data warehouse. One last important concept when we talk about modernizing a data warehouse is data lakes. Data warehouses were, and are, relational in nature, but data today is much more unstructured: we've got text, videos, images, and it doesn't make sense to hold these in relational databases. This is where the data lake comes into play. Now how do we combine them? Say, for example, you've got a streaming service using the data warehouse, and tweets or hashtags about the streams, which are unstructured, stored in the data lake. We should be able to correlate the data in the data warehouse and the data lake to deduce insights.

Cool. So we already know Amazon Redshift is the petabyte-scale data warehouse offering from AWS. Some fun facts for our viewers: the Redshift service was launched in 2012, and it comes from the Amazon culture of innovation; it was what customers told us they needed, and on the back of that we brought the service to market. Redshift changed the economics of data warehousing: there are no licensing costs, and you only pay for what you use. Today a lot of other cloud data warehouses have launched, but Redshift remains the most widely used, fastest at high scale, and lowest-cost cloud data warehouse available. So my next question is: can Redshift accomplish all of the above and modernize my data warehouse?

Yes and yes. We have tens of thousands of customers that collectively process over two exabytes of data every day with Redshift, and here's why. I talked about data lakes: adopting a data lake-based architecture for your analytics will help you tap into your data. You can take advantage of data sharing, real-time analytics, and ML and AI use cases, all without re-architecting, because Redshift is fully integrated with your data lake. Then there's transformation: moving data into your data warehouse is not easy, and Redshift is the only cloud data warehouse that allows you to analyze any size of data in your S3 data lake without any ETL. Redshift also has the best performance: it is the world's fastest data warehouse, with up to three times better performance than other cloud data warehouses, and it keeps getting faster; in fact, we have Redshift offerings today that are ten times faster than what was available two years ago. Another reason Redshift should be the data warehouse of choice is that it offers the best value. When it comes to pricing, we're talking about predictable costs: usage-based pricing instead of license-based pricing, so you can get started very easily. We've had customers start small, at just 25 cents per hour, and you can scale up for under a thousand dollars per terabyte per year.
Then with reserved instance pricing, you can save up to 75% compared to on-demand pricing. There's another major benefit around pricing: unlike other data warehouses, where you need premium versions for additional enterprise capabilities, Redshift's pricing includes built-in security, data compression, backup storage, and data transfer. Then we talk about scaling: clusters can scale from 160 GB to 2 petabytes of storage, resizing is completely managed by AWS, and there's no limit to how often or how much your cluster can scale. You can also scale to exabytes of data in your S3 data lake by querying it directly, without needing to load it into the data warehouse. Then security, always our number one priority at AWS: Redshift, like all of our services, includes the comprehensive range of AWS security features at no extra cost. Finally, Redshift is simple to set up and operate, and most administrative tasks are automated. It automatically performs maintenance tasks like VACUUM and ANALYZE and optimizes queries, and with Redshift Advisor you will receive prompts to make changes to keep optimizing your workloads.
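For reference, the maintenance tasks mentioned here map to ordinary SQL commands that you can still issue yourself; a minimal sketch, assuming a table named sales like the one in the sample data set used later in the demo:

```sql
-- Reclaim deleted space and re-sort rows (Redshift now schedules this
-- automatically in the background, but it can still be run manually):
VACUUM FULL sales;

-- Refresh the table statistics that the query optimizer relies on:
ANALYZE sales;
```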
Cool, so you've given us quite an idea of what we can do with Redshift and why it should be the data warehouse of choice. Could we see it? Could we launch a cluster and build ourselves a state-of-the-art data warehouse in the cloud?

Yes, of course. What I want to do now is walk through a really quick process of creating and launching an Amazon Redshift cluster; then we're going to add a sample data set, which Redshift will load, and then we'll connect to the cluster using the query editor. So let's have a look at the AWS console. Some prerequisites to launching your cluster: it's very simple and very basic. You just need an AWS account, which you can very easily create, and you need to be aware of your company's firewall rules, because like any database, we need to communicate with it over an open, available port. This is the AWS Management Console front page. We've signed in to the console, and to get to the Redshift management page we can type "Redshift" into the search bar, but I've already got it here under recently visited. This is the Redshift welcome page, and here we're going to click on Create cluster. There are a couple of configurations here, nothing too complicated. In the first part we've got the cluster configuration, where we have to set the cluster identifier; this is just the unique name that identifies the cluster. Then the question is what you're planning to use this cluster for. There is a free trial option, and depending on whether you've ever used Redshift before, I think you get about two months of a specific instance node type for free, but I'm going to go with Production because I want to show you the different options. The next option is to choose the size of the cluster. There's a "Help me choose" option, which works from things like your estimated storage, how many months of data it contains, and how much of the data you want to query at a time. I'm going to go with "I'll choose", and we'll pick the DC2 node type, specifically dc2.large, the smallest instance type available. The number of nodes, as we said, can in this case range from 1 to 32; let's make it three nodes. There's a little configuration summary that shows the estimated on-demand compute price. For the sample data, I'm just going to tick Load sample data, which includes a sample data set and sample database that we'll do some querying on. Then there's the database configuration, where all we set is the username and password, and the cluster permissions: if you want to access other AWS services, you need to grant Redshift permission by associating IAM roles, but we'll do that a little later. In the additional configurations section there are things like your VPC, your subnet (preferably a private subnet), and the option to change the port, which is 5439 by default for Redshift. We'll leave all of this as is and simply click Create cluster. In my experience this generally takes just a few minutes, but let's say four to five minutes to be safe.

While that runs, I want to go back and make sure there's a proper understanding of what we did and some of the configuration options we chose. As a reminder, Redshift started out as a Postgres fork, but we completely rewrote the storage engine to be columnar, we made it an OLAP relational data store by adding analytics functions, and we made it a massively parallel processing (MPP) system so that it can scale. We preserved capabilities from Postgres, which is why you can actually use a Postgres driver to connect to Redshift, but it's important to note that Redshift is an OLAP relational database, not an OLTP relational database like Postgres. We then integrated Amazon Redshift with other services in the AWS ecosystem: VPC, KMS, and IAM for security, S3 for data lake integration and backups, and of course CloudWatch for monitoring. All of this together makes up the service we know as Redshift.

Let's now take a closer look at the Redshift architecture. You can connect to a Redshift cluster with both JDBC and ODBC drivers, and since it was forked off Postgres, you can make use of open-source Postgres drivers as well. The part of the cluster that the user connects to and works with is what we call the leader node. The leader node is required and automatically provisioned in every cluster, but customers are not charged for it. The leader node is the user's JDBC or ODBC entry point to the cluster; it stores metadata, performs query compilation and optimization, and caches result sets. Then there are the compute nodes, three in our case, which are responsible for the actual query execution and data manipulation. Because Redshift employs a shared-nothing architecture, compute nodes operate in parallel, and on a larger node type you can have up to 128 of them. Lastly, there is interaction with S3: loading and unloading of data, for example, as well as backups and restores, all running in the background.
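Since the Postgres lineage is preserved, the leader node also exposes the cluster layout through SQL system views; here is a small sketch of how you might confirm the node and slice topology from any SQL client, using the standard STV_SLICES system view:

```sql
-- Show how data slices map onto the compute nodes of the cluster;
-- with three dc2.large nodes you would expect two slices per node.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```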
Finally, one of the defining features when we talk about the state-of-the-art data warehouse is Redshift Spectrum. With Spectrum you can query data residing in S3 without having to load it into Redshift tables; what you're effectively doing is extending your data warehouse to exabytes of data in your S3 data lake.

Cool, thanks Alpana. I'm familiar with Redshift Spectrum in theory. I've read that it actually resides on its own dedicated servers, which are independent of your cluster, and that it pushes compute-intensive tasks down to the Redshift Spectrum layer, so these queries use much less of your cluster's processing capacity than other queries. Redshift Spectrum also scales intelligently and takes advantage of massively parallel processing. Would you be able to show us how it works now?

Yes, sure. Our cluster should be up and running now; let's have a look. What I want to do first is query our cluster (remember we included some sample data in there), and then I'll show you how to get started with Redshift Spectrum. Our cluster's status shows that it is available, so it is up and running. To query this database there are the different connection options we talked about, but the easiest way is just to open the query editor here on the left-hand side. Then we connect to the database via the "Connect to database" button and create a new connection: there's our cluster name, our database name is dev, and our user is awsuser. We connect to our database, have a look at the public schema, and here is the list of tables we loaded as our sample data. There's the EVENT table and its table definition. We can run simple commands like SELECT COUNT(*) FROM event: we click the Run button, give it a few seconds, and there are our query results. Query 59 ran in three seconds, and the row count is 8,798. We can write different types of queries here; just to give you an example, I'll copy and paste a couple from our documentation. Here's one that finds the total sales on a given calendar date: we run the query, give it a few seconds, and there are the results. Here's another one that finds the top 10 buyers by quantity. So these are just basic SQL queries that you can run directly from the query editor against the Redshift cluster; again, give it a few seconds, and there are the query results with the top 10 buyers.
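The queries from this part of the demo look roughly like the following, assuming the standard TICKIT sample schema that the "Load sample data" option creates (adapted from the AWS documentation examples):

```sql
-- Count the rows in the EVENT table (8,798 in the sample data set):
SELECT COUNT(*) FROM event;

-- Find the total sales on a given calendar date:
SELECT SUM(qtysold)
FROM sales, date
WHERE sales.dateid = date.dateid
  AND caldate = '2008-01-05';

-- Find the top 10 buyers by quantity:
SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, SUM(qtysold) AS total_quantity
      FROM sales
      GROUP BY buyerid
      ORDER BY total_quantity DESC
      LIMIT 10) AS q, users
WHERE q.buyerid = userid
ORDER BY total_quantity DESC;
```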
You can of course load your own data using the COPY command, but I want to show you how easy it is to take advantage of Redshift's features. We spoke about Redshift Spectrum, which you can use to query data in Amazon S3 files without having to load it. Let me show you how to get that going. We have our cluster, and we have an S3 bucket containing the AWS sample data. Now we need to allow Redshift to access the S3 bucket, then create an external schema, then create an external table that references the data in that bucket. First, to allow Redshift access to S3, we associate an IAM role with the cluster. If you remember, when we created the cluster earlier there was an option to do this, but we can also just go back to the cluster now and edit it; I'll show you how. Back on the main page, we select the cluster, and under Actions we choose Manage IAM roles. We have a role here that I created, called myspectrum_role, so all we do is associate that IAM role and save the changes. Now, all external tables must be created in an external schema, and the external schema references a database in an external data catalog: this can be an Athena data catalog or a Glue Data Catalog. We're going to create an external database in the Amazon Athena data catalog. To do this we head back to our query editor, make sure we're still connected, and run a CREATE EXTERNAL SCHEMA command: the schema is called spectrum, and it references the IAM role we just added. We click Run and give it a few seconds to create the schema. Once we have that, we create the external table with a simple CREATE EXTERNAL TABLE command: the table is called sales, and it references the location of our bucket in S3. Again, click Run and give it a couple of seconds to complete. With the external table created, we can query it using the same SELECT statements you would use to query any other Redshift table, for example a simple SELECT * from the table we just created. And remember the EVENT table in our database: we can now run a join between the external spectrum.sales table and the local EVENT table. We run the join, give it a couple of seconds, and there's the result set: this is the result of joining a table in our data warehouse with a table residing in S3. Let me show you something else that's pretty cool: the query plan. We can click on query 203 to look at its details, or go here to view all the queries that were run on our Redshift cluster. Here is our join, and if we look closely at the query plan we can see an S3 query scan of the sales table, which shows us that it is scanning the table in S3.
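The Spectrum statements from the demo look roughly like this; the account ID in the role ARN and the bucket path are placeholders to replace with your own, and the column list follows the TICKIT sales layout from the AWS samples:

```sql
-- Register an external schema backed by a database in the Athena/Glue
-- data catalog, using the IAM role associated with the cluster:
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/myspectrum_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over the tab-delimited sales files in S3;
-- no data is loaded, the table is just metadata pointing at the files:
CREATE EXTERNAL TABLE spectrum.sales(
    salesid    INTEGER,
    listid     INTEGER,
    sellerid   INTEGER,
    buyerid    INTEGER,
    eventid    INTEGER,
    dateid     SMALLINT,
    qtysold    SMALLINT,
    pricepaid  DECIMAL(8,2),
    commission DECIMAL(8,2),
    saletime   TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://awssampledbuswest2/tickit/spectrum/sales/';

-- Join the external table in S3 with the local EVENT table:
SELECT TOP 10 spectrum.sales.eventid,
       SUM(spectrum.sales.pricepaid) AS total_paid
FROM spectrum.sales, event
WHERE spectrum.sales.eventid = event.eventid
  AND spectrum.sales.pricepaid > 30
GROUP BY spectrum.sales.eventid
ORDER BY total_paid DESC;
```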
So yeah, that was almost too easy. That was awesome, thanks so much, Alpana, very cool. One of the biggest challenges I've seen with customers is managing that substantial amount of additional data: there's a constant trade-off between what data to load into the data warehouse and what data to archive to storage, and it has to do with managing costs and keeping ETL complexity low. I can see that Redshift Spectrum makes it easy to analyze large amounts of data in its native format without requiring you to load it. But of course Redshift as a data warehouse is scalable, right? How easy is it to scale a Redshift cluster? Because we obviously need to make sure we minimize any downtime to the business.

Yes, absolutely. To increase query performance, or to increase CPU and memory, you can very easily increase the number of nodes in the cluster, and there are two options for this. First there's the classic resize option. This is the older option, and it requires the cluster to go into read-only mode; depending on the size of the cluster, that could last from an hour up to a few days. This is because of the way it works: it provisions a new cluster while copying data over from your source cluster, distributes the data to the new nodes according to the distribution style, and then runs the ANALYZE command. Then there's the elastic resize option, which is the recommended one. It requires less time; it does not, however, sort tables or reclaim disk space. It takes about 10 to 15 minutes, during which time the cluster is in read-only mode. There are some sizing limitations with elastic resize: you can only change to certain instance types, and the number of nodes you can change to is, I believe, either double or half. Let me show you what I mean. Back on our dashboard we've got our Redshift cluster; we can click on it to open it up, and to scale it we simply go to Actions and then Resize. As I said, two options: elastic resize and classic resize. We're currently using dc2.large, and there are a few node types we could change to, but we'll leave it at dc2.large. We're currently running three nodes, so we can only increase to double that, which is six nodes. There's the option to resize the cluster now or to schedule it for a later time; we'll click "Resize cluster now". This process again takes about 10 to 15 minutes, and we should then have a cluster of six nodes.

You mentioned different instance types, or node types; how do I decide which one to use?

Sure, there are three types of nodes; let me show you, from the create page that we looked at earlier. If we have a look here, starting at the bottom, there are the DS2 nodes. These are the legacy nodes, using hard disk drives, and we actually recommend using RA3 nodes instead: if you're using eight or more DS2 nodes, we'd recommend that you upgrade to RA3, and you'd get two times more storage and better performance for the same on-demand cost. Then there are DC2 nodes, the type we launched just now. DC2 nodes have local SSD storage included, so data is stored locally for high performance; for data sets under one terabyte we recommend DC2 node types. And if you expect your data to grow, we'd recommend RA3 nodes, so you can size compute and storage independently. RA3 nodes have managed storage, and you can scale and pay for compute and storage independently. The managed storage uses high-performance SSDs, and when a node's data grows beyond the size of the local SSDs, managed storage automatically offloads that data to Amazon S3. Also keep in mind that with all three of these node types, you choose the size and number of nodes based on your performance requirements, and of course you only pay for what you use.

That's a good segue to pricing and the pricing structure. How does Redshift offer the best value compared to other cloud providers?

With all three of these node types there is an on-demand pricing structure: you pay for capacity by the hour, with no commitments and no upfront costs, and you simply pay an hourly rate based on the type and number of nodes in your cluster. There's also a pause-and-resume feature, which I'm not sure you've heard about, that allows you to suspend on-demand billing during the time the cluster is paused. Then there's also something called reserved instances: with reserved instances you can benefit from significant savings over on-demand by committing to use Redshift for a one- or three-year term, and there are three payment options.
There's No Upfront, where you pay nothing upfront and commit to paying monthly; Partial Upfront, where you pay a portion upfront and the remainder over the rest of the term; and of course All Upfront, where you pay for the entire reserved instance in one upfront payment.

Cool. So we've covered Redshift Spectrum and connecting to the data lake, and we've looked at how Redshift is the most scalable and the best value with predictable pricing models. Now what about security? We all know security is the number one priority at AWS, and it's also a huge concern for my customers. How does Amazon Redshift keep our data secure?

First, Redshift, like all of our AWS services, benefits from the AWS security building blocks. Being part of the AWS ecosystem means that Redshift as a service integrates with AWS IAM, for example, with Amazon VPC, and with AWS KMS, our key management service, to protect data in transit and at rest.

You mentioned IAM, identity and access management; that reminds me, does Redshift support single sign-on?

Yes. If you want to use your corporate identity provider, whether it's Azure Active Directory, Active Directory Federation Services, or any other SAML-compliant identity provider, you can configure it to provide single sign-on.

Cool. And what about MFA as well?

Yes, you can use multi-factor authentication for additional security when you're authenticating.

Okay, that's awesome. Just one other thing: can you access those compute nodes directly?

No, this is not possible. Compute nodes are in a private network space and can only be accessed from your data warehouse cluster's leader node, and if you think about it, this provides an additional layer of security for your data.

Okay, got it. Just one last question on the topic of security: how granular can you go with access controls? Can you have column-level security?

Yes, you can have column-level security. You can set it up so that users see only the data they should have access to, and you do this by granting or revoking column-level privileges to a user or to a user group.
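A minimal sketch of what column-level grants look like in practice; the user name analyst and the chosen columns here are hypothetical:

```sql
-- Let the analyst user read only the non-sensitive columns of USERS:
GRANT SELECT (userid, firstname, lastname, city, state)
ON users TO analyst;

-- Withdraw access to individual columns again later:
REVOKE SELECT (city, state) ON users FROM analyst;
```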
Cool. So now that we've covered the security aspect, just one more challenge: what does Redshift do in terms of improving performance and performance management?

When it comes to improving performance there are of course a number of things we can look at: right-sizing instance nodes for one, running VACUUM or ANALYZE on your data, and tuning queries, for example. But there's another option: Redshift's workload management (WLM) feature, which allows the separation of different query workloads. Your goals here are either to prioritize important queries, for example, or to throttle less important queries, and you can also use it to divide up cluster memory or to set query timeouts for long-running queries. By default Redshift enables automatic WLM, and this is done through parameter groups: the default parameter group has automatic workload management enabled, and we can create a parameter group for manual workload management. With automatic workload management, Redshift determines the amount of resources that queries need and adjusts the concurrency based on the workload; in contrast, manual workload management requires you to specify values for query concurrency and memory allocation. One more thing I want to add here: workload management also has an option called SQA, short query acceleration, which lets short-running, lightweight queries complete even while long-running queries are active. But Redshift has many options for improving performance; we could talk for hours about how to use workload management, about making use of concurrency scaling, and there's data distribution as well, with distribution keys, and compression of data, which decreases I/O, all of which you can read about.
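To make the distribution and compression point concrete, here is a sketch of a table definition that sets these properties explicitly; the table and column names are illustrative only:

```sql
-- Distribute rows by the join key so matching rows land on the same node,
-- sort by time so range filters scan fewer blocks, and compress columns
-- with AZ64 encoding to cut I/O:
CREATE TABLE sales_fact (
    salesid   INTEGER      ENCODE az64,
    eventid   INTEGER      ENCODE az64,
    saletime  TIMESTAMP    ENCODE az64,
    pricepaid DECIMAL(8,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (eventid)
SORTKEY (saletime);
```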
Awesome. Speaking of data, given that it is so important and such a strategic asset, I noticed when we created the cluster that I didn't see an option for backups or backup scheduling. Can you maybe cover backups briefly?

Yes, and this is something that's super special about Redshift: it replicates all of your data within your data warehouse cluster as it is loaded, and also continuously backs your data up to S3. What this means is that there are at least three copies of your data: the original and a replica on the compute nodes, and a backup in Amazon S3. You can also configure it for disaster recovery by replicating your snapshots to S3 in a different region. When you need to restore a backup, you have access to all the automated backups within your retention window, and once you choose a backup to restore, we provision a new data warehouse cluster and restore your data into it. I can quickly show you how easy this is, and we can also take a look at the resize we kicked off earlier. Back on the dashboard we open our Redshift cluster, and as you can see we're now looking at the six nodes we resized to earlier. On the topic of backups and restores, there's already one automated snapshot here, taken just now; we click on it to open that screen. To restore it, we just select it, choose "Restore from snapshot", and we get configuration options similar to the create screen: a name, the node type and size, the number of nodes, the database configurations, and then we click the Restore option.

Neat. You mentioned the automated backups being stored; what would happen to those backups if I delete my data warehouse cluster?

Automated snapshots are deleted along with the data warehouse when you delete it, but you also have the ability to specify that a final snapshot is created upon deletion, which enables a restore of the deleted data warehouse cluster at any stage. I can show it to you here: we simply select the cluster, click Actions, then Delete, and there's the option to tick "Create final snapshot" before deleting the cluster.

Cool, I think that's a nice bit of extra protection. So now that I have a pretty good understanding of Redshift, and I hear you, it's a pretty amazing service, to take advantage of it, what's next? How would I migrate?

I'm glad you asked that question, and I am prepared for it; I actually have some slides on this topic. When we talk about the customer migration journey, the first step is always the most challenging. Sometimes we find that organizations get paralyzed because they can't figure out how to move every last workload, but there's no need to be overwhelmed: an iterative approach is almost always the right one. We often work with organizations to do a portfolio analysis, to assess each application and then build a plan for what to move in the short, medium, and long term. Beyond that, the AWS ecosystem is broad, and it includes tools and technologies to cover nearly any solution you want. We've designed AWS Database Migration Service (DMS) to be simple; you can get started in less than ten minutes. Then there's also the AWS Schema Conversion Tool, which converts your commercial database and data warehouse schemas to open-source engines or AWS-native services such as Aurora or Redshift. I think the real power of the solution becomes apparent when you realize that you can move between database engines and between data warehouses, and move away from expensive commercial databases to cloud-native and open-source solutions. And don't worry, at AWS we don't believe in vendor lock-in, so you can use DMS just as easily to move data out of the cloud as into it. I think that is a good place to round up, so thank you so much for your help today, Peter. I really hope that I've answered all of your questions and that we can conclude that Amazon Redshift is in fact your state-of-the-art cloud data warehouse solution.

Wonderful. Thank you so much, everyone, for your time; I hope the session was informative, and I certainly learned a lot. Please remember to fill out the survey at the end of the session. Alpana and I will stay online now to answer any questions you may have. Thank you, and thanks so much!
Info
Channel: Cloud Architects in Africa
Views: 1,477
Id: QwP_B9t4_-Q
Length: 39min 15sec (2355 seconds)
Published: Wed Aug 25 2021