Amazon Redshift Masterclass

Video Statistics and Information

Captions
Hello, and welcome to this AWS webinar. My name's Ian Massingham, I'm a Technical Evangelist with Amazon Web Services based in Europe, and I'm going to be hosting today's session, which is a Masterclass series webinar focusing on Amazon Redshift, our fast, petabyte-scale data warehouse service. The Masterclass series of webinars is intended to provide a deep dive into specific AWS services. It's complemented by another series we run called Journey Through the Cloud, which is a solutions-orientated look at how you can apply AWS services to solve particular use cases or deal with different business challenges you might face. You can find links to both webinar series in the links panel in the webinar interface if you're watching this live, or in the video description if you're watching on the AWS webinars YouTube channel.

If you're watching live you can also see a files panel, where there's a PDF of the materials from today's session. I'd advise downloading that: you'll find there are many links on quite a few of the slides we're covering today where you can find further information, as well as links to relevant documentation on the topics we'll cover. You can also find the materials on SlideShare, and if you're watching on demand you'll find a link to the relevant SlideShare presentation in the video description on YouTube.

The other thing you can see only in the live interface is a Q&A panel. We're unlikely to have time to answer questions at the end of today's session, so if you do have any questions please submit them using that Q&A panel and I'll make sure a member of our solutions architecture team follows up with you by email over the course of the next couple of weeks. We do answer all questions from all webinars, so don't worry if we don't get to yours during today's session; we will cover it after the event. I'll also show you a couple of social media links at the end of today's session where you'll be able to stay up to date with AWS globally and with AWS here in the UK and Ireland, so have Twitter ready for that. And at the very end of the session we'll put the webinar into a Q&A and feedback mode where you'll be able to give us a rating from one to five, with five being the best, let us know how we've done, and help us optimize this webinar for future audiences.

I'd also like to say that we present both the Masterclass series and the Journey Through the Cloud series in conjunction with Intel, a great AWS technology partner, and we'd like to thank Intel for their support in helping us run this webinar program during 2015. As I said, if you're interested in the webinars we've already recorded so far this year, take a look at the series pages in the links panel and catch up with what we've already covered over the course of 2015.

OK, that's enough housekeeping; let's move on to today's content. This is a Masterclass series webinar: a technical deep dive that goes beyond the basics on one specific AWS service. It's intended to educate you on how to get the best out of the service we're covering, show you how things work, and illustrate how you can get things done with it, and today we're covering Amazon Redshift.
It's a fast, fully managed, petabyte-scale data warehouse that runs in the AWS cloud. It makes it simple and cost-effective to analyze all your data, you can continue to use existing business intelligence, ETL and visualization tools, and it's a very scalable service that enables you to start small and scale well beyond a petabyte with current limits. It costs less than $1,000 per terabyte per year to operate, which is roughly one tenth the TCO of a traditional on-premises data warehouse, so it's a very cost-effective way to get access to high-performance, high-quality analytical services you can use to turn your organization into a data-driven enterprise.

As you might expect, the fundamental characteristics of the service are optimized precisely for the data warehousing use case. It's capable of providing very high query performance on data sets ranging from 100 gigabytes to well over a petabyte. It uses modern data warehousing architectural approaches such as columnar storage, data compression and zone maps to reduce the amount of I/O needed to perform queries, and can offer significantly better performance than trying to put OLTP-optimized databases to use for this use case. It provides an MPP, or massively parallel processing, architecture, which enables the system to parallelize and distribute SQL operations, taking advantage of all the resources available in your cluster, and the underlying hardware itself is designed for high-performance data processing, using features such as locally attached storage to maximize throughput between CPUs and drives, and a 10 Gigabit Ethernet mesh network to maximize throughput between nodes.

Like all AWS services, you can control Amazon Redshift using the AWS API. You can use this to very easily change the number or type of nodes in your data warehouse in accordance with your capacity needs. There are also a couple of different node types, which we'll cover in more detail during today's session, that allow you to optimize around I/O and CPU performance or around capacity. You can also resize clusters on demand, as I'll show you towards the end of today's session; this is a non-disruptive operation that allows you to seamlessly migrate data between an old and a new cluster with different dimensions without interrupting query access while you do it.

It's also a cost-effective service. Once again, like all AWS services, you can choose on-demand pricing with no upfront costs and no long-term commitments, or you can optimize your costs significantly by using the discounting available with reserved instance pricing. I'm not going to cover pricing in detail today, but if you want to learn more, visit the Amazon Redshift pricing page, which I'll show you towards the end of the session.

It's a simple service that enables you to get started very quickly, and it's fully managed. That management also covers fault tolerance: if nodes in your cluster fail, we transparently replace them for you, and we take steps to protect your data both with replication within the cluster and with continuous, automated backups. I'll show you the process for restoring from backup later in today's session. There are encryption features, so you can take advantage of several different AWS encryption services, and there's network isolation, so you can place your Amazon Redshift cluster inside an Amazon VPC, or Virtual Private Cloud.
You can connect it to your own existing IT infrastructure using an industry-standard IPsec VPN or our Direct Connect service. Lastly, for compatibility, it's a SQL data warehousing solution that uses industry-standard ODBC and JDBC connections. You can download our custom JDBC and ODBC drivers from the Connect Client tab in the Redshift console, and from there you'll be able to use popular business intelligence, ETL and visualization tools, many of which have been certified by their vendors to provide full support for Amazon Redshift. There's integration with other AWS services as well as these industry-standard connectivity mechanisms, enabling you to parallel-load data, for example from S3, DynamoDB, or other AWS data sources you may be taking advantage of. So it's a very, very capable service.

It's also a service that's been around a little while. We launched Amazon Redshift in November 2012, and this is the original post from Jeff Barr's blog announcing the launch, which also explained that customers like Netflix, NASA JPL and Flipboard had been testing Redshift as part of a private beta; in November 2012 we launched a limited public beta for the service. So we've been in operation with Amazon Redshift for coming up to three years now. At the time it was launched it was the most rapidly adopted AWS service in history, and that rapid adoption has enabled us to iterate on the service pretty rapidly and pretty successfully since 2012. We'll come back to the most recent feature, user-defined functions, towards the end of the session, but if you're interested in keeping up with enhancements to Redshift over time, you may want to bookmark the URL you can see on this slide, which is the Amazon Redshift category on the official AWS blog. You'll find there are many, many updates to the service as we add new features that have been requested and prioritized by customers. And of course, if there's something you feel is missing from Redshift today, do contact us via your Solutions Architect or your Account Manager and let us know how we can improve the service for you and make it more suitable for your specific use case; we're always interested in feedback on how we can improve existing services.

So what do we have for you today in terms of the agenda? First of all, we're going to talk about why you might run your data warehouse on AWS as an alternative to the other options that are, of course, available to you. Then we'll talk about getting started, a little bit on table design and data loading, and how you can work with data and what kind of tools you can use for that. Then we'll talk about backup and restoration, and upgrading and scaling, and we'll close out with some resources you can use to continue to build your knowledge after today's session.

So why run your data warehouse on AWS? Well, there are parallels with traditional database management systems that you might run on your own premises, on IT infrastructure that you operate. Data warehouses operated in that way are typically expensive; they're difficult and time-consuming to administer; and they're challenging to scale, and scaling these systems often requires very expensive upgrades as well as a significant investment in your own resourcing or in consulting resource to scale up and replace data warehouses as they grow.
And fourthly, there's quite a bit of lock-in with these systems. Part of that is to do with cost: customers are often reluctant to change because of the significant capital investment involved in acquiring, implementing and operating databases, and more specifically data warehousing systems. We were trying to tackle some of these challenges when we created Amazon Redshift, in response to customers asking us to deliver data warehousing in a way that's consistent with the way we deliver our other services: easy to provision and easy to scale, eliminating upfront costs, available on a pay-as-you-go model, with of course the opportunity to optimize costs further if you want to make a resource reservation with reserved instances; offering really fast performance; and being open and flexible, with support for popular BI tools, enabling you to get your data in and out of Amazon Redshift in a very simple and familiar way.

The architecture of the system is such that Amazon Redshift parallelizes and distributes all operations that take place within the cluster. The cluster comprises a leader node, where you make your JDBC or ODBC connection, and the leader node then coordinates activity across a cluster of compute nodes. These compute nodes, as we said earlier, are connected together by a 10 Gigabit Ethernet mesh and take advantage of local storage to deliver optimal I/O throughput. We also provide an automation framework that enables you to query, load, back up, restore and resize both the cluster and the data stored within it in a very seamless, very simple manner; we'll cover many of these aspects in more detail during today's session.

On scaling, you have several options. Firstly, there are different types of compute nodes you can use to build your Amazon Redshift cluster. There are the DC1 nodes, which are dense compute nodes available in two sizes. The DC1 Large nodes have 15 GiB of RAM, 2 virtual cores, 10 Gigabit Ethernet interconnect and 160 GB of solid-state capacity per node, so with a single node you get 160 GB of SSD capacity, and by scaling a cluster from 2 to 32 nodes you can take advantage of between 320 GB and just over 5 TB of solid-state capacity. Then there are the DC1 8XLarge nodes: here you have much more memory, 244 GiB per node, and 32 virtual cores per node, and the cluster sizes are significantly larger as well, from 2 to 128 nodes, with 128 nodes giving you up to 326 TB of solid-state storage capacity within your cluster. You'll find these are optimized for smaller data sets and higher performance in terms of data throughput off those drives; solid-state drives, by virtue of their fundamental characteristics, offer higher performance and lower-latency I/O response. If you need a larger cluster capable of holding more data, so working with larger data sets, you have the DS2 8XLarge nodes. These are equipped with rotational magnetic storage rather than solid-state drives, still carrying 244 GiB of RAM, but here with 36 virtual cores to work with, 10 Gigabit Ethernet interconnect of course, and 16 TB of magnetic hard disk capacity per node. You can cluster between 2 and 128 nodes in this instance, which delivers up to 2 petabytes of magnetic hard disk capacity for your cluster.
So for working with very large data sets, the DS2 8XLarge is an excellent answer to that question. As I said earlier, we've done some work to provide an automation framework that makes it easier for you to run and operate Amazon Redshift clusters. It includes features like built-in security, so data is encrypted in transit, at rest and when backed up; backup to Amazon S3 that is continuous, incremental and automatic (we'll show you how to restore from backups later in today's session, and the streaming restore methodology that's now used allows you to resume querying more rapidly when restoring a cluster from backup); and we also handle failure modes, both disk failures and node failures, recovering nodes and replacing drives automatically so that you don't have to take any action in those failure scenarios.

The typical architecture has your clients communicating with your JDBC or ODBC endpoint, and then, transparently in the background, the Amazon Redshift cluster is being protected and data is being stored automatically on Amazon S3 with the familiar eleven nines of durability that offers. You also have the facility to keep your data in the specific AWS region you specify: we won't move data into another region unless you instruct us to do so using one of the tools we make available. This, of course, is a mechanism for ensuring you're not moving data outside of regions in a way that might breach regulatory or compliance obligations placed upon you. So the standard AWS security model applies to Amazon Redshift just as it does to any other service we provide.

Redshift uses some techniques that have become more common now in columnar MPP data warehouses, but which were pretty revolutionary when they were first invented: storing data in a columnar-orientated format, using compression, and using zone maps. Zone maps are a feature whereby each node knows the range of values stored upon it, so it only needs to respond to queries that ask for data present on that node, and this can dramatically improve query performance by minimizing I/O. I/O is also accelerated through the use of direct-attached storage, and for large reads through the use of large data block sizes. The whole system is optimized for read-intensive, scan-based workloads of the type that are typical when performing queries on data.

The fact that the system is built specifically for query-based workloads has led to some really good results for customers. There's a super quote here from John O'Donovan, the CTO at the Financial Times, talking about implementing Amazon Redshift as a way for them to increase the speed, performance and flexibility of data analysis. When implementing Redshift they achieved a 98% improvement in their query runtimes; in fact, performance was so good that John and his team at first glance thought the system wasn't functioning, because it was so quick. They ran the system in parallel with their pre-existing warehousing solution and verified that the queries were performing correctly; they were just performing extremely fast, because of some of those optimization techniques I talked about a few minutes ago. They were also able to reduce their infrastructure costs by over 80 percent in comparison to a traditional data-center-based model.
So it's a really strong success story for the FT, and at the end of the session I'll show you a link where you can learn more about this case study, and about other customers that have chosen to run data warehouses on Amazon Redshift.

Let's move on now and take a look at getting started. Using the AWS console to create a Redshift cluster is very simple. In the typical AWS console view you'll find that Redshift, the sub-console for this specific service, is organized under the Database category. If you click on that you'll be taken to one of two views: if you've never created a Redshift cluster before, you'll see a page that tells you a little bit about the service and gives you access to some resources you can use to learn more about how you might make use of Amazon Redshift; if you have used the service before, you'll see the Redshift cluster dashboard. In either case you're going to click the Launch Cluster button, and you'll then be taken into the familiar AWS wizard-based walkthrough to create a new service instance.

Here you have to populate some information about your cluster: a name for the database you wish to use, the database port you wish to use, a user name, and a master user password that you're asked to set. Enter those fields and click Continue, and you can then specify the dimensions of your cluster by choosing your node type, selecting whether you want a single-node or multi-node cluster, and specifying the number of compute nodes you wish to comprise your cluster. Enter those values, click Continue once again, and you'll then be asked to enter additional configuration details. This is where you can set security parameters for your Redshift cluster, for example specifying how you wish to have your database encrypted. You'll typically make use of the AWS Key Management Service here, or CloudHSM, a hardware security module, if you have one configured and in use in your account in the specific region you've selected. KMS and HSM are beyond the scope of this webinar; if you want to learn more about them, visit the AWS Security Center at aws.amazon.com/security, where you can find information about both of those services.

You also configure networking options here, choosing which VPC you wish to deploy the service into, which will place it into a specific subnet and also into a specific Availability Zone when you select the subnet group you wish to deploy into. You can specify whether you want the cluster to be accessible from the public internet or only from within your private VPC; you can specify a public IP address if you wish, or not, if you want to keep the address privately routable only; and then you can specify the Availability Zone you wish to place the cluster in and the VPC security groups you wish to associate with it. These VPC security groups are important because they're what allow your clients to communicate with your cluster, so you're going to set a security policy on those security groups that allows a set of client hosts to communicate with your cluster and exchange traffic with it. At this stage you can also create and configure a CloudWatch alarm.
If you select that option you'll get an additional drop-down that allows you to specify a disk usage alarm for your cluster, publishing it to an existing SNS topic or sending yourself an email when the cluster reaches the particular disk usage threshold you've set. It's a good idea to do that: if you're going to load data into your cluster, it will enable you to manage scenarios where you run out of cluster storage and potentially need to resize your cluster.

If you hit Continue at that point, you'll be asked to review what you've specified. You'll get a warning here if you're not eligible for the free trial, and the review will specify what your hourly on-demand rate for the cluster will be. It also offers some guidance on where you can find documentation about reserved nodes in order to optimize your costs: if you're going to run a cluster and you're at all concerned about optimizing your costs, check out the reservation options available to you, and also make sure you're familiar with the Amazon Redshift pricing that's linked at the bottom of that screen. Once you hit Launch Cluster, a pop-up tells you that your cluster is being created, and if you close that you'll be taken to the clusters dashboard, where you can see the status of your cluster as it's being created. You can also click on the cluster name at this point to see additional information about the cluster while it's under creation. After a few minutes, and it typically takes around 10 minutes to create an Amazon Redshift cluster, the cluster status will switch to available, the DB health will switch to healthy, and it will show that it has come out of maintenance mode. If you click on the cluster again at this point you can expand the cluster settings and see your cluster and database properties, and at that point you're ready to start loading data using familiar JDBC and ODBC connectivity from the regular ETL or data management tools that you use.

So it's very, very simple to set up and very, very quick. It's actually a good example of an abstraction layer that makes it much quicker to deliver a useful service on AWS than it is to build something similar through more traditional approaches, where you might procure software and hardware and build the service yourself. I often describe Amazon Redshift as a great starting point for customers that aren't familiar with the speed benefits you can get from the AWS cloud, because it is so quick and simple to provision a cluster using this console-based interface; and of course that's also available via the CLI and via the SDKs that we provide.

Some additional pointers on getting started: first of all, there's an excellent Getting Started guide for Amazon Redshift that you can find if you visit the URL at the bottom right of this slide. There's also a two-month free trial for a set cluster size that you can take advantage of. The guide includes a very good getting started tutorial, an excellent system overview, guides for table design, data loading and query design (which we're going to cover in a little more detail in a few minutes), a description of how you can connect to your database using industry-standard connection protocols, and some really good pointers to BI and ETL vendors that have certified their tools for use with Amazon Redshift.
Getting Started guides for AWS services are generally pretty good, but the one for Redshift is, in my view, one of the best, so I really would advise that, if you have any interest in making use of the service, you spend a few minutes familiarizing yourself with that guide.

We're going to move on now and take a look at table design in Amazon Redshift. There are three topics we'll cover here: firstly compression and compression encodings, then data types, and lastly distributing data across nodes and sorting data to optimize performance. A data warehousing system has very different design goals compared with traditional transaction-orientated RDBMS systems: OLTP applications are typically focused on single-row transactions, inserts and updates, but data warehouses like Amazon Redshift are optimized along a different vector, which is fast execution of complex analytical queries against very large data sets. Because of the massive amount of data involved in this kind of workload, you must specifically design your database to take full advantage of the available performance optimizations, which is precisely what we've done with Amazon Redshift.

The first important performance optimization is compression. This is a column-level operation that reduces the size of data when it's stored, and that has two effects: firstly it conserves storage space, and secondly it reduces the amount of data that is read from storage, which can dramatically reduce disk I/O and therefore dramatically improve query performance. You can apply compression encodings to columns in tables manually, based on your own evaluation of the data, and that's an option open to you, but you can also use the COPY command provided by Redshift, which will analyze and apply compression automatically, and we strongly recommend using the COPY command to apply automatic compression. You can use automatic compression when you create and load a new table, in which case the COPY command performs the compression analysis for you. You can also perform a compression analysis without loading data or changing the compression of a table by running a command called ANALYZE COMPRESSION against an already populated table; you can run this when you want to analyze the compression of a table for future use while preserving the existing data definition language for that table.

On data types: there are many different data types supported by Amazon Redshift, and one worth calling out is VARCHAR, which supports multi-byte characters, but only up to a maximum of three bytes; four-byte or longer characters are not supported, which is worth being aware of. If you want to learn more about supported data types in Amazon Redshift, visit the URL you can see at the bottom right of this slide for a more comprehensive description of the data types available to you.
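To make that concrete, here's a minimal sketch of the two compression approaches just described. The table and column names are hypothetical placeholders, not taken from the webinar slides:

    -- Option 1: set encodings by hand in the DDL, based on your own analysis of the data
    CREATE TABLE customer (
      customer_id BIGINT       ENCODE delta,
      name        VARCHAR(100) ENCODE lzo,      -- VARCHAR stores multi-byte characters up to 3 bytes each
      country     CHAR(2)      ENCODE bytedict
    );

    -- Option 2 (recommended): load the empty table with COPY and let it apply automatic
    -- compression; the analysis can be forced with the COMPUPDATE ON option on the COPY command.

    -- Option 3: get encoding recommendations for an already-populated table
    -- without changing its existing DDL:
    ANALYZE COMPRESSION customer;

Note that COPY only applies automatic compression when the target table is empty, which is why ANALYZE COMPRESSION is the route for tables that already hold data.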
When you load data into Amazon Redshift, table schemas ideally need to be expressed in third normal form, and we're going to talk a little now about how schema and data distribution interplay with one another. In distributing data it's important to remember that Redshift is a distributed system. It contains a leader node, which contains the query planner, schedules queries and aggregates results, but the queries themselves execute across the compute nodes, and each compute node contains slices, one slice per core. A slice is a logical container for data, and data is distributed across slices in one of two ways: round robin, which is the default, or on a distribution key, where you define a column, a hash is computed, and data is distributed in accordance with that hash. This is very important, because queries run on all slices in parallel, and optimal query throughput can only be achieved when data is spread as close to evenly as possible across slices. Ideally, you want queries executed on cores working on data that is local to them.

Without an optimized distribution, the distributed nature of the system can cause significant performance overhead. Here's a scenario where we've round-robin distributed item details across slices, but our order details have been distributed independently: we have no distribution key, we're using the default distribution, which is unoptimized. You can see that if we want to join orders with the items relevant to those orders, we'd have lots of internode traffic: for example, for order 1 on node 1 we'd have to make a sub-query into node 3, getting data from the second slice there about item 1.1 and joining it, and also getting data from the second slice on node 2 and joining that into the query relating to the sub-items on each order, to help us do the calculation you can see at the top of this slide. With an optimized distribution, we store the sub-items for each order adjacent to them, because we've used a customized distribution: the distribution key we're using is the item's order ID, and that enables us to co-locate the item records with the order records on the same nodes. This leads to dramatically improved performance, because there's zero internode traffic in this model; everything is intra-node on the compute nodes themselves, and only the aggregate data is passed back to the leader node. Much, much improved performance.

How do you choose distribution keys? You want to look for data that's frequently joined by the most commonly run queries, and you can also look at the queries that consume the most CPU and use that as a mechanism for identifying how you should distribute data. You want high cardinality, so keys that have a large number of discrete values, because we're using hashing here and we want a good hash distribution; and you want low skew, so you don't have hotspots. You can query STV_BLOCKLIST to identify the skew factor in a data set.

On sorting: here we're talking about sorting data by a sort key, and you should try to choose a sort key that is frequently used in your queries, either as a query predicate, like a date range or an identifier, or as a join parameter; it can also be the hash key. The sort key allows Redshift to use those zone maps to avoid reading entire blocks, based on predicates. For example, if you had a table with a timestamp sort key and you were looking for a recent date range, you'd entirely skip reading any blocks containing old data, because the zone map shows that those blocks don't include data relevant to your query. Again, this can minimize I/O and internode traffic and dramatically improve query performance, so sorting is critically important for query performance optimization as well.
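As a rough sketch of how that co-location looks in table DDL, here's a hypothetical pair of tables (not the schema from the slides) that share a distribution key, plus one way to check how evenly a table's blocks are spread across slices using the system tables mentioned above:

    -- Hash both tables on order_id so items land on the same slices as their orders,
    -- and sort orders by date so zone maps can skip blocks outside a queried date range
    CREATE TABLE orders (
      order_id    BIGINT NOT NULL,
      order_date  DATE   NOT NULL,
      customer_id BIGINT
    )
    DISTSTYLE KEY
    DISTKEY (order_id)
    SORTKEY (order_date);

    CREATE TABLE order_items (
      order_id BIGINT NOT NULL,   -- same distribution key as orders
      item_id  BIGINT NOT NULL,
      quantity INTEGER
    )
    DISTSTYLE KEY
    DISTKEY (order_id)
    SORTKEY (order_id);

    -- One way to look for skew: compare per-slice block counts for a table
    SELECT b.slice, COUNT(*) AS blocks
    FROM stv_blocklist b
    JOIN stv_tbl_perm t
      ON t.id = b.tbl AND t.slice = b.slice
    WHERE TRIM(t.name) = 'order_items'
    GROUP BY b.slice
    ORDER BY blocks DESC;

A large spread between the busiest and emptiest slices in that last query is the kind of hotspot the speaker warns about.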
Lastly, on schema design: you want to optimize your data for querying, which is pretty obvious for query-orientated workloads. You want to co-locate frequently joined tables using those distribution keys, to avoid data transfers between nodes as we said, and you want to use sort keys on joined columns, allowing for fast merge joins and the use of zone maps. Compression also allows you to denormalize data without a heavy overhead or penalty in incremental storage, letting you simplify queries and limit joins.

Here's an example of query optimization on that data set we talked about before. We're going to denormalize commonly joined tables into one large fact table, in this case bringing inquiries, inquiry airports and inquiry airport groups together into one large fact table in a star schema, using inquiry ID as our distribution key and our sort key. We're then going to cache the other tables using DISTSTYLE ALL, which is an attribute you can set when creating a table. If you specify DISTSTYLE ALL when creating a table, a copy of the entire table is distributed to every node in your cluster. That's a great help in boosting performance for small tables that you want cached locally: all the rows in these tables will be available on every node for any join. Obviously this increases storage requirements, and it also increases load and maintenance time for the table, but it can significantly improve execution time when used with certain dimension tables where key distribution is not appropriate. You need to weigh the performance improvements against the maintenance and storage overhead, but it's good for small tables that sit around the edge of your star schema and are frequently looked up and joined when executing queries against the large fact table that forms the core of the schema, so you might want to consider that as an option.

The last thing to talk about on table design is not so much about design as about maintenance: the vacuuming operation, which needs to be used to clean up tables after a bulk delete, a bulk load, or a series of incremental updates. This is quite important: Amazon Redshift does not automatically reclaim the space that is freed when you delete or update rows. To perform an update, Amazon Redshift deletes the original row and appends the updated row, so every update is effectively a delete followed by an insert. Rows are marked for deletion but not removed, and the query processor will scan deleted rows as well as those that are undeleted, so if you have too many deleted rows you're going to cost yourself overhead cycles and unnecessary processing. You can resolve this by vacuuming tables, or even the entire database, when you've completed a significant number of updates or deletes: this will reclaim space, purge those deleted rows, and improve query performance. It's an important aspect of operating clusters if you're going to run them for a long period of time and make significant updates or deletes to them.
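For illustration, a small replicated dimension table and the maintenance commands just described might look like the following sketch; the table names are hypothetical, not the schema on the slides:

    -- A small, frequently joined dimension table: keep a full copy on every node
    CREATE TABLE airport (
      airport_code CHAR(3) NOT NULL,
      airport_name VARCHAR(200)
    )
    DISTSTYLE ALL;

    -- After a bulk load, a big delete, or a heavy run of updates on the fact table:
    VACUUM inquiries;    -- reclaim space left by deleted rows and re-sort the table
    ANALYZE inquiries;   -- refresh the statistics used by the query planner
    -- VACUUM;           -- or vacuum the entire database in one go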
If you want to learn more about table design, there are some excellent resources in the Redshift documentation, which is a really comprehensive documentation set: the table design docs you can see there, and a table design tutorial. The table design tutorial is great, actually; it takes you through an example of how you can optimize table design. In the tutorial you start by creating tables based on the Star Schema Benchmark schema without sort keys, without distribution styles and without compression encodings; you then load the tables with test data and test system performance; then you apply best practices, recreating the tables using sort keys and distribution styles, reload the tables with test data using automatic compression, and test performance again. If you do this you'll be able to see first-hand the performance benefits of well-designed tables. The whole tutorial costs about $1 and takes about an hour to run, so I really would recommend that, if you've not already worked through it as part of your experimental work with Amazon Redshift, you pick that tutorial up and take advantage of it. It will get you hands-on with the system with a real data set and allow you to see in practice some of the benefits you can achieve through these optimization techniques.

The next thing we're going to talk about is data loading. To give you a quick overview, there's a very rich ecosystem of AWS services you can take advantage of to help you load data into Amazon Redshift clusters. We have integration with many AWS persistence services, like Amazon RDS, or Oracle or MySQL that you might be running on instances you manage yourself, data volumes, Amazon Elastic MapReduce, and Glacier, our low-cost archiving service. You can also bring data in from corporate data center resources that may be running on your own premises or in colocation facilities, using a variety of different options: Direct Connect, which is our low-latency, high-bandwidth direct connection service, enables you to connect your data center or network directly to an AWS region using 1 Gigabit, 10 Gigabit, or fractional bandwidth provided by partners, so if you're regularly loading data into Amazon Redshift that could be a good option for you. You can also use a VPN connection, which terminates inside your VPC on a VPN gateway and which you can connect to from your own site; that allows you to secure the transportation of data across the internet using encryption and authorization techniques. And then there's the S3 public API endpoint, which you can write to using S3 multipart upload; it's a very high-performance and effective way to get data into Amazon S3, from where you can then load it into Redshift.

We also have several options for import/export, which is a mechanism for getting data into AWS using physical media: either the AWS Import/Export Disk service, where you can send media of your choice to many different AWS regions around the world and we will load the data on those media into S3 and make it available in your account; or, in North America, you can take advantage of AWS Import/Export Snowball, which allows you to import up to 50 terabytes of data with a single device. We provide the device, which is a ruggedized data transport appliance, and you can use it, currently in the US East (us-east-1) region, to load data into Amazon S3 at really high scale, very rapidly.
So if you've got large data sets you wish to move into Amazon Redshift, those AWS Import/Export options can be a good solution. You can also share data the other way: you can dump data from your Redshift database, storing it in encrypted CSV files, upload that data to S3, and share it with other AWS accounts, so you can take data out of Redshift as well as putting it in, and S3 is a common mechanism for both.

As we've already said, for best performance it's an excellent idea to split your data into a number of smaller files, which you can then parallel-load using the COPY command across all the available nodes in your cluster. Here you can see we're using file suffixes: we're copying into the customer table from S3, providing a prefix, and all files that have that prefix will be evaluated and loaded into Amazon Redshift in parallel. We're providing a set of credentials that have appropriate access to the Amazon S3 endpoint, so that we can read data from that specific bucket and those specific objects, and then we're specifying a delimiter. There are actually many good examples of different approaches to loading data, with worked examples covering many different data loading scenarios, at the URL you can see at the bottom left of the slide.
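Here's a rough sketch of that prefix-based parallel load, and of the corresponding UNLOAD for getting data back out to S3 as encrypted files. The bucket names, prefixes and credential values are placeholders, not from the webinar:

    -- Every object whose key starts with 'load/customer_' is loaded in parallel across the cluster
    COPY customer
    FROM 's3://my-example-bucket/load/customer_'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    DELIMITER '|'
    GZIP;            -- assumes the input files were compressed with gzip, as recommended

    -- Going the other way: dump query results to a set of encrypted delimited files on S3
    UNLOAD ('SELECT * FROM customer')
    TO 's3://my-example-bucket/unload/customer_'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>;master_symmetric_key=<base64-key>'
    DELIMITER ','
    ENCRYPTED;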
There are some issues you should try to avoid when loading data. Many of these are quite obvious, but it's worthwhile running through them anyway. Firstly, avoid mismatches: make sure that the data types in your tables and the values in your input data fields match each other, and make sure you have the right number of columns. Make sure you don't have mismatched quotes: Redshift supports single and double quotes, but they must be balanced appropriately, so a double quote must match a double quote and a single quote must match a single quote. Dates and times obviously need to be in the correct format; where you have numeric columns with range control, you need to make sure all entries are within the valid range; and you may have an issue where the number of distinct values for a column exceeds the limitation of its compression encoding, which is another reason it's a good idea to apply automatic compression encoding via the COPY command.

You can also load data using direct SQL. All direct SQL commands go via the leader node, so this is not a parallel load model, but you can use it, for example, for joining staging tables with pre-existing target tables; in this scenario it will update where rows exist and insert where no rows exist. There are performance limitations here because you're relying on the leader node, so it should not really be used for large-scale data loading operations, but it is available to you; for example, if you wanted to load some of those smaller tables we described as sitting around the outside of the star schema, that might be a useful way to do it.

I recommend loading large data sets using the COPY command, as you can see here; it's definitely best practice. Use a single COPY command, which will allow you to parallel-load rather than loading in series, and compress your data files with gzip, once again to optimize the speed at which data files can be ingested by the nodes. If you cannot use COPY, use multi-row inserts in SQL, as we've described already: bulk insert operations provide higher-performance data insertion than single-row inserts. Some further best practices: load your data in sort key order to avoid the need to vacuum at the end of the process; organize your data as a series of time-series tables, where each table is identical but contains data for different time ranges, so they can be loaded in parallel by different nodes in your cluster; use staging tables to perform an upsert, as shown in the sketch below; run the VACUUM command whenever you've added, deleted or modified a large number of rows, unless you load your data in sort key order; you can increase the memory available to COPY or VACUUM; and run the ANALYZE command whenever you've made changes to your data, to ensure your table statistics are current. Follow those best practices, and note that there's further best-practice guidance in the documentation for the service.
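Since Redshift (at the time of this webinar) has no single upsert statement, the staging-table pattern mentioned above usually looks something like this sketch; the table names, bucket and credentials are hypothetical:

    -- Load the fresh data into a temporary staging table with a parallel COPY
    CREATE TEMP TABLE customer_staging (LIKE customer);

    COPY customer_staging
    FROM 's3://my-example-bucket/updates/customer_'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    DELIMITER '|' GZIP;

    -- Merge: delete the rows being replaced, then insert the new versions, in one transaction
    BEGIN TRANSACTION;

    DELETE FROM customer
    USING customer_staging
    WHERE customer.customer_id = customer_staging.customer_id;

    INSERT INTO customer
    SELECT * FROM customer_staging;

    END TRANSACTION;

    DROP TABLE customer_staging;

Remember that the deletes leave rows marked for deletion, so a VACUUM after a large merge is part of the same workflow.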
So once you've got your data into Amazon Redshift, how do you go about working with it? There are several different usage scenarios here, but a common one is staging data that you might be creating in an online transaction processing ERP system: you store that information in an RDBMS on your premises, and this is quite a common use case, though by no means the only one. Another common flow is unstructured data that is pre-processed and enriched using, for example, EMR, our Elastic MapReduce managed service running Hadoop in the AWS cloud. It's a common workflow for customers to use Hadoop to pre-process data and then load it into Amazon Redshift by creating files that they stage in Amazon S3 and load in using that COPY command. In the use case of data integration between existing RDBMS systems, it's quite common for customers to use one of the data integration partners from the Amazon Redshift ecosystem, and you can use VPC and encryption to make sure your data remains secure at all times if you do this. There are several different options provided by our data integration partners, and they're a good option here: these technologies work to connect operational data stores with Redshift, they make transporting data over distance very simple, and they make it easy to transfer data between different database types with little operational effort or overhead. You can find out more about these partners if you visit the URL you see there; we have a variety of different partners you might want to take advantage of for data integration.

Obviously, once you've got your data into the system, you're then interested in business intelligence and visualization of that data. This is where you'd take advantage of those standard JDBC and ODBC connectivity drivers we provide, and again there's a rich ecosystem of different partners available to help you, which you can find at the same URL. It's a very broad ecosystem of different types of provider operating to provide visualization services, and many of these BI and visualization tools are available through something called the AWS Marketplace. This is an e-commerce platform that provides ISVs with an opportunity to market and sell software products to AWS customers, and if you visit aws.amazon.com/marketplace/redshift you'll find a wide variety of Marketplace products, around 20 different products aligned directly to the Redshift use case. If you visit the business intelligence section on the Marketplace, you'll find over 200 different software products that have business intelligence features, so you have many different options open to you. Here are three of the options available under the analytics category, from Looker, from Tableau, and from TIBCO Jaspersoft, with a variety of different licensing options and a variety of different cost points. The great thing about the AWS Marketplace is that many of the ISVs with products listed there provide free trials, and many of these free trials can be bootstrapped directly from the Marketplace without having to speak to or make contact with the ISV in question. So it's an excellent way of trying out the different visualization and BI tools that are available, and I really would recommend that customers take a look at it.

We also recently announced, at AWS re:Invent just a couple of weeks ago, a brand new AWS business intelligence service called Amazon QuickSight. This is a fast, cloud-powered BI service that makes it easier to build visualizations, perform ad hoc analysis and quickly get insights from data. It integrates with AWS data stores like Amazon Redshift, as well as other stores like flat files and third-party data sources, and it provides an in-memory query engine that means it's possible to get your first analyses from your data in less than 60 seconds. It offers beautiful and dynamic data visualizations, sharing features so you can share analytics with people inside or outside your organization, the capability, as I said, to integrate with data sources on AWS, and it comes in at about a tenth of the cost of traditional old-guard business intelligence tools. I'm not going to say a whole lot more about Amazon QuickSight during today's session, but if you want to see the launch at re:Invent, and also see a demo from my colleague Matt Wood talking about the capabilities of QuickSight and illustrating them visually, visit the YouTube link at the bottom right of this slide, where you can see Andy Jassy's announcement of the service followed by Matt's demo from re:Invent just a couple of weeks ago. So that might be something you also consider as we make it generally available over the coming months.

Backup and restoration is a very important feature of Amazon Redshift, and the first thing to say is that backups to Amazon S3 are automatic, continuous and incremental, so it's not something you need to configure beyond setting the snapshot interval and the snapshot retention period: how often you wish to store a snapshot of your data, and how long you wish to retain those snapshots for. Those two variables, by the way, default to a one-day snapshot interval and a 35-day retention period, so your recovery point objective for the service is essentially 24 hours with the default settings. You can also create user-driven snapshots at any time: you simply freeze the cluster state, triggering a snapshot via the console or API, and that frozen cluster state will be retained unless you explicitly delete it, so you have control over how your cluster's backups are stored in that scenario. At any point you can restore a cluster from a snapshot and make it available for querying as it was at the time the snapshot was taken, and it's a very simple process. Here's an example of how you go about restoring from a snapshot: I'm looking at the clusters view in the Redshift console, and I'm going to click into the detail of the 'my-cluster' cluster that's running. At this point I have several drop-downs available to me.
I can click the Create Snapshot button, and I'm also given the option at this point to restore from a snapshot. If I click that Restore From Snapshot option, I'm asked to define the characteristics of the new cluster I wish to create; for example, I can restore onto a different node type here. I do have to change my cluster identifier when I do this restoration, so I'm going to modify the cluster identifier and restore the cluster from the snapshot with the name restored-my-cluster. I hit Restore at the bottom, and at this point my new cluster goes into creation and restoration mode. At the point at which creation has completed, but restoration is still under way, I can already begin to make queries against my newly restored cluster: this is where that background streaming of the restoration process comes into effect, and I can actually work with data in my cluster while it's still in the process of being restored. I'll be given a new JDBC/ODBC endpoint for the restored snapshot version of my cluster, which is distinct from the current cluster I have running, as you can see in the console. And that's how simple it is: a one-two-three-four-five step process to restore a cluster from a snapshot you've previously created. So cluster data replication, automated backups on S3 and node monitoring are all availability features intended to make the Redshift service durable and ensure it meets the availability level that you need.

OK, what about upgrading and scaling clusters? I did say at the top of the session that this is a very simple process, and essentially you can resize a cluster while it remains online. What happens when you resize is that your existing cluster is put into read-only mode; a new cluster is provisioned in the background; data is copied in parallel, node to node, in a seamless and transparent manner; and during this process you are only charged for the source cluster. At the point at which the new cluster is available, the SQL endpoint switches over via DNS, so your tools continue to work, but you'll now be looking at the new, larger cluster, and you'll be in read-write mode at that point. In the background we then decommission the source cluster, and charging flips over to the new cluster at the point at which the endpoint switches over.

Taking you through that in the console: once again we're in the clusters view, and if we click on our cluster drop-down you'll see there's a Resize option. If we select that, you'll see the resize cluster dialog, where you can change the node type, the cluster type and the number of nodes. Here I've increased my node count to eight, so I've gone from single-node operation to eight nodes, and I'm then into a transparent operation: I don't get a new endpoint in this scenario. In the background the cluster is switched into read-only mode, and you can view resize progress in the cluster status tab. Obviously the time it takes to resize a cluster will depend on the amount of data in each node; typically it takes from a couple of hours up to a day, and clusters with very large amounts of data may take even longer, because data is copied in parallel from each node on the source cluster to each node on the target.
There's much more information about resizing at the URL you can see at the bottom left of this slide, so do check out the cluster resizing documentation. At the end of the process the endpoint is switched over and the service is made available for use once more.

Let's take a quick look now at some recently announced features for Amazon Redshift, just to bring you right up to date. The first, which I trailed at the top of the session, is scalar user-defined functions. You can write these user-defined functions using Python 2.7, with syntax largely identical to the PostgreSQL UDF syntax. For security reasons we prohibit system and network calls within UDFs, but you can take advantage of the pre-installed Python modules pandas, NumPy and SciPy to perform analytic processing of your data in Python, and this processing takes place in parallel across your cluster, so like a regular Redshift query it will take advantage of the I/O, CPU and memory capacity of each of the nodes in your cluster. It's a very, very powerful feature, and there's much more on it in Jeff Barr's recent blog post, which you can find in the Amazon Redshift category on the official AWS blog. It also enables you to simplify data manipulation; there's an excellent example of working with URLs in data using UDFs in Jeff's post, so check that out if you're interested in user-defined functions in Python, and there's a small illustrative sketch of the syntax at the end of this section.

The announcement before that was the new DS2 instance family, which I already talked about at the top of the session when describing the different nodes you can use to construct your Redshift cluster. This was a pretty big upgrade, delivering twice the memory and compute power of the previous dense storage instances, which used to be called DW1, with the same storage capacity. This node type also supports enhanced networking and provides 50% more disks than the DS1, so it was a big performance bump for Amazon Redshift when we released it back in June this year. There are also details in that blog post on cost optimization using reserved instances, and I'd recommend taking a look at it if you're interested in getting up to date with reserved instance pricing for Amazon Redshift.
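As promised, here is a minimal sketch of what a scalar Python UDF looks like. The function, table and column names are made-up examples in the spirit of the URL-handling example in Jeff's post, not code taken from it:

    -- A scalar UDF written in Python 2.7; it executes in parallel on the compute nodes
    CREATE OR REPLACE FUNCTION f_hostname (url VARCHAR)
    RETURNS VARCHAR
    IMMUTABLE
    AS $$
        from urlparse import urlparse    # Python 2.7 standard library; no network calls are made
        return urlparse(url).hostname
    $$ LANGUAGE plpythonu;

    -- Used like any built-in scalar function (weblogs and page_url are hypothetical)
    SELECT f_hostname(page_url) AS host, COUNT(*) AS hits
    FROM weblogs
    GROUP BY 1
    ORDER BY 2 DESC;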
OK, some resources you can use to learn more about Amazon Redshift. The first, of course, is the Amazon Redshift product detail page; you can find many of the resources I've talked about during this session linked off that page, including the documentation set for Amazon Redshift, which is some of the best documentation of any AWS service in my view. There's the Getting Started guide we touched on earlier in the session, so please do check that out, and also check out the table design tutorial linked off that guide. I trailed it a bit earlier, but it really is excellent and well worth taking a look at: one dollar, one hour, and you learn hands-on how to optimize table design with Amazon Redshift. You can find the Financial Times reference I talked about, as well as many other examples of customers running analytics workloads in the AWS cloud, at aws.amazon.com/solutions/case-studies/analytics; there are around 30 or so different customer case studies there, well worth a look, and you may very well find someone with a very similar use case to yours who has already implemented it using Amazon Redshift.

There's also an awesome session from AWS re:Invent 2015, just a couple of weeks old, where we dive deep into Amazon Redshift tuning and best practices, with Eric Ferreira from AWS and Ari Miller from TripAdvisor. You can find that if you click the YouTube link at the bottom of the text on this slide, and if you do that you'll also see a playlist pop up in YouTube containing all the other big data and analytics sessions from AWS re:Invent, including additional content on Amazon Redshift, on building data pipelines that use Redshift, and on using EMR and other analytics services that have been built and used by customers on the AWS cloud. So I really would advise you to check out both this video and that playlist.

I'm going to ask my colleague to switch us into Q&A mode in just a second so you can rate us as we wrap up the webinar. Just to remind you, we are accepting your questions during today's session and we'll come back to you with feedback over the course of the next week or two if you submit a question, so if there's anything you'd like to know more about that we've covered today, or any other Amazon Redshift or Amazon Web Services topic that you're interested in, please ask a question using that Q&A panel and we'll get back to you. On training and certification: if you want to build your AWS knowledge more generally, check out our self-paced labs and training and certification options. The aws.amazon.com/training URL contains details of all AWS training offerings and is well worth a look if you've not already seen it.

That concludes all the content we have for today's session, so I'll ask my colleague to switch us into Q&A mode now so you can give us a rating. Please give this webinar a rating between one and five, with five being the best, and if you want to leave us qualitative feedback about how we've done, you can of course do that using the Q&A panel, as well as continuing to submit your questions. We're going to leave the webinar open for 10 minutes or so, to make sure we've got all of your questions today and to give you an opportunity to rate us. I'd also like to encourage you to follow AWS on social media: on Twitter you can stay up to date with us globally at @awscloud, and stay up to date with AWS news and education events here in the UK and Ireland at @AWS_UKI, and you can find me on Twitter as well. I'd like to thank you for giving up your time to join today's session; I do appreciate you taking the time to learn a little more about AWS, and I hope it was a useful session for you. Please keep submitting your questions, and I'll see you on another AWS webinar very soon. Thanks very much.
Info
Channel: AWS Online Tech Talks
Views: 74,667
Rating: 4.7175398 out of 5
Keywords: Amazon Web Services (Website)
Id: GgLKodmL5xE
Length: 60min 10sec (3610 seconds)
Published: Wed Oct 28 2015