Amazon EMR Masterclass

Captions
Hello, and welcome to this AWS webinar. My name is Ian Massingham, I'm a Technical Evangelist with Amazon Web Services based in Europe, and I'm going to be hosting this session today. We're back on our Masterclass webinar programme, and today we're focusing on Amazon Elastic MapReduce, a service for running managed Hadoop workloads in the AWS cloud.

Before we get started, I just want to thank Intel for sponsoring this webinar series, and in fact for sponsoring our other webinar series for 2015 as well, the Journey Through the Cloud. If you're interested in a solutions-orientated look at how you can apply a variety of different AWS services to particular use cases or business challenges that you or your organisation might be facing, you might like to take a look at the Journey Through the Cloud webinar series. You can find it in the links panel in the webinar interface, and also in the video description if you're watching this on YouTube.

Other things you can see in the live webinar interface: first of all, the materials download in the files panel. If you want to grab a PDF of the slides from today's webinar, you can find those in that files panel. Please do download them if you're joining us live; you'll find them very useful because there are quite a few links embedded in the slides that will take you to further information about many of the topics we're covering today. Whilst this is a deep-dive webinar, there's a lot of material associated with Elastic MapReduce and it's very hard for us to cover everything during the session, so make use of those links to find out more about many of the topics we'll be covering.

Also in the webinar interface, if you're joining us live, you'll see a Q&A panel. If you have any questions at all during the session today, I'd encourage you to submit them via that panel; we'll try to answer as many as we can at the end of the session. We're also going to experiment with something new following today's session: a virtual office hours for Amazon Elastic MapReduce. This will be another webinar that takes the form of an expansion on some of the questions we've been asked today, and it's also an opportunity for those of you with follow-up questions to ask those live, and we'll try to address them during that follow-up webinar. If you're interested, please let us know at the end of the session using the voting panel that you'll see, or via the Q&A panel, and if we have sufficient interest we'll run that office hours as a follow-up.

At the end of the session you'll also be asked to provide some feedback and rate us today: give us a score between five and one, with five being the best, and let us know how we're doing. At AWS we're very orientated around customer feedback, so it's really helpful for you to let us know particularly the things you think we could do to improve this session: topic areas you'd like to hear more about, or things you think we've gone into in too much detail. That helps us continue to optimise these sessions if we run this webinar again next year. Lastly, you can see my Twitter account there on the screen.
I'll show you a couple of Twitter accounts at the end of the session today that you can use to stay up to date with AWS news and events here in the UK and Ireland, and also internationally across the AWS cloud, so please stay tuned for those and do follow us on social media at the end of the session.

OK, with the housekeeping out of the way, let's move on and talk about the content for this session, the Amazon Elastic MapReduce Masterclass. As you'll know, the Masterclass series webinars are intended to be a technical deep dive that goes beyond the basics of one specific AWS service, intended to educate you on how you can get the best out of it, show you how things work, and show you how to get things done. We've got an increasing emphasis for 2015 on including demos in these Masterclass webinars, and you're going to see a couple of demos today of workflows with the Amazon Elastic MapReduce service.

Before we get to that: what is EMR, what is Elastic MapReduce? It provides a managed Hadoop framework that you can use to quickly and cost-effectively process vast amounts of data in the AWS cloud, using parallel, distributed execution of data processing tasks. This makes it fast, easy, and cost-effective for you to process data, and with EMR you can use a wide variety of other distributed frameworks that exist within what's generally known as the Hadoop ecosystem. We'll cover many of those later in the session today: what their use cases are, and how you can access these other distributed frameworks within the ecosystem through the Amazon Elastic MapReduce service.

The basic characteristics of the service: it's easy to use. You can launch an EMR cluster in just a few minutes using the AWS console, the command line, or one of our SDKs, and we take away the heavy lifting of provisioning nodes, Hadoop setup, configuration, and tuning. EMR takes care of those tasks for you so that you can focus on the analysis or processing of your data.

It's low cost, with simple, predictable pricing: you pay an hourly rate for each instance-hour that you use, and you can launch a 10-node Hadoop cluster for as little as 15 cents per hour, so it's very cost-effective. There's also native support for EC2 purchasing options, spot and reserved instances, enabling you to save 50 to 80 percent on the cost of the underlying instances if you use those purchasing options. We'll talk about how that can affect the cost of running jobs on EMR towards the end of the session today.

It's elastic, so you can provision one, hundreds, or thousands of compute instances and process data at any scale. Bear in mind that you may need to have your account limits uplifted if you want to operate EMR at scale, just as you do if you want to operate EC2 at scale; you should contact us if you want to do that. You can easily increase or decrease the number of instances you're using for your cluster, and you only pay for what you use.

It's reliable: we've tuned Hadoop for the EMR platform and for the AWS cloud, integrating it with services like S3 and CloudWatch for storage and metrics respectively. It's secure, and we'll touch on this later: you can encrypt data inside EMR using S3 server-side encryption, or client-side encryption with EMRFS, including integration with our Key Management Service, and there's also integration with Identity and Access Management (IAM), so you can control precisely who can use your EMR cluster and precisely what actions your EMR cluster can take with other AWS services. And lastly, flexibility: you've got complete control of your cluster, you have root access to every instance, you can easily install additional applications and customise your cluster in a huge variety of different ways, and we also support multiple Hadoop distributions and, as I've said, a wide variety of Hadoop ecosystem tools.
So, some of the common use cases for EMR. These are just three examples; the uses that customers put this service to are much broader than this. Clickstream analysis: the processing and analysis of high volumes of logging data that might come from web applications or from ad-tech use cases, a very common use case for EMR. Genomics, where customers are processing vast amounts of genomic data, and in fact other large scientific data sets as well, using EMR as a framework to build parallel computing tasks on. And lastly, log processing from web and mobile applications, maybe for optimisation purposes, for A/B testing analysis, or to gain insights into the patterns of usage of your applications amongst your users. These are all very common use cases for EMR.

Now that we've introduced EMR, let's take a look at the agenda for the rest of the session. We're going to talk first about Hadoop fundamentals: how does the distributed processing engine in Hadoop work, and what sort of benefits can you get from using that engine? I'll then talk about the core, fundamental features of EMR, and take a look at how to get started with EMR, including a demo of a more traditional batch-based streaming workflow on the EMR platform. We'll then take a look at supported Hadoop ecosystem tools, and after that at additional EMR features; this section is going to include a lot of references to locations on the Amazon EMR documentation site where you can find out more about those features, because there's a lot of detail there which we're probably not going to have time, unfortunately, to dive into in this session. Then we'll quickly cover third-party tools before closing with a look at resources that you can use to learn more about EMR.

So, Hadoop fundamentals. Why is Hadoop useful, and why has it risen as a tool for dealing with big data challenges? The answer is that there's an increasing requirement amongst customers to deal with very large volumes of data. This really rose with things like the growth of ad tech and large-scale web applications with tens of millions, hundreds of millions, or even billions of users today, and it has led to an increase in the difficulty of extracting information from these very large data sets. Hadoop was developed as a mechanism for parallel execution of data storage and processing tasks, in essence allowing you to split these very large data sets into many small pieces and distribute those across a cluster; in our case it's an EMR cluster, but it could be a generic Hadoop cluster. This allows you to execute compute tasks close to where the data is stored across the cluster, using a combination of a distributed job and task scheduling framework and a distributed file system, the Hadoop Distributed File System (HDFS). This distributed processing allows you to aggregate the results from all the nodes together and ultimately extract your insight from that very large data set, and to do that in a fraction of the time of more traditional analytics approaches. So that's what Hadoop is, in essence.

Now for the core features that EMR provides around that, in the sense that it's a framework for running Hadoop inside the AWS cloud. We touched upon this briefly already: elasticity enables you to quickly and easily provision as much capacity as you need, and to add or remove capacity at any time.
That's very useful if you have variable (which is very common) or unpredictable processing requirements. For example, if the bulk of your processing occurs at night, you might need 100 instances during the day and 500 instances at night, and with EMR you can quickly and easily add those additional 400 instances, paying for them only during the period that they're running.

This elasticity can take two forms. First, you can deploy multiple clusters: if you need more capacity, you can easily launch a new cluster and terminate it when you no longer need it. There's no limit to how many clusters you can have; as long as the EC2 instance limit on your account is sufficiently high, you can launch as many clusters as you have instances available to you. You may want to use multiple clusters if you have multiple users or applications: you can store your input data on S3 and then launch multiple application processing clusters in EMR working with that same data set, and you can optimise those clusters in different ways, some for CPU, others for storage, I/O, or memory. Secondly, you're able to resize running clusters. It's very easy with EMR: if you're storing your data in HDFS and you want to temporarily add more processing power, you can temporarily add additional instances to access that processing power, adding and removing capacity as you need (a rough sketch of doing this programmatically appears at the end of this passage).

Next, as we've already mentioned, it's low cost, and there are several reasons why. First, you've got the low hourly pricing of EC2 to take advantage of. There's also an opportunity to further reduce cost through integration with the EC2 spot market, which is a mechanism for bidding on what's currently unused EC2 capacity. The spot market has a floating price that fluctuates based on supply and demand for instances, but you'll never pay more than the maximum price you specify when you make a bid. EMR is integrated with spot instances, which means you can save both time and money by introducing additional capacity from the spot market, reducing the time it takes to execute your job, and doing so with capacity at a lower price point than on-demand instances. Similarly, you can reduce your cost by making use of the reserved instance integration. Reserved instances allow you to save up to 65 percent of the on-demand price for instances by paying a low one-time upfront fee, which gives you a significant discount on the hourly charge for that instance. So if you're going to be repeatedly making use of the same instance type over a period of time, it's a very good idea to evaluate whether or not reserved instances offer you a more cost-effective mechanism for accessing that capacity.

Then there's S3 integration. This allows you to decouple compute from storage, something we talked about in our last Journey Through the Cloud webinar and a very important characteristic of the cloud. You can shut down clusters when they're no longer needed if you store your input and output data in S3, essentially staging data in S3 and then loading it back into EMR when you need to apply computing resources to it. EMR has very strong performance when reading from and writing to S3, and I'll talk later in the session about how you can parallel-load data from S3 into EMR. This characteristic of decoupling compute from storage is an important way to optimise costs when running Hadoop workloads on the EMR service.
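To make that resizing point concrete, here is a minimal sketch, not taken from the webinar itself, of resizing a running cluster's task instance group with the boto3 SDK; the cluster ID and target instance count are hypothetical placeholders.

# A minimal sketch (not from the webinar) of resizing a running EMR cluster's
# TASK instance group with boto3. The cluster ID is a hypothetical placeholder.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

cluster_id = "j-EXAMPLECLUSTERID"  # hypothetical: use your own cluster ID

# Find the TASK instance group of the running cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Request a new total count; EMR adds or removes nodes to match it.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,  # desired total number of task nodes
    }],
)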
This low cost has been verified by a third party. If you want an independent study comparing the cost of running Hadoop workloads on Amazon EMR versus more traditional deployment models for Hadoop, for example private Hadoop clusters operated on your own premises or in a colocation facility, take a look at this study from Accenture, which covers that in some detail; there's a link at the bottom of the page where you can find the full details. In essence, Accenture's conclusion was that EMR offered better price-performance than the alternative mechanisms for creating and operating Hadoop clusters.

The concept of flexible data stores is the third core foundational component of EMR, and since we're looking at mechanisms to reduce the cost of processing large amounts of data, integration with a variety of different storage options is an important characteristic of that. We've just talked about S3, but you actually have several different options available to you: HDFS, the native distributed file system within Hadoop itself; DynamoDB, where you might store frequently accessed data that's being changed by applications very frequently and then join it into EMR workflows; Redshift, our petabyte-scale data warehouse; Amazon Glacier, our low-cost archiving service; and of course the Amazon Relational Database Service. These are all options you have for storing data inside the AWS cloud and integrating it with the EMR service.

As we mentioned, the most foundational of those is S3. It allows you to decouple your storage and compute resources, and you can make use of S3 core features such as server-side encryption for data stored in S3. When you launch your cluster, EMR streams data in from S3 and makes it available to the compute resources inside EMR, and as I said a second ago, multiple clusters can process the same data concurrently.

There are of course other storage options as well. The Hadoop Distributed File System, or HDFS: in EMR, HDFS uses local ephemeral storage on each one of the instance nodes, and depending on your instance type this could be either spinning disk or solid-state storage. Every instance in your cluster has that local ephemeral storage, but you decide which instances will run the HDFS component. EMR calls instances that are running HDFS "core nodes" and instances that are not running HDFS "task nodes", so you need to be aware of that terminology. Then DynamoDB: as we mentioned earlier, EMR has direct integration here, so you can quickly and efficiently process data stored in DynamoDB and transfer data between DynamoDB, S3, and HDFS using EMR if you wish to do so. We also support other data stores, and we have an additional service called Data Pipeline, which we'll touch upon later, that provides a mechanism for moving data between different data stores and EMR, as well as on-premises data stores, at specified intervals if you wish to do that.

So you'll see that EMR sits at the centre of an ecosystem, with those data storage options at the bottom and a variety of different Hadoop ecosystem tools sitting above the service, as you can see in this chart. We'll come to some of those ecosystem tools in a few minutes, but before we do that, let's take a look at getting started with Amazon Elastic MapReduce.
So, getting started with Amazon Elastic MapReduce: what are the steps you need to go through in order to make use of this service?

The first thing is to develop your data processing application. You can use Java; Hive, which is an SQL-like language; Pig, a data processing language; or a variety of other options including Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides a set of code samples and tutorials to help you get started with developing data processing applications using these different options. You can find those at the URL you can see on this slide, and there's a wide variety of articles and walkthroughs about developing data processing applications in those different language options.

Once you've done that, upload your application and data to Amazon S3. If you've got a very large data set, you might want to take advantage of Amazon S3's multipart upload API, which we covered in some detail in the S3 Masterclass we ran a few weeks ago; you can find it on YouTube if you're interested. That's one option you have for parallelising large object uploads into S3; in fact the AWS CLI for S3 will do that for you automatically if you upload objects above a certain size threshold, and you can also do it yourself in software that you might develop using the AWS SDKs for Java, .NET, PHP, Python, or Ruby. A second option is to make use of AWS Import/Export: you send media devices with data on them to an AWS region, and we transfer the data into S3 for you, which you can subsequently load into EMR. If you need to do one-time or periodic loads of large data sets, that's an additional option available to you. And lastly, if you want to do much more regular uploads of large data sets, you may want to consider AWS Direct Connect, which is a mechanism for establishing a high-bandwidth, low-latency connection directly from your premises straight into an AWS region via a Direct Connect point of presence; we have those here in Europe in London and also in Frankfurt. Check out aws.amazon.com/directconnect if you're interested in that.

Once you've got your data in S3, you can then configure and launch your cluster, and you can start your cluster using the console, the CLI tools, or an SDK, as we talked about earlier. The first thing that happens when you do this is that a master instance group is created; this is an instance that's going to control the cluster. Secondly, a core instance group is created, and this is going to persist for the lifetime of the cluster that you've created. The core instances, as we've said earlier, use their ephemeral storage to provide HDFS storage capacity, and they run the DataNode and TaskTracker daemons, which are components of Hadoop. Then, optionally, you can add task instances, which are compute nodes that do not use their ephemeral storage as HDFS capacity but act as workers, with work distributed to them via the master node, which manages coordination and distribution of work and cluster state across both the core and task instance groups. A rough sketch of launching a cluster like this with the SDK follows below.
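As a concrete illustration of that master / core / optional task layout, here is a minimal sketch of launching a small cluster with the boto3 SDK. This is not taken from the webinar: the release label, bucket, key pair, bid price, and role names are illustrative assumptions rather than values used in the demo.

# A minimal sketch of launching an EMR cluster with boto3, mirroring the
# master / core / optional task instance-group layout described above.
# Bucket names, release label, key pair, and role names are illustrative.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="masterclass-demo",
    ReleaseLabel="emr-4.7.0",            # assumption: pick the release you need
    LogUri="s3://your-bucket/logs/",     # hypothetical bucket for cluster logs
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "Core", "InstanceRole": "CORE",        # runs HDFS
             "InstanceType": "m3.xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            {"Name": "Task", "InstanceRole": "TASK",        # optional workers
             "InstanceType": "m3.xlarge", "InstanceCount": 2,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "Ec2KeyName": "your-keypair",                       # hypothetical key pair
        "KeepJobFlowAliveWhenNoSteps": True,                # persist for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])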
Once your Hadoop cluster is running, you can optionally monitor it using the management console, the command line interface, the SDKs, or the APIs themselves. EMR is integrated with Amazon CloudWatch for monitoring and alarming, and it also supports popular monitoring tools like Ganglia. You can add or remove capacity to and from the cluster at any time, to handle more or less data, and for troubleshooting you can use the console's simple debugging user interface, which we'll talk about later.

Once your cluster has completed its task, or its series of steps, you can retrieve your output either from S3 or from HDFS, if you've set the cluster to persist after the flow has completed, and you can then visualise the data with third-party tools like Tableau or MicroStrategy. You can automatically terminate the cluster when processing is complete if you wish, in which case you need to make sure you're placing your results in S3 for collection; alternatively you can leave the cluster running, maybe give it more work to do interactively, or retrieve the output directly from HDFS. And if you do use S3, it can be used as the underlying file system for input and output data. That's a very powerful feature of the service, as we've said, because it enables you to decouple your compute from your storage and really attack the costs of running large-scale processing workloads on EMR Hadoop clusters.

OK, now we're going to move on and show you a demo of getting started with EMR using a sample Hadoop streaming application. We'll jump into the console in a second and show you how to get set up; while the cluster completes and runs the task, we'll come back and talk through a few more slides explaining precisely what's happening, and then we'll go back and take a look.

As usual, we're going to start our demo logged into the AWS console, and you can see that, as in most cases, we're operating in the Ireland region, eu-west-1. We're going to be working with the EMR service today, so we'll click into the EMR console, and I'm going to put a filter on here to remove clusters that have terminated; I've had quite a few clusters running over the last few days while I've been prepping for this demo. We're going to start with a fresh cluster here, and I just want to show you that we've got nothing running. We've also got an empty S3 bucket over here, called ianmas-aws-emr, which is the bucket we're going to be working with over the course of this demo; I just want to show you that it's empty at the moment.

So let's create a new cluster, and we're going to configure a sample application called Word Count, which we can select from this drop-down here. That will do what you'd expect: it will count the occurrences of words in a large number of text files that are stored in a particular S3 prefix. The output location we're going to use is the bucket we just looked at a second ago, which I have here, and I'm going to put my logs into an additional prefix in that same bucket as well, so my logging data is there and my output location is specified. I'm also going to tweak a few aspects of the job flow for this particular demo. Before we do that, we'll just name our cluster using a tag; you'll know that you can tag the majority of AWS services. You'll see the other console options allow you to do things like select the specific Hadoop distribution you're going to be working with, install additional applications if you wish to do so, configure filesystem encryption and filesystem consistency, specify the EC2 instance types you wish to use for your master, core, and task nodes, which we've covered earlier, specify aspects of security and access such as IAM roles and bootstrap actions, and also modify the steps of the particular job flow that we're going to run. Now I'm going to make a couple of changes here.
Here you can see the mapper that we're going to be using, the reducer that we're going to be using, and also the input location. I'm going to make a change to the output location here, because I'm going to add an additional step into my job flow, and I'm going to put my output into this intermediate prefix in the output bucket that we talked about earlier. Then I'm going to add an additional step, which is going to be another streaming step. Here I'm going to use the mapper /bin/cat, and I'll talk more about this later on, and a specific reducer, which again we'll cover when the job flow is running, in a couple of slides' time. I'm going to change my input location so that I'm working with that intermediate location where I stored the output from the previous step, just to show you how to work with a multi-stage job flow, and I'll put my output into the specific output location that you can see here. I need one argument for the second task, which is to specify that I'm going to have only one reducer running, and again we'll talk about why we've done that in a moment.

OK, so we've set up our cluster and our job flow, and if we hit Create Cluster you'll see that the cluster creation activity starts and you're taken to this status page, where you can monitor the status of the EMR cluster we've just created. You can see there that we're currently provisioning EC2 capacity. Let's jump over to the slides now and talk a little more about precisely what we've created there in terms of our job flow.

What we created was a Hadoop streaming job flow. Hadoop streaming is a utility that comes with the Hadoop distribution; it allows you to create and run map and reduce jobs with any executable or script as the mapper or the reducer. So you can write your processing applications in, for example, Python, PHP, or another scripting language, and use input from standard input and output through standard output to execute the actions that are required by your processing workflow. By default, each line of input or output represents a record with tab-separated keys and values, and I'll show you in a second precisely what we used for our mapper and our reducer. It was a two-stage flow, as you'll remember, and that was shown in the console in this way: the Word Count flow, which we modified from the sample application, plus the second additional step that I added myself.

For step one we're using the Hadoop streaming utility, and the arguments we had there were: files, where we specified the particular mapper we're going to be using, wordSplitter.py, which I'll show you in a second, a small piece of Python; the input, which is all the objects in the prefix that you can see there, a publicly accessible S3 bucket; and the output, our intermediate location, where we're going to store our results for processing by the second step in our job flow. The mapper is a very simple Python script, which you can see here: it reads words from standard input line by line, and for each word it finds it outputs to standard output a tab-delimited record in the format LongValueSum: followed by the word and the number of occurrences, which will be 1 in every instance, because we output a record each time we encounter a word. So that will give us something that looks like this.
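For reference, here is a minimal sketch of a streaming mapper in that style. It is a reconstruction from the description above, not the exact wordSplitter.py shipped with the EMR sample.

#!/usr/bin/env python
# A reconstruction of a wordSplitter.py-style streaming mapper, based on the
# description above (not the exact script used in the sample application).
# For every word read from standard input it emits:  LongValueSum:<word>\t1
import re
import sys

WORD = re.compile(r"[\w']+")

def main():
    for line in sys.stdin:
        for word in WORD.findall(line.lower()):
            # One record per occurrence; the aggregate reducer sums these up.
            sys.stdout.write("LongValueSum:%s\t1\n" % word)

if __name__ == "__main__":
    main()

Hadoop streaming pipes each input line to the script's standard input and collects whatever the script writes to standard output as the map output records.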
The reducer is "aggregate", a built-in reducer that basically sorts the inputs it receives and adds up the totals, so if you've got three occurrences of the word abacus, that would become abacus, tab, 3. The input, as I've said, is all the objects in that particular S3 bucket prefix, the output is written into that intermediate location, and one output object is created for each reducer, which is generally one per core on the EC2 instances running the reducer task phase.

We then added a second step into our job flow, and the reason for this is that we wanted to reduce our output down to a single output file. For that step we're going to use /bin/cat as the mapper, which accepts anything and returns it as text, and a reducer which will sort and aggregate the input from /bin/cat. We take the previous output as the input, and we write into the output location that's specified here; and because we're going to have a single reduce task, that will give us a single output object.

Now let's jump back into the console and take a look at how things are going. You can see that our job is terminating now with all steps completed, which means it was successful, and in fact if you go into the steps drop-down that you can see here, it tells us the steps that have been executed as part of our flow: successfully setting up Hadoop debugging, successfully executing the word count step, and then successfully executing the aggregation step of the streaming program that we created using the console. If we jump back into the S3 console and take a look in our bucket, you'll see that several prefixes have been created here. There's the intermediate prefix, which contains the output from that first phase, and you'll see we've got seven output files there. There's the logs prefix, which is comprehensive, and we'll cover precisely what's produced in logging later, but you can see that we've got a set of logging data that's been generated by the execution of our task flow. And then lastly, our output: first of all we've got an object called _SUCCESS, which indicates that our job flow completed successfully, and then we've got the part file here, which is our output file; there's only one of those because of the additional step that we added to our job flow. If I just minimise this window, I can show you the output. You can see here this is the output from our word count application, with precisely what you would expect: there's a very large number of words in our file, almost 30,000 different words whose occurrences have been counted, and some of those have large numbers of occurrences; there's "November", with 1,245 occurrences. So you can see that we've read and processed a very large amount of data as part of that processing workflow.

That concludes that quick demo of setting up batch-based activities using Amazon EMR. Let's jump back now to the slides and continue the session by talking about some of the Hadoop ecosystem tools supported by Amazon EMR. If you're familiar with the Hadoop ecosystem, you'll be familiar with many of these tools, because these are the commonly used open-source components that have extended the Hadoop ecosystem over the last few years.
Hive is an open-source data warehouse and analytics package that runs on top of Hadoop, providing a query language called HiveQL that allows users to structure, summarise, and query data. It's been extended to support some of the functionality provided within Hadoop, such as MapReduce functions and extensible user-defined data types such as JSON. Amazon Web Services has extended Hive for integration with EMR, providing integration with DynamoDB and Amazon S3: for example, you can load table partitions automatically from Amazon S3, write data to tables in S3 without using temporary files as an intermediary step, and access resources in S3 such as scripts for custom map and reduce operations and additional libraries.

Pig is another commonly used open-source analytics package that runs on top of Hadoop. It uses a language called Pig Latin which, as you probably know, is another SQL-like language that allows users to structure, summarise, and query data, with SQL-like operations plus further support for map and reduce functions and user-defined data types. This allows for complex processing of unstructured data sources such as text documents and log files. Again, there are further improvements that have been made to Pig by Amazon Web Services for EMR, including the ability to use multiple file systems; generally Pig can only access one remote file system, but we've removed that restriction as part of the improvements made for deploying Pig on the EMR platform.

Lastly in this group, HBase, an open-source, non-relational, distributed database. It provides an efficient solution for storing large quantities of sparse data using column-based compression and storage techniques, and HBase is optimised for fast lookup because data is stored in memory rather than on disk. It's able to provide SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC) drivers. Also, with EMR you can back up HBase to Amazon S3, doing either full or incremental, manual or automated backups, and you're able to restore from previously created backups on demand. So there are quite a lot of extensions to those commonly used open-source Hadoop ecosystem tools that have been created specifically to make them work more effectively when running on EMR.

Other ecosystem tools are supported too. Impala allows for interactive, ad hoc querying using SQL syntax; instead of using MapReduce, it leverages an MPP (massively parallel processing) engine similar to that found in traditional RDBMS systems. With Impala you can query your data in HDFS or HBase tables very quickly, using regular BI tools through the ODBC and JDBC drivers that are provided. It uses the Hive metastore to hold information about the input data, including partition names and data types, and you can build a responsive data warehousing system on top of EMR using Impala. I did a demo of that last year, actually, at a conference in London, and I've still got some material on it, so if you want to hear more about that, ask me through the Q&A panel and I'll send you some further information about getting started with Impala.

Presto is an open-source distributed SQL query engine, once again for running interactive analytic queries. It integrates with S3 and supports ANSI SQL, and you can use bootstrap actions to install and run Presto on EMR and process data directly from S3. There's a good post about that on the AWS Big Data blog, which is reasonably recent, so if you want to learn more about Presto on EMR, that's a good source to check out.
Then lastly, Hue. This is a pretty cool tool, actually: an open-source user interface for Hadoop that makes it easier for users to develop and run Hive queries, to manage data in HDFS, to run and develop Pig scripts, and to manage tables as well. It provides a web-based interface, a GUI for Hadoop, integrating, as you might expect, with S3: you're able to query data directly against S3 and easily transfer files between HDFS and S3. It's a relatively new addition to EMR, and the next thing I want to do in the session today is show you a quick demo of setting up and getting started with Hue on EMR, using some of the sample data sets that are provided. So let's take a look at that now: a quick demo of Apache Hue on EMR.

We're back in the Elastic MapReduce cluster list, as you can see here, and our previously active cluster has now disappeared; if we look at terminated clusters, you'll see it at the top with all steps completed. We're going to start a new cluster, and this is going to be a cluster that persists for interactive operations, because we're going to be working with this particular cluster over the course of a quick demo showing you the Apache Hue environment. Again, we'll give our cluster a name via a tag, and we'll check that we're having the Hue application installed; you can see it's here, version 3.7.1. We don't need to do anything else to this cluster other than ensure that we have an EC2 key pair, and we need an EC2 key pair because we're going to be using SSH local port forwarding to get access to the web interface for the Hue console so that we can work with Hue. And that's it. The main difference here is that our cluster is not set to auto-terminate: we want this cluster to persist so that we can work with it interactively. So we'll just click on the Create Cluster button, and I'm going to do a bit of webinar magic here and time-shift things for you, so I'll speed up this next section of the video and we'll return in a second, when this process has been completed and Hue is available for us to work with interactively.

You can see that the process is now completed and we're in the "Waiting: after step completed" status, which shows the cluster is ready for the next action. If you take a look once again at the steps panel down here, you can see that those setup steps have been successfully completed. Now the next thing we need to do is set up an SSH tunnel to enable us to access the interface for Hue, and I have that command in the terminal window on my other screen, which I'll quickly show you. This command will create a tunnel from port 8160 on my local machine, through the SSH tunnel (on port 22, obviously), connecting to port 8888 on the remote instance, which is the port that the Hue graphical user interface is exposed on.
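For readers following along, here is a minimal sketch of what that local port forwarding can look like when driven from Python; the key file path and master node DNS name are hypothetical placeholders, and the port numbers are the ones mentioned above.

# A sketch of SSH local port forwarding to reach Hue on the EMR master node.
# The key file path and master public DNS name below are hypothetical placeholders.
import os
import subprocess

key_file = os.path.expanduser("~/keys/your-keypair.pem")        # hypothetical key pair
master_dns = "ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com"  # hypothetical master node

# Forward local port 8160 to port 8888 on the master node, where Hue listens.
# -N means no remote command, just forwarding; "hadoop" is the default EMR login user.
# This call blocks until the tunnel is interrupted (Ctrl-C).
subprocess.run([
    "ssh", "-i", key_file, "-N",
    "-L", "8160:localhost:8888",
    "hadoop@" + master_dns,
])

With the tunnel up, http://localhost:8160 in a local browser reaches the Hue login page, which is exactly what happens next in the demo.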
If I bring back my browser and load this URL, which is port 8160 on my local machine, you'll see we're presented with the first login screen for Hue. I can put a password in here, and that will create an admin account that will allow me to work with the user interface for Hue and create and execute tasks there. I'm going to use one of the sample projects that's available, this S3 access log sample, which is available in the default installation of Hue, just to show you the interface and how it works.

This is a Hive query that's previously been created, and you can see that the structure of the Hive query is defined here. We're mapping to an external table, which you can see is at this location here, so an external table in a publicly accessible S3 bucket, which in this case contains S3 access logs. We're using a SerDe, a serialiser/deserialiser, which has been defined, as you can see here, specifically for dealing with S3 access logs; and at the bottom of this specification you can see that the structure of the table is defined: bucket owner, which is a string; bucket name, which is a string; date/time, a string; and so on. So this is the format of the data that's stored in this particular location. Then there's a series of Hive queries defined: get the distinct S3 paths that were accessed, get the number of GET requests per requester, and get the access count for each S3 path after a given timestamp, and you can see there are HiveQL queries here that correspond to each of those. If we execute, you'll see that each one of these queries runs in turn from the Hive script, and after a few seconds the data is presented to us. Here you've got detailed logging information about the execution of this particular Hive job on this EMR cluster, and after a minute or so you'll see that our results are made available to us: these are the distinct access keys for the S3 paths which have been accessed, and we can cycle through the rest of the queries in this script just by hitting Next. I hope you can see that this provides a really good graphical interface for people who might be less technical users to work with data on an EMR cluster through the Hue interface. OK, that pretty much concludes what I wanted to show you with Hue, so we're going to switch back over to the slides now and finish up today's session.

The next thing to talk about is the latest Hadoop ecosystem tool for which support has been added to EMR, and that's Spark. Apache Spark is a newer execution engine in the Hadoop ecosystem that is really very good at providing more rapid, more responsive execution of data processing and analytics tasks than the traditional approach using MapReduce. It uses a directed acyclic graph (DAG) execution engine, which allows it to create a more efficient query plan for data transformations. This was covered in much more detail in a recent post, on the 16th of June, on the AWS blog, written by Jon Fritz from the Amazon Elastic MapReduce team; if you want the full details on support for Spark and its availability, I'd really recommend you go and check out that post, and you can find the URL at the top of this slide. This is really the latest evolution of EMR. It's very simple to create a cluster, and you can see an example here of cluster creation using the AWS CLI tool for EMR, then SSHing into the master node. Once you've done that, you can invoke the Spark shell, or the Python Spark shell using pyspark, and you're then able to work interactively with an instance of Spark that has been created using the EMR framework. There's an example here of executing a simple string-counting workflow using the PySpark shell, which you can see executed, once again using test data that's available in a publicly accessible bucket on Amazon S3.
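To give a flavour of what such an interactive PySpark session involves, here is a minimal, generic word-count sketch in the same spirit; it is not the exact example from the slide, the S3 paths are hypothetical placeholders, and reading s3:// input relies on EMR's S3 filesystem integration.

# A generic PySpark word-count sketch, not the exact commands shown in the webinar.
from pyspark import SparkContext

sc = SparkContext(appName="word-count-sketch")  # already provided as `sc` in the pyspark shell

lines = sc.textFile("s3://your-bucket/input/")  # hypothetical input prefix

counts = (lines
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # sum the counts per word

# Print a small sample of the results, then persist the full set back to S3.
for word, count in counts.take(10):
    print(word, count)
counts.saveAsTextFile("s3://your-bucket/spark-output/")  # hypothetical output prefix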
So I would recommend going to the URL at the bottom of this slide, which is the developer guide for running Spark clusters with EMR; it will give you a whole lot more detail about how you can make use of Spark and run your Spark applications on Spark deployed using EMR.

OK, the last thing to talk about in the main part of the content for this session is additional EMR features. These are, in my view, important integrations between EMR and other AWS services. I'm going to cover six or eight different topic areas quite quickly and provide you with documentation references where you can go and learn more about these specific features, how they might be useful to you, and how you can use them.

The first is network control. You can launch your EMR cluster inside an Amazon Virtual Private Cloud, or VPC. As you may know, this is a logically isolated section of the AWS cloud that gives you complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. You can also connect VPCs back to your own private network or your own wide area network using VPN gateways and the AWS Direct Connect service. This means it's quite possible to create a private Hadoop cluster, a private EMR environment inside AWS, which is not accessible from the internet or public in any sense; it's available solely to users on your own internal network, and we have many, many customers that do that today. Another option is to integrate with services like Amazon WorkSpaces, a service for running desktops inside the AWS cloud. Using WorkSpaces enables you to run BI tools adjacent to your EMR cluster for optimal performance and also security: you could run, in very simple terms, things like Excel query tools inside Amazon WorkSpaces, which would then be on the network adjacent to your EMR cluster, and none of your data would leave the EMR environment, or rather your VPC, if you did that. Another option, of course, is to use local port forwarding to access Hadoop service interfaces over SSH, as you saw me do in my demo a few minutes ago. There's a URL at the bottom of this slide with further details on setting up SSH local port forwarding for a variety of EMR web interface use cases, because there are several web interfaces available to you with EMR, not just Hue but other services as well, and there's full documentation on how to set that up at the URL you can see at the bottom of this slide.

Next, on the same topic really of security, is management of users and permissions, and of encryption. You can use AWS Identity and Access Management, or IAM, and tools such as IAM users and roles to control access and permissions; for example, it's quite simple to give certain users read but not write access to your clusters. You can also use various encryption options for data in Amazon S3 that's going to be processed by EMR: either server-side or client-side encryption to protect the data you store in S3. With server-side encryption, Amazon S3 protects your data with encryption after you upload it; with client-side encryption, you manage the encryption and the keys, but you can use the AWS Key Management Service, or KMS, to manage your keys for you, to simplify that.
You can enable either of these options at cluster creation time, as you'll have seen in the GUI when I quickly showed you the encryption options. They're mutually exclusive, global settings, so you can't combine them, but when either option is enabled, all Amazon S3 write actions happen through EMRFS using the form of encryption that you've selected. So there are plenty of options for securing your data, using the interface you can see there on the slide, in the console, or via the command-line tools.

You can install additional software with bootstrap actions using Amazon S3, and this enables you to install software and also to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on cluster nodes when EMR launches the cluster; they run before Hadoop starts and before the node begins processing data. You can write custom bootstrap actions, or you can use predefined bootstrap actions that are provided within the EMR service, and there's a documentation link at the bottom of this slide that includes full details of all of the predefined bootstrap actions provided by Amazon Web Services, as well as guidance on how to create your own bootstrap actions if you need to, for example, tweak settings in your JVM, or if you want to install another Hadoop ecosystem tool that is not currently supported by AWS; you can do that yourself using bootstrap actions, and perhaps contribute it back as well.

Efficiently copying data into EMR from Amazon S3 is a very important feature. It allows you to parallel-load data into EMR from S3 using MapReduce: you use MapReduce to run multiple parallel load tasks using a Java application called S3DistCp, which has been developed by AWS and made available as a standard component of EMR. This allows you to parallel-load data, making use of all of your cluster resources to very rapidly pull data down from S3 and load it into HDFS for subsequent processing within EMR. You can see a sample command line there, and as usual there's a link at the bottom of the slide that includes a lot more detail on how you can make use of S3DistCp to improve your data loading speeds (there's also a rough sketch of adding an S3DistCp step programmatically at the end of this passage).

On the topic of data loading, if you want to repeatedly load data into EMR clusters or set up a regular, scheduled, recurring workflow, we have a separate AWS service called AWS Data Pipeline that can help you do this. It enables you to create complex data processing workflows that are fault-tolerant, repeatable, and highly available. You don't have to worry about managing inter-task dependencies or dealing with failures or timeouts; Data Pipeline is designed to take care of those issues for you and enable you to more quickly and easily get access to data that might be distributed across a variety of different sources or locked up in on-premises data silos. If you're interested in that, see aws.amazon.com/datapipeline.

Cluster monitoring: there's not a whole lot to say about this, other than that there are 23 custom Amazon EMR metrics within CloudWatch, the AWS metrics service, such as the average number of running map and reduce tasks, and you can set alarms on these metrics as well. Full details on viewing metrics within CloudWatch, and the specific EMR metrics that are available to you, are at the URL you can see at the bottom of this slide about monitoring your cluster.
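As promised above, here is a rough sketch of what adding an S3DistCp load step programmatically can look like. It assumes an EMR 4.x-or-later release label, where the s3-dist-cp command is invoked via command-runner.jar; the cluster ID and bucket paths are hypothetical, and on older AMI versions the jar location and invocation differ.

# A rough sketch (not from the webinar) of adding an S3DistCp copy step to a
# running cluster with boto3. Assumes an EMR 4.x+ release label where
# command-runner.jar is available; cluster ID and bucket names are hypothetical.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",      # hypothetical cluster ID
    Steps=[{
        "Name": "Parallel copy from S3 to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://your-bucket/input/",   # hypothetical source prefix
                "--dest", "hdfs:///data/input/",      # HDFS destination on the cluster
            ],
        },
    }],
)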
Debugging is, I guess, another form of logging in many respects. As I showed you during my first demo, logging data is created each time an EMR cluster runs, and it's stored in an S3 bucket and prefix location that you provide at cluster setup. There are several different log files available to you: step logs, Hadoop logs, bootstrap action logs, and instance state logs, and the full details on debugging using logging are at the URL you can see at the bottom of the slide. There's also access to debugging data, and to metrics, through the EMR console interface itself: you can browse the logs in an intuitive way through that interface, as well as seeing a dashboard of metrics, so when you're creating jobs that may be useful for you. Take a look at the plan and debug guide at the URL at the bottom of this slide to learn more about how you can debug applications during the development phase on EMR.

There's also one additional Hadoop distribution available to you, which is the MapR distribution. This is a commercially provided, generally described as enterprise-grade, Hadoop distribution that supports a broad set of mission-critical and real-time production uses. There are too many additional features in the MapR distribution to go into here, but it includes features such as business continuity with cluster mirroring, allowing you to replicate differential data in real time across clusters, which is obviously useful if you're running mission-critical or high-availability workflows on the Hadoop platform; you may wish to make use of MapR for that. It provides random read/write access and a standard NFS interface, so that customers can mount their Hadoop cluster and leverage standard file-based applications with Hadoop, including standard ETL workflows that might load data onto NFS exports; that could be a useful feature for many use cases. And there are a lot of other features as well: ODBC drivers for Hive, fully automated provisioning, installation, and configuration of MapR clusters via EMR, and others. If you want to learn more about that, see aws.amazon.com/elasticmapreduce/mapr.

Then there's tuning, and a couple of different tuning options are available to you. Firstly, there's a variety of different EC2 instance types supported by EMR. You can have general-purpose instances, which have balanced CPU and memory; you can go for compute-optimised or memory-optimised instances, which have proportionally more CPU power or proportionally more RAM; or you can go for storage-optimised with the D2 instance family, which is new and provides instance sizes with 6, 12, 24, and 48 terabytes of local storage; these are dense-storage instances for very large data storage use cases. Lastly, we also have the GPU instances, with NVIDIA Kepler floating-point cards available in them, so if you want to combine Hadoop with applications that demand really intensive floating-point performance, there's an option available for you to do that as well. So one way to tune your cluster for cost and performance is to select the optimal instance type for your workload.

The second way to tune is to inject resources from the AWS spot market. It's slightly counterintuitive, in some respects, that adding additional compute capacity can reduce the cost of executing jobs in EMR, but it's definitely the case. You can see a worked example here: a four-node cluster built using on-demand instances, with a total execution cost of $28, compared with a nine-node cluster built using a combination of on-demand and spot instances, where the spot instances are provided at 50 percent of the on-demand cost (that would be setting a maximum bid price of one half of the on-demand price); the cost of running the job under that scenario is $22.75.
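A quick reconstruction of that arithmetic follows, assuming illustrative rates of $0.50 per hour on-demand and $0.25 per hour spot, and assuming the extra spot capacity halves the runtime from 14 hours to 7; these assumed figures reproduce the dollar totals quoted on the slide, but the actual rates depend on instance type and region.

# Reconstruction of the slide's spot-market example with assumed hourly rates.
on_demand_rate = 0.50   # assumed on-demand price per instance-hour
spot_rate = 0.25        # assumed spot price (a bid of half the on-demand price)

# Scenario 1: 4 on-demand nodes, job takes 14 hours.
cost_on_demand_only = 4 * 14 * on_demand_rate                  # = $28.00

# Scenario 2: the same 4 on-demand nodes plus 5 spot task nodes halve the runtime.
cost_with_spot = 4 * 7 * on_demand_rate + 5 * 7 * spot_rate    # = $14.00 + $8.75 = $22.75

saving = 1 - cost_with_spot / cost_on_demand_only
print("On-demand only: $%.2f" % cost_on_demand_only)
print("With spot:      $%.2f" % cost_with_spot)
print("Cost saving:    %.0f%%, with a 50%% reduction in runtime" % (saving * 100))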
That's a cost saving of nearly 20 percent together with a 50 percent time saving: it's costing less, and the task is being completed more quickly, by injecting additional resources from the spot market. So it's a very good tip, when you're running workloads in EMR, to experiment with and model the use of spot resources to establish whether or not they might make your workloads more cost-effective to execute.

OK, the last thing we're going to look at is third-party tools. I'm just going to talk very briefly about this, really just to provide a reference and say that if you check out the product detail page for EMR on the AWS website, you'll find a listing of third-party tools that can provide additional features and functionality together with EMR: things like visualisation tools, data integration, IDEs, data transfer tools, performance tuning tools, and others. You can see a sample of the vendors that are represented there here, and if you have a specific use case that you think may require some of the functionality provided by these tools, or additional functionality that we haven't covered during the session today, ask a question or check out the product detail page for EMR, particularly the partner section; there may be partner tools in the AWS ecosystem that can help you achieve what you want.

OK, just before we close up, let's take a look at some resources that you can use to learn more about Amazon's Elastic MapReduce service. Most important, of course, is the EMR product page at aws.amazon.com/emr; all of the documentation is linked off this page. There's the developer guide, there are details on running Spark on EMR, and there's a host of other content; these are just some of the highlights of what you can find there. In the developer guide you'll find a Getting Started with Amazon EMR tutorial, and in fact several different walkthroughs for running different data analysis and processing workflows on the Amazon EMR service; that's a really good resource to check out. I actually took my first steps with EMR using one of the guides there, a Twitter sentiment analysis walkthrough that uses open-source Python tooling and Twitter API keys that you generate yourself to gather tweets containing certain terms; you can then run your own sentiment analysis against those tweets using NLP techniques and an open-source Twitter library called Tweepy, running on an EMR cluster. So that's a really good way to get started hands-on with the service and familiarise yourself with it; that's how I took my first steps with EMR. There's also a host of case studies for big data use cases on AWS, many of which feature EMR, although other services are covered there as well; those are all at aws.amazon.com/solutions/case-studies/big-data. And lastly, the Amazon EMR documentation is a great source of information; pretty much all of the references I've talked about today come from the Amazon EMR documentation site. We also have a training offering that focuses specifically on running big data workloads on AWS and has extensive coverage of EMR within it, the AWS Big Data specialty course. You can find it at aws.amazon.com/training, which is the central URL on this slide, and you may also want to check out the AWS certification and self-paced labs options as well.
OK, that concludes everything that we have for you during today's session. We've probably got time for a few Q&As, so if you have any questions, please submit those via the Q&A panel that you can see in the webinar interface, and I'll try to answer as many as I can in the session; if there are questions we can't get to today, we'll definitely come back to you on those by email. Please rate the webinar: we're going to switch the webinar into Q&A mode now, so please give us a rating from one to five, with five being the best, and leave us some qualitative feedback in the Q&A if you feel there are things we've done well or things we could improve; we'd love to have that. Also, let us have your vote as to whether or not you'd like to see a follow-up virtual office hours focusing on big data and Amazon Elastic MapReduce; if that would be useful to you, please indicate it in the voting panel now so that we can run that follow-up session. Don't forget to follow us on the social media accounts that you can see on the screen: @AWS_UKI for us here in the UK and Ireland, or @awscloud globally, and you can find me there too. The last thing to do before taking a few questions is just to thank you for giving us your time to attend this session today; we do appreciate you spending the time to learn a little bit more about Amazon Web Services, and I'd like to thank you for joining today's webinar. OK, let's turn to a few questions now. Thanks again.
Info
Channel: AWS Online Tech Talks
Views: 67,563
Rating: 4.9103141 out of 5
Keywords: Amazon Web Services (Website), Analytics (Industry), Apache Hadoop (Software)
Id: zc1_Rfb_txQ
Length: 59min 23sec (3563 seconds)
Published: Tue Jun 30 2015