What Is MapReduce? | What Is MapReduce In Hadoop? | Hadoop MapReduce Tutorial | Simplilearn

Video Statistics and Information

Captions
Introduction to MapReduce. MapReduce is a programming model that processes and analyzes huge data sets by logically splitting them into separate clusters: while Map sorts the data, Reduce segregates it into logical clusters, removing the bad data and retaining the necessary information.

Why MapReduce? Prior to 2004, huge amounts of data were stored in single servers. If a program ran a query against data stored on multiple servers, logically integrating the search results and analyzing the data was a nightmare, not to mention the massive effort and expense involved. The threat of data loss, the challenge of data backup, and reduced scalability caused the issue to snowball into a crisis of sorts. To counter this, Google introduced MapReduce in December 2004: the analysis of data sets could be done in less than 10 minutes rather than 8 to 10 days, queries could run simultaneously on multiple servers, search results could be logically integrated, and data could be analyzed in real time. The USPs of MapReduce are its fault tolerance and scalability.

Let's look at a MapReduce analogy. The steps presented here use vote counting after an election as the analogy. In step one, the ballot papers of each polling booth are given to a teller; this pre-MapReduce step is called input splitting. In step two, the tellers of all booths count the ballot papers in parallel; since multiple tellers work on a single job, execution is faster. This is the Map method. In step three, the ballot count of each booth under the assembly and parliament seat positions is combined and the total count for each candidate is generated; this is the Reduce method. Thus Map and Reduce help execute the job more quickly than an individual counter could. Vote counting is simply an example to understand the use of MapReduce: the key reason to perform mapping and then reducing is to speed up the execution of a specific process by splitting it into a number of tasks, thus enabling parallelism. If one person counted all of the ballot papers while everyone else waited for the count to finish, it could take a month to receive the election results; when many people count the ballot papers simultaneously, the results are obtained in one or two days. This is how MapReduce works.

Let's look at a word count example, where the MapReduce operation is explained using a real-world problem. The job is to perform a word count of a given paragraph. The input sentence is "the quick brown fox jumps over a lazy dog and a dog is a man's best friend", and we take it through the corresponding steps of splitting, mapping, shuffling, and reducing. The MapReduce process begins with the input phase, which provides the data on which the MapReduce process is to be performed; the sentence above is the input here. The next step is the splitting phase, which converts the job submitted by the client into a number of tasks; in this example the job is split into two tasks, one for each sentence. The mapping phase then generates a key-value pair for each input record: since this example is about counting words, each line is split into words (for example with a substring or tokenizer call), and each word becomes a key with a default value of 1. In the next step, the shuffling phase sorts the data based on those keys, so the words are arranged in ascending order. The last phase is the reducing phase, in which the data is reduced based on the repeated keys by incrementing the value of each duplicated key. The word "dog" and the word "a" are repeated, so the reducer collapses each duplicate key and increases its value according to the number of occurrences. This is how the MapReduce operation is performed.
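To make the word count flow concrete, here is a minimal sketch of what the mapper and reducer could look like with the standard org.apache.hadoop.mapreduce API; the class names and the tokenizer-based splitting are illustrative choices, not taken from the video.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // e.g. ("dog", 1)
        }
    }
}

// Reduce phase: sum the 1s emitted for each distinct word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // e.g. ("dog", 2)
    }
}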
Map execution phases. Map execution consists of five phases: the map phase, the partition phase, the shuffle phase, the sort phase, and the reduce phase. The assigned input split is read from HDFS, where a split is a file block by default, and the input is parsed into records as key-value pairs. The map function is applied to each record to return zero or more new records. These intermediate outputs are stored in the local file system as a file, sorted first by bucket number and then by key, and at the end of the map phase completion information is sent to the master node. In the partition phase, each mapper must determine which reducer will receive each output. For any key, regardless of which mapper instance generated it, the destination partition is the same; so for a single word, that word always goes to the same destination partition. Note that the number of partitions equals the number of reducers. In the shuffle phase, input data is fetched from all map tasks for the portion corresponding to the reduce task's bucket. In the sort phase, a merge sort of all map outputs occurs in a single run. Finally, in the reduce phase, a user-defined reduce function is applied to the merged run; its arguments are a key and the corresponding list of values, and the output is written to a file in HDFS.

Map execution in a distributed two-node environment: the mappers on each node are assigned an input split of blocks. Based on the input format, the record reader reads each split as key-value pairs. The map function is applied to each record to return zero or more new records, and these intermediate outputs are stored in the local file system. A partitioner then assigns the records to reducers. In the shuffling phase, the intermediate key-value pairs are exchanged by all nodes; the key-value pairs are then sorted by key and the reduce function is applied. The output is stored in HDFS based on the specified output file format.

The essentials of each MapReduce phase are shown on the screen. The job input is specified as key-value pairs, and each job consists of two stages: first, a user-defined map function is applied to each input record to produce a list of intermediate key-value pairs; second, a user-defined reduce function is called once for each distinct key in the map output and is passed the list of intermediate values associated with that key. Furthermore: the number of reduce tasks can be defined by the user; each reduce task is assigned a set of record groups, that is, intermediate records corresponding to a group of keys; for each group, a user-defined reduce function is applied to the record values; and the reduce tasks read from every map task, with each read returning the record groups for that reduce task. The reduce phase cannot start until all mappers have finished processing, so combining the map output is an important step once all the map tasks are completed.
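As a sketch of the partition phase just described, the following hypothetical partitioner sends every occurrence of the same word to the same reducer; Hadoop's default HashPartitioner uses essentially the same formula, and the class name here is invented for the example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record with the same key lands in the same partition, and the number
// of partitions equals the number of reduce tasks configured for the job.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered in the job driver with job.setPartitionerClass(WordPartitioner.class).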
MapReduce job. A job is a MapReduce program that causes multiple map and reduce functions to run in parallel over the life of the program; many copies of the map and reduce functions are forked for parallel processing across the input data set. A task is a map or reduce function executed on a subset of this data. With this understanding of job and task, the Application Master and Node Manager functions become easier to comprehend. The Application Master is responsible for the execution of a single application or MapReduce job: it divides the job request into tasks and assigns those tasks to Node Managers running on one or more slave nodes. The Node Manager has a number of dynamically created resource containers; the size of a container depends on the amount of resources it contains, such as memory, CPU, disk, and network I/O. It executes map and reduce tasks by launching these containers when instructed to by the MapReduce Application Master.

MapReduce and associated tasks: the map process is an initial step that processes individual input records in parallel, and the reduce process summarizes the output according to the goal coded in the business logic. The Node Manager keeps track of individual map tasks and can run them in parallel; a map job runs as part of a container executed by the Node Manager on a particular data node within the cluster. The Application Master keeps track of the whole MapReduce job.

Hadoop MapReduce job work interaction: initially, a Hadoop MapReduce job is submitted by a client in the form of an input file, or a number of input splits of files, each containing data. The MapReduce Application Master distributes the input splits to separate Node Managers and then coordinates with those Node Managers. If a data node fails, the MapReduce Application Master resubmits its task to an alternate Node Manager. The Resource Manager gathers the final output and informs the client of the success or failure status.

Let's look at the characteristics of MapReduce. MapReduce is designed to handle very large-scale data, in the range of petabytes and exabytes. It works well on write-once, read-many data sets, also known as WORM data. MapReduce allows parallelism without mutexes. The map and reduce operations are performed by the same processor, and those operations are provisioned near the data, since data locality is preferred; in other words, we move the application to the data and not the other way around. Commodity hardware and storage are leveraged to keep things cost effective, and the runtime takes care of splitting and moving data for operations.

Some of the real-world uses of MapReduce are as follows: simple algorithms such as grep, text indexing, and reverse indexing; data-intensive computing, which includes sorting large and small data sets, stream data, and structured data; data mining operations such as Bayesian classification, which you'll study later; search engine operations such as keyword indexing, ad rendering, and page ranking; enterprise analytics, to ensure the business is operating smoothly and with the best decision-making data available; Gaussian analysis for locating extraterrestrial objects in astronomy, which uses very large data sets; and Semantic Web and Web 3.0 indexing and operations.

Data types in Hadoop. The first data type is Text, which stores string data. The IntWritable data type stores integer data, and LongWritable, as the name suggests, stores long data; similarly, FloatWritable stores float data and DoubleWritable stores double data. There are also BooleanWritable and ByteWritable data types, and NullWritable is a placeholder used when a value is not needed. The illustration here shows a sample data type that you can create on your own; such a data type must implement the Writable interface. As you can see, Writable defines a serialization and deserialization protocol, and every data type in Hadoop is a Writable. WritableComparable defines the sort order; all keys must be of this type, but not values. IntWritable, LongWritable, and the various concrete classes you define for your own data types follow this pattern. Lastly, sequence files are binary-encoded files containing a sequence of key-value pairs.
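The video only shows the custom data type as a diagram, so here is a hedged sketch of what such a user-defined WritableComparable could look like; the class name and its fields are invented purely for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A hypothetical composite key holding a candidate name and a booth number.
public class CandidateKey implements WritableComparable<CandidateKey> {
    private String candidate = "";
    private int booth;

    // Serialization protocol required by Writable.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(candidate);
        out.writeInt(booth);
    }

    // Deserialization protocol required by Writable.
    @Override
    public void readFields(DataInput in) throws IOException {
        candidate = in.readUTF();
        booth = in.readInt();
    }

    // Sort order required by WritableComparable so the type can be used as a key.
    @Override
    public int compareTo(CandidateKey other) {
        int byName = candidate.compareTo(other.candidate);
        return byName != 0 ? byName : Integer.compare(booth, other.booth);
    }

    @Override
    public int hashCode() {
        return candidate.hashCode() * 31 + booth;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CandidateKey
                && candidate.equals(((CandidateKey) o).candidate)
                && booth == ((CandidateKey) o).booth;
    }
}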
Input formats in MapReduce. MapReduce can specify how its input is to be read by defining an input format. The table lists some of the input format classes provided by the Hadoop framework. The first class is KeyValueTextInputFormat, which creates a single key-value pair per line. TextInputFormat considers the key to be the position of the line in the file and the value to be the line itself. NLineInputFormat is similar to TextInputFormat, except that N lines make up each input split. MultiFileInputFormat implements an input format that aggregates multiple files into a single split. For SequenceFileInputFormat, the input file must be a Hadoop sequence file, which contains serialized key-value pairs.

Setting the environment for MapReduce development: first, ensure that all Hadoop services are live and running. This can be verified in two steps. First, use the jps command: type sudo jps and look for the five services that you need: NameNode, DataNode, NodeManager, ResourceManager, and SecondaryNameNode. There may be additional services used by a Hadoop cluster, but these are the ones required as core services. Next, let's look at uploading big data and small data. The command to upload any data, big or small, from the local file system to HDFS is hadoop fs -copyFromLocal followed by the source file address and the destination file address.

Now let's look at the steps of building a MapReduce program. First, determine whether the data can be processed in parallel and solved with MapReduce; for example, analyze whether the data is write-once, read-many (WORM) in nature. Then design and implement a solution as Mapper and Reducer classes in your code. Compile the source code against the Hadoop core libraries and package the code as an executable JAR. Configure the application job, such as the number of mapper and reducer tasks and the input and output streams. Then load the data, or use previously available data, and launch and monitor the job. You can then study the results and repeat any of the previous steps as needed.

Hadoop MapReduce requirements: the user or developer is required to set up the framework with the following parameters: the location of the job input in the distributed file system, the location of the job output in the distributed file system, the input format to use, the output format to use, a class containing the map function, and, optionally, a separate class containing the reduce function. If a job does not need a reduce function, there is no need to specify a reducer class in your code. The framework will partition the input, then schedule and execute the map tasks across the cluster. If requested, it will sort the results of the map tasks, and it will execute the reduce tasks with the map output. The final output will be moved to the output directory and the job status reported to the user.
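These parameters are typically wired together in a driver class. Here is a minimal sketch, assuming the WordCountMapper and WordCountReducer classes sketched earlier and input and output paths passed on the command line; the driver class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The map class and the (optional) reduce class.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // The input format and output format to use.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // The key and value types the job emits.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The job input and output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}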
Set of classes. This image shows the set of classes under the user supply and the framework supply. The user supply refers to the set of Java classes and methods provided to a Java developer for developing Hadoop MapReduce applications, and the framework supply refers to the workflow of a job that is followed by all Hadoop services. As shown in the image, the user provides the input location and the input format as required by the program logic. Once the Resource Manager accepts the input, the specific job is divided into tasks by the Application Master, and each task is assigned to an individual Node Manager. Once the assignment is complete, the Node Manager starts the map task; it performs shuffling, partitioning, and sorting on the individual map outputs. Once the sorting is complete, the reducer starts the merging process; this is also called the reduce task. The final step is collecting the output, which is performed once all of the individual tasks are completed; this reduction is based on the programming logic.

Let's look at MapReduce responsibilities. The basic user or developer responsibilities are: one, setting up the job; two, specifying the input location; and three, ensuring that the input is in the expected format and location. The framework responsibilities are: distributing the job among the Application Master and Node Manager nodes of the cluster, running the map operation, performing the shuffling and sorting operations, running the optional reduce phase, and finally placing the output in the output directory and informing the user of the job completion status.

Creating a new project: first, make sure Eclipse is installed on your system. Once it is installed, you can create a new project and add the essential JAR files needed to run a MapReduce program. To create a new project, click the File menu and select New Project, or press Ctrl+N to start the new-project wizard; the screen shows the New Project and Select a Wizard options for the first step. In step two, select Java Project from the list and click Next to continue. In step three, give the newly created project a name; in this case, type the project name WordCount and click Next. In step four, you include JAR files from the Hadoop framework so that the program can locate its dependencies in one location, and in step five you add the essential JAR files: go to the Libraries tab and click the Add External JARs button. After adding the JAR files, click Finish to complete the project setup.

Next, we check the Hadoop environment to ensure we can run MapReduce. It is important to verify that the machine setup can perform MapReduce operations, and this can be done with the example JAR files deployed by a Hadoop installation, using the command shown on the screen. Before executing this command, ensure that the words.txt file resides in the data_first location. In the on-screen example, the hadoop jar command is executed against the hadoop-examples JAR with wordcount as the program, data_first/words.txt as the input, and data_first/output as the selected output.
Advanced MapReduce. Hadoop MapReduce uses data types to work with user-given mappers and user-given reducers: data is read from files into the mappers and emitted by the mappers to the reducers, the processed data is sent back by the reducers, and the data emitted by the reducers goes into output files. At every step, the data is stored in Java objects.

The Writable data types in advanced MapReduce: in the Hadoop environment, all input and output objects sent across the network must obey the Writable interface, which allows Hadoop to read and write the data in a serialized form for transmission. Let's look at the Hadoop interfaces in some more detail. The interfaces in Hadoop are Writable and WritableComparable. As you have already seen, the Writable interface allows Hadoop to read and write data in a serialized form for transmission, and it consists of two methods, readFields and write. The WritableComparable interface extends the Writable interface so that the data can be used as a key and not only as a value; as shown here, a WritableComparable additionally implements the compareTo and hashCode methods.

Output formats in MapReduce. Now that you have completed the input formats, let's look at the classes for the MapReduce output formats. The default output format is TextOutputFormat, which writes records as lines of text with each key-value pair separated by a tab character; this can be customized by using the mapreduce.output.textoutputformat.separator property, and the corresponding input format is KeyValueTextInputFormat. SequenceFileOutputFormat writes sequence files to save on output space; this is a very compact and compressed representation of normal data blocks. SequenceFileAsBinaryOutputFormat writes key-value pairs in raw binary format into a sequence file container. MapFileOutputFormat writes map files as the output; the keys in a map file must be added in a specific order, so the reducer must emit keys in sorted order. MultipleTextOutputFormat writes data to multiple files whose names are derived from the output keys, and MultipleSequenceFileOutputFormat creates output in multiple files in a compressed form.

Let's look at distributed caching. The distributed cache is a Hadoop feature for caching files that are needed by applications. It helps boost efficiency when a map or reduce task needs access to common data: it allows a cluster node to read the imported files from its local file system instead of retrieving them from other cluster nodes in the environment. It supports both single files and archives such as .zip and .tar.gz. It copies files only to slave nodes; if there are no slave nodes in the cluster, the distributed cache copies the files to the master node. It allows access to the cached files from mapper or reducer applications, makes sure that the current working directory is added to the application path, and allows the cached files to be referenced as though they were present in the current working directory, vastly speeding up access.

Using the distributed cache, step one: set up the cache by copying the requisite files to the file system, as shown here. Here we see the bin/hadoop fs command using -copyFromLocal to copy the file lookup.dat into HDFS under /myapp/lookup.dat; the listing first shows that no file or directory of that name exists yet. Remember, hadoop fs -copyFromLocal takes a file from your local file system and places it in the target directory within the HDFS file system. Step two: set up the application's JobConf as shown in the example. In this case we create a new instance of JobConf and then call the DistributedCache.addCacheFile method with a new URI specifying the location, in this case /myapp/lookup.dat, with the file name lookup.dat; in the same way, each of the other calls shown creates a cache entry for .zip, .jar, .tar, .tgz, and .gz files. In step three of setting up your distributed cache, you use the cache files in the Mapper or Reducer class that you create: once the private Path field and the configure information are in your program, you simply declare a File instance, f, specifying a new file with a parameter such as ./map.zip/some/file/in/zip.txt; this maps the cached file into your task's working directory.
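Here is a hedged sketch of steps two and three using the older JobConf/DistributedCache API that the video describes (newer code would call job.addCacheFile on org.apache.hadoop.mapreduce.Job instead); the archive and JAR file names are illustrative placeholders.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupSketch {
    public static void main(String[] args) throws Exception {
        // Step 2: register files and archives with the job configuration.
        JobConf job = new JobConf();
        DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
        DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

        // Step 3 (inside a mapper or reducer): the unpacked archive contents
        // appear under the task's working directory and can be read like
        // ordinary local files, for example:
        //     File f = new File("./map.zip/some/file/in/zip.txt");
    }
}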
Joins in MapReduce. Joins are relational constructs that can be used to combine relations; in MapReduce, joins are applicable in situations where you have two or more data sets you want to combine. A join is performed either in the map phase or later in the reduce phase, taking advantage of the MapReduce sort-merge architecture. The join patterns available in MapReduce are the reduce-side join, the replicated join, the composite join, and the Cartesian product. A reduce-side join is used for joining two or more large data sets on the same foreign key with any kind of join operation. A replicated join is a map-side join that works in situations where one of the data sets is small enough to cache, which vastly improves its performance. A composite join is a map-side join used on very large, formatted input data sets that are sorted and partitioned by a foreign key. Lastly, a Cartesian product is a map-side join where every single record is paired with every record of another full data set; this style of join typically takes a significantly longer period of time to execute.

A reduce-side join works in the following way: the mapper first prepares for the join operation by taking each input record from every data set and emitting a foreign-key/record pair. The reducer then performs the join operation: it collects the values of each input group into temporary lists, the temporary lists are iterated over, and the records from both sets are joined. A reduce-side join should be used when multiple large data sets are being joined by a foreign key, when flexibility is needed to execute any join operation, when a large amount of network bandwidth is available (since data will be moved across the network), or when there is no limitation on the size of the data sets. The SQL analogy of a reduce-side join is given on the screen. In the output of a reduce-side join, the number of part files equals the number of reduce tasks, so if you have 10 reduce tasks you will have 10 separate part files.
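To ground the reduce-side join description, here is a minimal sketch under assumed inputs: two comma-separated data sets whose first field is the shared foreign key, with the file path used to tag each record's source. The class names, the "orders" path check, and the inner-join behaviour are illustrative assumptions rather than the video's own code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (foreignKey, taggedRecord) for every input record.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    @Override
    protected void setup(Context context) {
        // Tag each record with the data set it came from, based on its file path.
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        tag = path.contains("orders") ? "B" : "A";
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String foreignKey = line.split(",", 2)[0];
        context.write(new Text(foreignKey), new Text(tag + "|" + line));
    }
}

// Reducer: collect each side of the join into temporary lists, then pair them up.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            (s.startsWith("A|") ? left : right).add(s.substring(2));
        }
        for (String a : left) {                     // inner join on the foreign key
            for (String b : right) {
                context.write(key, new Text(a + "\t" + b));
            }
        }
    }
}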
Replicated joins. A replicated join is a map-only pattern; in other words, it does not use the reduce phase. It works as follows: it reads all files from the distributed cache and stores them in in-memory lookup tables, and the mapper then processes each record and joins it with the data stored in memory. There is no data shuffled to a reduce phase; the mapper produces the final output part files, so this type of join is typically very quick. A replicated join should be used when all data sets except the largest one can fit into the main memory of each map task, which is limited by the size of your Java Virtual Machine (JVM) heap, and when there is a need for an inner join or a left outer join, with the large input data set being the left part of the operation. A SQL analogy of the replicated join is given on the screen. In the output of a replicated join, the number of part files equals the number of map tasks; and again, because this join keeps one side in memory, it is typically much faster.

A composite join is a map-only pattern that works in the following way: all data sets are divided into the same number of partitions, each partition of each data set is sorted by a foreign key, and all records for a given foreign key reside in the associated partition of each data set. Two values are retrieved from the input tuples associated with each data set based on the foreign key, and the result is output to the file system. This type of join can be lengthy and, depending on the size of your data sets, can run for a very long time. The composite join should be used when all data sets are sufficiently large and when there is a need for an inner join or a full outer join. A SQL analogy of a composite join is displayed on the screen. In the output of a composite join, the number of part files equals the number of map tasks.

The Cartesian product. A Cartesian product is a map-only pattern that works in the following way: the data sets are split into multiple partitions, and each partition is fed to one or more mappers; for example, in the image shown here, split A-1 and split A-2 are each fed to three mappers. A record reader reads every record of the input splits associated with a mapper, and the mapper simply pairs every record of one data set with every record of all the other data sets. The Cartesian product should be used when there is a need to analyze relationships between all pairs of individual records and when there are no constraints on the execution time, as these joins can take a long time. In the output of a Cartesian product, every possible tuple combination from the input records is represented.

Hi there! If you liked this video, subscribe to the Simplilearn YouTube channel and click here to watch similar videos. To get certified, click here.
Info
Channel: Simplilearn
Views: 55,020
Keywords: what is mapreduce, what is mapreduce in hadoop, what is mapreduce and how it works, what is mapreduce with example, mapreduce in hadoop, hadoop mapreduce, hadoop mapreduce tutorial, mapreduce tutorial, mapreduce tutorial for beginners, mapreduce explained, mapreduce execution pipeline, mapreduce explained simply, mapreduce technique, mapreduce types, mapreduce hadoop explained, hadoop, hadoop tutorial, hadoop tutorial for beginners, simplilearn hadoop, simplilearn
Id: b-IvmXoO0bU
Length: 35min 6sec (2106 seconds)
Published: Tue Feb 25 2020