Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hadoop Training | Edureka

Video Statistics and Information

Captions
Hello everyone, this is Saurav from edureka, and in today's session we'll focus on big data. I have Reshma with me, who will also be sharing her knowledge about big data with us. Welcome, Reshma. Hi Saurav, and hello everyone, I hope you all find this session interesting and informative. I hope so too. So let's move forward and have a look at the agenda. This is what we'll be discussing today: we'll begin by understanding how data evolved, that is, how big data came into existence. Then we'll see what exactly big data is and what sort of data can be considered big data. After that we'll see how big data can be an opportunity. Now, obviously big data is an opportunity, but we know there are no free lunches in life, so we'll focus on the various problems associated with encashing this opportunity. And it wouldn't be fair if I told you only about the problems and not the solution, so we'll see how Hadoop solves these problems, dig a bit deeper into a few components of the Hadoop framework, and finally we'll tell you about the big data and Hadoop training provided by edureka, along with the various projects that are part of that course.

Now Saurav, I feel it's the best time to tell the story of how data evolved and how big data came about. Fine, Reshma, let's move forward. So what can you notice, Reshma? See how technology has evolved: earlier we had landline phones, but now we have smartphones, with Android and iOS, that are making our lives smarter along with making our phones smarter. Apart from that, we were using bulky desktops for processing megabytes of data. If you remember, we used floppies, and you know how little data they could store. Then came hard disks for storing terabytes of data, and now we can store data on the cloud as well. Similarly, nowadays even self-driving cars have come up. You must be wondering why we're telling you all this. If you notice, due to this advancement in technology we're generating a lot of data. Take the example of your phone: have you ever noticed how much data is generated because of your fancy smartphone? Your every action, even one video sent through WhatsApp or any other messenger app, generates data. This is just one example; you have no idea how much data you're generating with every action you take. Mind you, this data is not in a format that our relational databases can handle, and apart from that, the volume of data has also increased exponentially. Now, I was talking about self-driving cars. Basically, these cars have sensors that record every minute detail, like the size of an obstacle, the distance from the obstacle and many more, and then decide how to react. You can imagine how much data is generated for each kilometre you drive in such a car. I completely agree with you, Saurav.

So let's move forward and focus on various other factors behind the evolution of data. I think you must have heard about IoT. If you recall the previous slide, we were discussing self-driving cars, which are nothing but an example of IoT. Let me tell you what exactly it is: IoT connects your physical devices to the internet and makes them smarter. Nowadays we have smart ACs, TVs, and so on. Take the example of a smart air conditioner: this device monitors your body temperature and the outside temperature and accordingly decides what the temperature of the room should be. In order to do this, it first has to accumulate data, and it can accumulate that data from the internet and through sensors that monitor your body temperature and the surroundings. So basically it is fetching data from various sources that you might not even know about, and accordingly it decides what the temperature of your room should be. We can see that because of IoT we are generating a huge amount of data. There is one stat in front of your screen: by 2020 we will have 50 billion IoT devices. So I don't think I need to explain much about how IoT is generating huge amounts of data.

Let's move forward and focus on one more factor, which is social media. When we talk about social media, I think Reshma can explain this better, right Reshma? Yes Saurav, but I'm pretty sure even you use it. Let me tell you that social media is actually one of the most important factors in the evolution of big data. Nowadays everyone is using Facebook, Instagram, YouTube and a lot of other social media websites. These sites hold so much data: your personal details like your name and age, and apart from that, every picture that you like or react to also generates data, every Facebook page you go around liking generates data, and nowadays most people are sharing videos on Facebook, which generates a huge amount of data as well. The most challenging part here is that this data is not present in a structured manner, and at the same time it is huge in size. Isn't that right, Saurav? Correct, Reshma, and the point you made about the form of the data is actually one of the biggest factors in the evolution of big data. Due to all these reasons we have discussed, not only has the amount of data increased, but we've also seen that data is being generated in various formats: for example, video data is unstructured, and the same goes for images. So there are numerous, or you could say millions of, ways in which data is getting generated nowadays. Absolutely, and these are just a few examples; there are many other driving factors for the evolution of data.

So here are a few more examples because of which data is evolving and turning into big data. When we discuss the retail part, I'm pretty sure all of you must have visited websites like Amazon, Flipkart, etc., and Reshma, I know you visit them a lot. Yeah, I do. Suppose Reshma wants to buy shoes. She won't just directly go and buy a pair; she'll search through a lot of shoes first, so her search history will be stored somewhere, and I know for sure this won't be the first time she's buying something, so there will be her purchase history as well, along with her personal details. So there are numerous ways in which she is generating data without even knowing it. And obviously Amazon was not present earlier, so at that time there was no way such user-generated data existed. Similarly, data has evolved due to other domains as well, like banking and finance, media and entertainment, and so on.

So now the question is: what exactly is big data? How do we consider some data to be big data? Let's move forward and understand what exactly it is. Okay, now let us look at the proper definition of big data, even though we've put forward our own definitions already. Saurav, do you want to take us through it? Yes, Reshma. Big data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database system tools or traditional data processing applications.
Okay, so what I understand from this is that our traditional systems are a problem because they're too old-fashioned to process this data, or something? No, Reshma, the real problem is that there is simply too much data to process. When the traditional systems were invented, we never anticipated that we would have to deal with such an enormous amount of data. It's like a disease that has infected you: you don't change your body when you get infected, right Reshma? You cure it with medicine. Couldn't agree more, Saurav. Now the question is, how do we consider some data to be big data? How do we classify it, and how do we know which kind of data is going to be hard for us to process? Well Saurav, we have the five V's to tell us that, so let's take a closer look at them.

Starting with the first V: the volume of data, which is tremendously large. If you look at the stats here, you can see the volume of data is rising exponentially. Right now we are dealing with about 4.4 zettabytes of data, and by 2020, in just three years, it is expected to rise to 44 zettabytes, which is equal to about 44 trillion gigabytes. That is really, really huge. And all this humongous data is coming from multiple sources, which brings us to the second V, which is variety. We deal with so many different kinds of files all at once: MP3 files, videos, JSON, CSV, TSV and many more, and these are structured, unstructured and semi-structured all together. Let me explain this with the diagram on your screen. Here we have audio files, video files, PNG files, JSON, log files, emails: various formats of data. This data is classified into three forms. One is the structured format: here you have a proper schema for your data, so you know exactly what will be there; the data set is in a structured, or you could say tabular, format. Then we have semi-structured files: these are JSON, XML and CSV files, where the schema is not defined properly. And in the unstructured format we have blob files such as audio, video and images.

And Saurav, it is also because of the speed at which all this variety of data accumulates, which brings us to our third V, velocity. If you look here, earlier we were using mainframe systems, huge computers, but there was less data because fewer people were working with computers at that time. As computers evolved we came to the client-server model, then came web applications and the internet boom, and as the internet grew among the masses, web applications multiplied and everyone started using them, not only from their computers but also from mobile devices. More users, more devices, more apps, and hence a lot more data. And when you talk about people generating data over the internet, the one kind of application that strikes you first is social media. So tell me, how much data do you generate alone with your Instagram posts and stories? It wouldn't be fair if I only talked about myself, so let's include every social media user. If you see the stats in front of your screen, in every 60 seconds there are more than 100,000 tweets generated on Twitter, similarly 695,000 status updates on Facebook, 11 million messages, 698,445 Google searches and 168 million emails every minute, and that adds up to almost 1,820 terabytes of data. And obviously the number of mobile users is also increasing every minute: there are 217+ new mobile users every 60 seconds. Geez, that's a lot of data; if you went ahead and calculated the total, it would actually scare me. Yeah, that's a lot.

Now the bigger problem is how to extract the useful data from all of this, and that brings us to the next V, which is value. Here, you first need to mine the useful content from your data; basically, you need to make sure that you keep only the useful pieces in your data set. After that you perform certain analytics, or analysis, on the data you have cleaned, and you need to make sure that whatever analysis you have done is of some value: that it will help your business grow and can surface insights that were not possible earlier. So you need to make sure that whatever data has been generated makes sense and has some value to it. Now, getting the value out of this data is one big challenge, and let me tell you why, because that brings us to our next V, which is veracity. This big data has a lot of inconsistencies: obviously, when you're dumping such a huge amount of data, some data packets are bound to be lost in the process. What we then need to do is fill in the missing data, start mining again, process it, and come up with a good insight if possible. If you look at the diagram in front of your screen, you can see fields that are not defined, and a minimum value here that differs wildly from the other values in the same column, and similarly for this other element as well. So obviously, processing data like this is one problematic thing, and now I get why big data is a problem statement. Well, we have only five V's for now, but maybe later on we'll have more, so there are good chances that big data might become even bigger.

Okay, so there are a lot of problems in dealing with big data, but there are always different ways to look at something, so let us bring some positivity into the environment now and understand how we can use big data as an opportunity. Yes Reshma, and I would say the situation is similar to the proverb: when life throws lemons at you, make lemonade. Yeah, so let us go through the fields where we can use big data as a boon, and there are certain problems that got solved only because we started dealing with big data. And the boon that you're talking about, Reshma, is big data analytics. First of all, with big data we figured out how to store our data cost-effectively. We were spending too much money on storage before; until big data came into the picture, we never thought of using commodity hardware to store and manage data, which is both reliable and feasible compared to costly servers. Now let me give you a few examples to show you how important big data analytics is nowadays.
So when you go to a website like Amazon, YouTube, Pandora, Netflix or any other such site, they provide you feeds in which they recommend products, videos, movies or songs for you, right? So how do you think they do that? Basically, whatever data you are generating on these kinds of websites, they make sure they analyze it properly, and let me tell you, that data is not small; it is actually big data. They analyze that big data and, based on what you like and what your preferences are, they generate recommendations for you. When I go to YouTube, and I'm pretty sure you must have noticed this too, YouTube knows what song or video I want to watch next. Similarly, Netflix knows what kind of movies I like, and when I go to Amazon it shows me the products I would prefer to buy. How does that happen? It happens only because of big data analytics.

Okay, there is one more example that just popped into my mind, which I'll share with you. There was this time when hurricane Sandy was about to hit New Jersey in the United States, and Walmart used big data analytics to profit from it. I'll tell you how they did it. Walmart studied the purchase patterns of different customers when a hurricane or any other natural calamity is about to strike a particular area, and when they analyzed it, they found out that people tend to buy emergency stuff like flashlights and life jackets, and interestingly, people also buy a lot of strawberry pop-tarts. Strawberry pop-tarts, are you serious? Yeah, and I didn't do that analysis, Walmart did, and apparently it is true. So what they did is they stocked all their stores with a lot of strawberry pop-tarts and emergency supplies, and obviously everything sold out and they earned a lot of money during that time. But my question here is: do people want to die eating strawberry pop-tarts? What was the idea behind that? I'm pretty unsure about it, but since Walmart did the analysis and not us, it is still a very good example for understanding how big data analytics can help your business grow and find better insights from the data you have. Yeah, and if you want to know why strawberry pop-tarts, maybe later on we can do an analysis of our own by gathering some more data. Yeah, that could be possible.

Okay, so now let's move ahead and take a look at a case study of how IBM used big data analytics. Earlier, the meter in your home that measures the electricity you consume used to send data only once a month. Then IBM came up with this thing called the smart meter, which collects data every 15 minutes: whatever energy you have consumed, after every 15 minutes it sends that data, and because of that, big data was generated. We have some stats here: there are 96 million reads per day for every million meters, which is pretty huge. IBM realized that the data being generated was very important and that they could gain something from it, so they had to make sure they analyzed that data properly.
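Just to sanity-check that figure, here is a quick back-of-the-envelope calculation as a Python sketch; it assumes exactly one reading per meter every 15 minutes and a fleet of one million meters, which are the numbers quoted above:

```python
# Rough check of the smart-meter data volume quoted above (assumes exactly
# one read every 15 minutes and one million deployed meters).
READ_INTERVAL_MIN = 15
METERS = 1_000_000

reads_per_meter_per_day = (24 * 60) // READ_INTERVAL_MIN   # 96 reads per meter per day
reads_per_day = reads_per_meter_per_day * METERS           # 96,000,000 reads per day

print(f"{reads_per_meter_per_day} reads per meter per day")
print(f"{reads_per_day:,} reads per day for one million meters")
```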
They realized that big data analytics could solve a lot of their problems and get them better business insight, so let us move forward and see what type of analysis they did on that data. Before analyzing it, all they knew was that energy utilization and billing kept increasing. After analyzing the big data, they came to know that during peak load the users require more energy and during off-peak times they require less. What advantage could they get from this analysis? One thing I can think of right away is that they can tell industries to run their machinery during off-peak times so that the load stays balanced. You could even say that time-of-use pricing encourages heavy consumers, like industrial machinery, to run off-peak, and those consumers save money as well, because off-peak pricing is lower than peak-time pricing. So that is just one analysis.

Now let us look at the suite IBM developed. Here, you first dump all the data you receive into a data warehouse; after that it is very important to make sure the user data is secure; then you need to clean that data because, as I told you earlier, there may be many fields you don't require, so you keep only the useful data in your data set; and then you perform the analysis. In order to use this IBM suite efficiently, you have to take care of a few things. First, you have to be able to manage the smart meter data: there is a lot of data coming from all these millions of smart meters, so you must be able to handle that large volume and also retain it, because later on you might need it for regulatory requirements. Next, you should monitor the distribution grid, so that you can improve and optimize overall grid reliability and identify the abnormal conditions that are causing problems. Then you have to optimize the unit commitment: by doing so, companies can satisfy their customers even more, reduce power outages, and identify and fix problems. You also have to optimize energy trading, which means advising customers on when they should use their appliances in order to keep the power load balanced. And finally you have to forecast and schedule loads: companies must be able to predict when they can profitably sell excess power and when they need to hedge their supply.

Continuing from this, let's talk about how Oncor has made use of the IBM solution. Oncor is an electric delivery company; it is the largest electricity distribution and transmission company in Texas and one of the six largest in the United States. They have more than three million customers and their service area covers almost 117 thousand square miles. They began their advanced metering program in 2008 and have deployed almost 3.25 million meters serving customers in North and Central Texas. While implementing it, they kept three things in mind. The first is that it should be instrumented: the solution uses smart electricity meters to accurately measure the electricity usage of a household every 15 minutes, like we discussed, and it provides the data inputs essential for consumption insights. The next is that it should be interconnected: customers now have access to detailed information about the electricity they are consuming, and it creates an enterprise-wide view of all the meter assets, which helped Oncor improve service delivery. The third thing is to make it intelligent: since how each household consumes power is already being monitored, Oncor can now advise customers, for example to run the washing machine at night because they are using a lot of appliances during the day, so that they shift some appliances to off-peak hours and save more money. This is beneficial for both the customers and the company.

And Oncor gained a lot of benefits from the IBM solution. It enabled them to identify outages before customers were inconvenienced, which means they could spot problems before they were even noticed; it improved emergency response during severe weather events and outages; it gives customers the data they need to become active participants in managing their own power consumption; and it enables every individual household to reduce its electricity consumption by almost five to ten percent. So this is how Oncor used the IBM solution and got huge benefits out of it, just through big data analytics.

But let me interrupt here: as Reshma told us in the beginning, there are no free lunches in life, right? So this is an opportunity, but there are many problems in encashing it, so let us focus on those problems one by one. The first problem is storing this colossal amount of data. Let's look at the stats in front of your screen: the data generated in the past two years is more than everything generated in all of previous history combined. So guys, what are we doing? Stop generating so much data! It is said that by 2020 the total digital data will grow to approximately 44 zettabytes, and the one stat that amazes me is that about 1.7 MB of new information will be created every second for every person by 2020. Storing this huge amount of data in traditional systems is not possible; the reason is obvious, the storage of a single system is limited. For example, say you have a server with a storage limit of 10 terabytes, but your company is growing really fast and its data is increasing exponentially. At some point you exhaust all the storage, and investing in ever bigger servers is definitely not a cost-effective solution. So Reshma, what do you think the solution to this problem could be? According to me, a distributed file system would be a better way to store this huge data, because with it we'll save a lot of money: with a distributed system you can store your data on commodity hardware instead of spending money on high-end servers. Don't you agree, Saurav? Completely. Now, storing the volume is one problem, but let me tell you, it is just one part of the problem; let's see a few more. Since the data is not only huge but also present in various formats, unstructured, semi-structured and structured, you not only need to store this huge data, you also need a system that can store all these varieties of data generated from various sources.
Now let's focus on the next problem, and on the diagram here. You can notice that hard disk capacity is increasing, but disk performance, the access speed, is not increasing at the same rate. Let me explain this with an example. Suppose you have only a 100 MB/s input-output channel and you are processing, say, one terabyte of data. How much time will it take? It will be somewhere around 2.91 hours. And I took the example of one terabyte; what if you're processing zettabytes of data? You can imagine how much time that would take. Now, what if you have four input-output channels for the same amount of data? Then it will take approximately 0.72 hours, which is around 43 minutes. And again, imagine that instead of 1 TB you have zettabytes of data. So for me, the speed of accessing and processing huge data is the bigger problem.
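To make that arithmetic concrete, here is a small Python sketch of the same estimate; it assumes a sustained 100 MB/s channel and binary units (1 TB treated as 1024^4 bytes), which is how the roughly 2.91-hour figure above comes out:

```python
# Rough estimate of sequential read time for 1 TB at 100 MB/s,
# and the same data split across 4 parallel I/O channels.
# (Assumes sustained throughput and binary units, i.e. 1 TB = 1024**4 bytes.)
DATA_BYTES = 1024 ** 4                     # 1 TB
CHANNEL_BYTES_PER_SEC = 100 * 1024 ** 2    # 100 MB/s

single_channel_hours = DATA_BYTES / CHANNEL_BYTES_PER_SEC / 3600
four_channel_hours = single_channel_hours / 4

print(f"1 channel : {single_channel_hours:.2f} hours")       # ~2.91 hours
print(f"4 channels: {four_channel_hours:.2f} hours "
      f"(~{four_channel_hours * 60:.0f} minutes)")            # ~0.73 hours, roughly 43-44 minutes
```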
Okay, so Reshma has a very good example to discuss here. Yes, since you're talking about accessing the data, and you already told us how Amazon, YouTube and other websites make those recommendations: if there were no solution for this, it would take so much time to access the data that the recommendation system wouldn't work at all, and these companies make a lot of money just from their recommendation systems, because a lot of people go there, click on a recommendation and buy that product. So let's say it takes hours, or maybe even years, to process that big amount of data. Say at one time I purchased an iPhone 5s from Amazon, and two years later I'm browsing Amazon again; since it took so long to access and process the data, and I've already switched over to a new iPhone, they are still recommending me a case for the old 5s. Obviously that won't work; I wouldn't click on it, because I've already changed my phone, so the recommendation system wouldn't work anymore, and that would be a huge problem for Amazon. And I know very well that Reshma changes her phone every year, so if she has bought a new phone and someone is still recommending a case for the phone she had two years ago, it doesn't make sense at all. Yeah, it would only work if I had both phones at the same time, but I don't want to waste money buying a new case for my older phone.

So it won't be fair if we don't discuss the solution to these problems; Reshma, we can't leave our viewers with just the problems. And what is the solution? Hadoop. Hadoop is the solution, so let's introduce Hadoop now. So what is Hadoop? Hadoop is a framework that allows you to store big data in a distributed environment so that you can process it parallelly. There are basically two parts: one is HDFS, the Hadoop Distributed File System, for storage, which allows you to store data of various formats across a cluster; the second part is MapReduce, which is the processing unit of Hadoop and allows parallel processing of the data stored across HDFS. Now let us dig deeper into HDFS and understand it better. HDFS creates an abstraction of resources; let me simplify that for you. Similar to virtualization, you can see HDFS logically as a single unit for storing big data, but actually you are storing your data across multiple systems, in a distributed fashion. Here you have a master/slave architecture, in which the name node is the master node and the data nodes are the slaves. The name node contains the metadata about the data stored in the data nodes, such as which data block is stored in which data node and where the replicas of each data block are kept, while the actual data lives in the data nodes. I also want to add that we replicate the data blocks present in the data nodes, and by default the replication factor is 3, which means there are three copies of each data block. So Saurav, can you tell us why we need that replication? Sure, Reshma. Since we are using commodity hardware, and we know the failure rate of this hardware is pretty high, if one of the data nodes fails we would lose the data blocks it holds; that's the reason we replicate them, and the replication factor depends on your requirements.

Now let us understand how Hadoop actually solved the big data problems we discussed. Reshma, can you remember what the first problem was? Yeah, it was storing the big data. So how did HDFS solve it? Let's discuss. HDFS provides a distributed way to store big data, as we've already told you: your data is stored in blocks across the data nodes, and you can specify the size of each block. Basically, if you have 512 MB of data and you've configured HDFS to create 128 MB data blocks, HDFS will divide the data into four blocks, because 512 divided by 128 is 4, store them across different data nodes, and also replicate those blocks on different data nodes. Since we are using commodity hardware, storing is not a challenge. What are your thoughts on it, Saurav? I will add one more thing, Reshma: it also solves the scaling problem, because it focuses on horizontal scaling instead of vertical scaling. You can always add extra data nodes to your HDFS cluster as and when required, instead of scaling up the resources of your existing data nodes. So you're not increasing the resources of individual data nodes; you're just adding a few more data nodes when you need them. Let me summarize it for you: for storing 1 TB of data, I don't need a 1 TB system; I can instead store it across multiple 128 GB systems, or even smaller ones.
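Here is a small Python sketch of that block arithmetic; the hdfs_layout helper is purely illustrative (it is not part of any HDFS API) and assumes the 128 MB block size and default replication factor of 3 mentioned above:

```python
import math

# Illustrative block/replica arithmetic for HDFS-style storage
# (not the HDFS API; just the numbers discussed above).
def hdfs_layout(file_mb: int, block_mb: int = 128, replication: int = 3):
    blocks = math.ceil(file_mb / block_mb)    # how many blocks the file splits into
    replicas = blocks * replication           # total block copies kept across data nodes
    raw_storage_mb = replicas * block_mb      # upper bound on raw cluster space used
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw = hdfs_layout(512)
print(f"512 MB file -> {blocks} blocks, {replicas} replicas, "
      f"up to {raw} MB of raw cluster storage")   # 4 blocks, 12 replicas
```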
Now Reshma, what was the second challenge with big data? The next problem was storing a variety of data, and that problem was also addressed by HDFS. With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, because in HDFS there is no pre-dumping schema validation: you can just dump all the kinds of data you have in one place. It also follows a write-once, read-many model, so you write the data once and read it multiple times to find insights. And if you recall, the third challenge was accessing the data faster. This is one of the major challenges with big data, and to solve it we move the processing to the data, not the data to the processing. What does that mean? Saurav, go ahead and explain it. Sure. Consider this node as the master and these as the slaves, with the data stored in the slaves. One way of processing this data is to send it all to the master node and process it there, but what happens if all the slaves send their data to the master? It causes network congestion and input-output channel congestion, and at the same time the master node takes a lot of time to process that huge amount of data. So what I can do instead is send the processing to the data: I send the logic to all the slaves, which actually hold the data, and the processing happens in the slaves themselves. After that, only the small chunks of results are sent back to the name node, so there is no network or input-output congestion and it takes comparatively very little time. This is what sending the processing to the data actually means. I hope you are all clear with this, and I hope even Reshma is clear. All right, good to hear that.
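Real Hadoop MapReduce jobs are typically written in Java against the Hadoop API, but purely to illustrate the map, shuffle and reduce flow just described, here is a tiny self-contained Python imitation that counts words across a few "blocks" as if each were processed on the node holding it:

```python
from collections import defaultdict

# Toy imitation of the MapReduce flow described above: each "block" is mapped
# where it lives, the small intermediate results are shuffled by key, and the
# reducers combine them. (Illustrative only -- real Hadoop jobs use the Java
# MapReduce API or tools like Hive and Pig on top of it.)
blocks = [
    "big data is big",
    "hadoop stores big data",
    "hadoop processes data in parallel",
]

def map_phase(block):      # runs "on the node holding the block"
    return [(word, 1) for word in block.split()]

def shuffle(mapped):       # group intermediate pairs by key
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped): # combine the small per-key results
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle([map_phase(b) for b in blocks]))
print(counts)              # e.g. {'big': 3, 'data': 3, 'hadoop': 2, ...}
```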
Now let's move forward and look at a few components of Hadoop, the Hadoop ecosystem. You can see the entire Hadoop ecosystem here; there are a lot of tools involved. We have Flume and Sqoop, which are used to ingest data into HDFS, and we have already seen what HDFS is. Then there is one more component known as YARN, which you can consider the brain of your Hadoop ecosystem: it performs all the processing activities, allocates resources and schedules tasks. Apart from these there are many other components, so I'll just give you a brief introduction. We have Pig and Hive, which are analytics tools: Hive was introduced by Facebook and Pig by Yahoo. The language used with Pig is called Pig Latin, and with Hive we use the Hive query language, which is very similar to SQL. The story behind Hive is very interesting, so I want to share it: Facebook wanted a tool to run queries on huge chunks of data, so they introduced Hive; with its help, the same employees who knew SQL could perform analytics on huge data sets, that is, on big data. Apart from that we have Spark, which is used for near real-time processing, and for machine learning we have a component within Spark itself called MLlib, and also Mahout. When we talk about MapReduce, we know what it is: basically Java programs that process your big data. And when I talk about Apache HBase: HDFS is a file system, and HBase is nothing but a NoSQL database on top of HDFS.

Now let's focus on a few important components among these. Hadoop we already know. What is Apache Hive? Hive is a data warehousing tool that allows you to perform big data analytics using HiveQL, a language very similar to SQL, and I've already told you the story of how Facebook came to implement it. Apache Pig is again an analytics tool, used to analyze large data sets by representing them as data flows. Spark is an in-memory data processing engine that lets you efficiently execute streaming, machine learning and SQL workloads that require fast, iterative access to data sets, so for streaming and anything that needs near real-time processing, you can integrate Spark with Hadoop. And HBase is a NoSQL database that sits on top of your HDFS file system. So this is all about Hadoop and big data; now the point is how edureka can help you become a big data and Hadoop expert, so let's move forward and understand the big data and Hadoop training provided by edureka.

Big data Hadoop certification training: edureka provides a structured program to make you a certified Hadoop developer. Before I explain the structure of the program, let me tell you that edureka provides a 24/7 support team, so if you have any questions or doubts at any point of time, you can contact them. Apart from that, when you pay for the course you get access to the LMS. What is the LMS? It is the learning management system: all your class recordings, PDFs and presentations will be there in your LMS, and you get access to it for a lifetime. So even once you're done with the course and want to take it again, you can do that; if you come back after ten years and want to learn Hadoop, we will put you in a live batch. Basically, you get everything for a lifetime.

Now let's focus on the structure of the program. It starts with the basics and covers all the advanced portions of big data and Hadoop as well. In the first module you'll learn what exactly big data and Hadoop are and the various concepts, just an introductory module. Then come the concepts of HDFS and MapReduce, what the architecture looks like, and so on. In the third module you'll understand how to actually set up a Hadoop cluster and how its architecture looks. In the fourth module you'll be dealing with MapReduce programs, and in the fifth module you'll learn data loading techniques. Then comes the sixth module, where you will be introduced to analytics tools like Pig and Hive, and I told you earlier why we use them. Then comes HBase, which is the NoSQL database on top of HDFS; after that we have Oozie; and then we'll look at various best practices for Hadoop development. Then comes Spark. Let me tell you, we won't discuss too much about Spark in this course, as edureka has a separate course on Spark, but we have still included an introduction to Spark in the big data Hadoop course, and you'll also learn how to work with RDDs in Spark. Finally, you'll be working on a real-life project on big data analytics. Once you're done with the project, you will be graded on it in your certificate, and you get the certificate only when you've completed the project. We have multiple projects, and it's not like once you've finished one project you cannot take up another; you can request multiple projects and they'll definitely give them to you, but you need to finish at least one project in order to get the certification. So let's move forward, and Reshma will give you an introduction to the projects that are part of this course.

Yeah, so these are some of the projects that you can choose to work on. The first one is to analyze social bookmarking sites, and I'll tell you a little bit about it. Here you work with social media data: the data is comprised of information gathered from sites like reddit.com, stumbleupon.com and so on. These are bookmarking sites, and they allow you to bookmark, review, rate and search various links on any kind of topic. The data is in XML format and contains various kinds of links, post URLs, the different categories defining them, and the ratings linked with them. What you have to do is analyze this data in the Hadoop ecosystem: fetch the data into HDFS and analyze it with the help of MapReduce, Pig and Hive to find the top-rated links based on user comments, likes, etc.
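In the actual project this analysis is done with MapReduce, Pig and Hive over XML data sitting in HDFS; purely as an illustration of the kind of "top rated links" ranking involved, here is a tiny plain-Python sketch over made-up records (the field names and values are hypothetical):

```python
# Tiny, plain-Python illustration of the "top rated links" idea from the
# bookmarking project. The records and field names below are invented
# purely for illustration; the course project itself uses MapReduce,
# Pig and Hive over XML data in HDFS.
posts = [
    {"url": "http://example.com/a", "category": "tech",   "likes": 120, "comments": 30},
    {"url": "http://example.com/b", "category": "travel", "likes": 45,  "comments": 80},
    {"url": "http://example.com/c", "category": "tech",   "likes": 200, "comments": 10},
]

def score(post):
    # Simple popularity score: likes plus comments.
    return post["likes"] + post["comments"]

top_links = sorted(posts, key=score, reverse=True)[:2]
for post in top_links:
    print(post["url"], score(post))
```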
So that is the problem statement: you have to analyze all the data and posts on these kinds of sites and find the top posts according to the likes and comments. Similarly, we have other projects, like the customer complaint analysis, which is related to the retail industry; tourism data analysis, which deals with tourism facts; airline data analysis; the loan data set, which is related to banking and finance; and movie ratings, which deals with media data. You can choose any of these projects, give them a try and come up with a solution, and if at any point you stumble onto something and get stuck, we have our support team available 24/7; you can call anytime and they will help you.

So Reshma, I was thinking, how about we give a brief summary of the things we have discussed? Yes Saurav, that would be great, let's go ahead and provide a summary. We started with how data evolved and how big data came into existence, and we saw the various factors that led to big data. Then we focused on the five V's: in order to consider any data as big data, we need to consider these five V's. And what were those five V's, Reshma? First we saw the volume of data, then variety, velocity, value and finally veracity. All right, fine. Then we focused on big data as an opportunity and discussed quite a few examples, and I'm still unclear why people buy strawberry pop-tarts during hurricanes, but that's not the point. Don't worry Saurav, we'll find out that answer for you. All right. After those examples we saw a case study of IBM, and then we shifted our focus to the problems associated with big data: obviously it is an opportunity, but in order to encash that opportunity you need to come up with a solution to all those problems. And what was the solution, Reshma? Well, the solution was Hadoop, and we saw how HDFS and MapReduce are used to solve those problems. And finally we discussed the Hadoop curriculum at edureka, the kind of projects you can choose, and everything you'll be learning in this course. All right, fine. So with this we come to the end of today's session. Thank you, Reshma, for joining us; it was a pleasure having you in today's discussion. Thank you, Saurav, I enjoyed it a lot as well. All right guys, this video will be uploaded to your LMS, so you can go through it, and if you have any questions you can ask our 24/7 support team or bring your doubts to the next session. And let me tell you, this was just an introductory video to big data and Hadoop; the real course will start from the next session. Thank you, and have a great day. I hope you enjoyed listening to this video; please be kind enough to like it, and comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to our edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 1,242,695
Keywords: yt:cc=on, big data, Big Data Tutorial, Big Data Tutorial for beginners, Big Data Introduction, learn big data, What is Big Data, Big Data Training, Big Data Training for beginners, hadoop, Hadoop tutorial, hadoop training, big data training videos, Big Data Hadoop, Big Data Tutorial Edureka, edureka, Big Data Hadoop Tutorial For Beginners, Big Data Analytics Tutorial, big data edureka, Hadoop tutorial for beginners, Edureka big data
Id: zez2Tv-bcXY
Length: 42min 33sec (2553 seconds)
Published: Tue Apr 25 2017