What is Big Data? - Computerphile

Video Statistics and Information

Captions
Today we're going to be talking about big data. How big is big? Well, first of all, there is no precise definition as a rule. The standard thing people would say is: it's big data when we can no longer reasonably deal with the data using traditional methods. So then you think, what's a traditional method? Well, it might be: can we process the data on a single computer? Can we store the data on a single computer? If we can't, then we're probably dealing with big data, and we need new methods to be able to handle and process it. As computers get faster, with bigger capacities and more memory, the concept of what counts as big keeps changing. But a lot of it, as I'll talk about later, isn't really about how much power you can get in a single computer; it's more about how we can use multiple computers to split the data up, process everything, and then bring it back together, like in the MapReduce framework.

Within big data there's something called the five Vs, which captures some features and problems that are common to any big data system. The first three, I think, were defined back in 2001, and the others have been added since. First of all we've got volume. This is the most obvious one: it's simply how large the dataset is. The second one is velocity. A lot of the time these days, huge amounts of data are being generated in a very short amount of time. Think of how much data Facebook is generating: people liking things, people uploading content, constantly, all throughout the day. The amount of data they generate every day is just huge, so they need to process it in real time. The third one is variety. Traditionally we would store data in a single database in a very structured format: columns and rows, with every row having values for those columns. These days data is coming in in a lot of different formats, so as well as the traditional structured data we have unstructured data: web clickstreams, social media likes, images, audio and video. We need to be able to handle all these different types of data and extract what we need from them.

The fourth one is value. There's no point in collecting huge amounts of data and then doing nothing with it, so we want to know what we want to obtain from the data and then think of ways to go about getting it. One form of value could just be getting humans to understand what is happening in the data. For example, if you have a fleet of lorries, they will all have telematics sensors in them collecting data about what the lorries are doing, and it's of a lot of value to the fleet manager to be able to easily visualise the huge amounts of data coming in and see what is happening. So as well as processing and storing this stuff, we also want to be able to visualise it and show it to humans in an easily understandable format. Another form of value is finding patterns in all of this data with machine learning algorithms.

The fifth and final one is veracity. This is basically how trustworthy the data is, how reliable it is. We've got data coming in from a lot of different sources, so is it being generated with statistical bias? Are there missing values? If we think of the sensor data, for example, we need to realise that maybe some sensors are faulty and giving slightly off readings. So it's important to understand how reliable the data we're looking at is (there's a small sketch of that kind of check below).

These are the five standard features of big data. Some people try to add more: there are definitions with seven Vs of big data, even ten, and I'm sure the number will keep going up. They add things like vulnerability, because when we're storing a lot of data, a lot of it is quite personal data, so we need to make sure it's secure. But these five are the main ones.
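As a quick illustration of the veracity point, here is a minimal sketch (not from the video) of a data-quality check in plain Python; the field names, the readings and the plausible speed range are all hypothetical.

```python
# A minimal sketch of a veracity check on hypothetical lorry telematics
# readings: flag missing values and readings a faulty sensor could have produced.
readings = [
    {"lorry_id": "A1", "speed_mph": 56.0, "heading_deg": 182.0},
    {"lorry_id": "A2", "speed_mph": None, "heading_deg": 90.0},   # missing value
    {"lorry_id": "A3", "speed_mph": 412.0, "heading_deg": 45.0},  # implausible speed
]

def is_suspect(reading, max_speed_mph=120.0):
    """Return True if the reading is incomplete or outside a plausible range."""
    if any(value is None for value in reading.values()):
        return True
    return not (0.0 <= reading["speed_mph"] <= max_speed_mph)

suspect = [r for r in readings if is_suspect(r)]
print(f"{len(suspect)} of {len(readings)} readings look unreliable")
```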
The first thing about big data, obviously, is just the sheer volume, and one way of dealing with that is to split the data across multiple computers. You could think, okay, we've got too much data to fit on one machine, we'll just get a more powerful computer: more CPU power, larger memory. That very quickly becomes quite difficult to manage, because every time you need to scale up again, because you've got even more data, you have to buy a new computer or new hardware. So what tends to happen instead is that companies have a cluster of computers. Rather than a single machine they'll have, say, a massive warehouse with loads and loads of computers, and what that lets us do is distributed storage: each of those machines stores a portion of the data. We can also split the computation across those machines, so rather than having one computer going through, I don't know, a billion database records, you can have each computer going through a thousand of those records (there's a small sketch of that idea below).

Take a really naive way of doing it: say we do it alphabetically. A load more records come in for, say, Z; that's easy, stick them on the end. Then a load more records come in for P, which is somewhere in the middle. How do you manage that?

There are computing frameworks that help with this. For example, if you're storing data in a distributed fashion there's the Hadoop Distributed File System, which manages the cluster resources and where the files are stored. Those frameworks also provide fault tolerance and reliability: if one of the nodes goes down you've not lost that data, because there will have been some replication across other nodes, so losing a single node isn't going to cause you a lot of problems. What using a cluster also allows you to do is scale easily: whenever you want to scale up, all you do is add more computers to the network and you're done, and you can get by on relatively cheap hardware rather than having to keep buying a new supercomputer.
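To make the distributed-computation idea concrete, here is a minimal PySpark sketch (not from the video); the HDFS path, the column names and the cluster itself are all assumptions. Spark splits the file into partitions, each worker processes the partitions it holds, and only the small aggregated result comes back to the driver.

```python
# A minimal sketch of distributed processing with PySpark; the HDFS path and
# column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fleet-telemetry-example").getOrCreate()

# The file lives on a distributed file system, so each node reads and
# processes the blocks it stores locally where possible.
readings = spark.read.csv("hdfs:///data/lorry_telemetry.csv",
                          header=True, inferSchema=True)

# Average speed per lorry, computed in parallel across the cluster.
avg_speed = readings.groupBy("lorry_id").agg(F.avg("speed").alias("avg_speed"))

avg_speed.show()  # only the small aggregated result returns to the driver
spark.stop()
```

The same job runs unchanged whether the cluster has three nodes or three hundred; scaling up is a matter of adding machines, as described above.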
In a big data system there tends to be a pretty standard workflow. The first thing you want is a way to ingest the data. Remember, we've got a huge variety of data coming in, all from different sources, so we need a way to aggregate it and move it on further down the pipeline. There are frameworks for this, for example Apache Kafka and Apache Flume, and loads of others as well; basically they aggregate all the data and push it on to the rest of the system.

The second thing you probably want to do is store that data; like we just spoke about, with a distributed file system you store it in a distributed manner across the cluster. Then you want to process the data, and you may skip storage entirely: in some cases you may not want to store your data at all, you just want to process it, use it to update some machine learning model somewhere, and then discard it, because you don't care about long-term storage. Processing the data is again done in a distributed fashion, using frameworks such as MapReduce or Apache Spark.

Designing the algorithms to do that processing requires a little more thought than a traditional algorithm. The frameworks will hide some of it, but you need to remember that even if we're working through a framework, we've still got data on different computers, and if we need to share messages between those computers during the computation, it becomes quite expensive to keep moving a lot of data across the network. So it's about designing algorithms that limit data movement, and that's the principle of data locality: keep the computation close to the data, don't move the data around. Sometimes it's unavoidable, but we limit it.

The other thing about processing is that there are different ways of doing it. There's batch processing: you already have all of your data, or whatever you've collected so far, you take all of that data across the cluster, you process all of it, you get your results, and you're done. The other thing we can do is real-time processing. Because of the velocity the data is coming in at, we don't want to constantly take all the data collected so far, process it, get results, and then, when a ton more data has arrived, bring everything back and process all of it again. So instead we do real-time processing: as each data item arrives, we process it. We don't have to look at all the data we've got so far; we just incrementally process everything as it arrives (there's a small sketch of that incremental style below). That's coming up in another video, when we talk about data streaming.

The other thing you might want to do before processing is something called pre-processing. Remember I talked about unstructured data: that might mean getting the data into a format we can actually use for the purpose we want, so that would be a stage in the pipeline before processing. Also, with huge amounts of data there's likely to be a lot of noise and a lot of outliers, so we can remove those.
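Here is a minimal sketch (not from the video) contrasting the batch and incremental styles just described, in plain Python; the stream of speed readings is hypothetical.

```python
# A minimal sketch contrasting batch processing with incremental,
# streaming-style processing. The "stream" is just a generator of
# hypothetical speed readings arriving one at a time.
def speed_stream():
    for speed in [54.0, 56.5, 60.2, 58.1, 55.7]:
        yield speed

# Batch style: collect everything first, then compute over the whole dataset.
all_speeds = list(speed_stream())
batch_mean = sum(all_speeds) / len(all_speeds)

# Streaming style: update a running mean as each item arrives, keeping only
# a count and a running total rather than the whole dataset.
count, total = 0, 0.0
for speed in speed_stream():
    count += 1
    total += speed
    running_mean = total / count  # up-to-date result after every item

print(batch_mean, running_mean)  # both give the same final answer
```

The streaming version never holds the whole dataset in memory, which matters when data arrives faster than you could reasonably re-process it from scratch.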
We can also remove redundant instances. An instance is, say, a single row in a database; for a sensor reading it would be everything recorded for that lorry at that point in time: its speed, the direction it's travelling in, and so on. If we're feeding a ton of instances into a machine learning algorithm, a lot of them will be very, very similar. Reducing the number of instances is about reducing the granularity. Part of it is saying: rather than storing data for a continuous period of time, so every minute for an hour, if those readings are very similar across that hour we can just say, okay, for this period this is what happened, and put it in a single line. Or, for a machine learning algorithm, if there are instances with very similar features and a very similar class, we can take a single one of those instances and let it suitably represent all of them. That way we can very quickly reduce a huge dataset down to a much smaller one, by saying there's a lot of redundancy here and we don't need a hundred very similar instances when one would do just as well.

If you've got a hundred instances and you reduce them down to one, does that not have an impact on how important those instances are in the scheme of things?

Yes, and there are techniques that deal with this. Some of them would just purely say, okay, this is now a single instance, and that's all you ever know. Others would keep a weighting: some way of saying this one is more important because it's very similar to a hundred others that we got rid of, while this one is not as important because there were only a few others similar to it. So we can weight instances to reflect their importance (there's a small sketch of that weighting idea after this transcript).

There are specific frameworks for big data streaming as well. There are technologies such as the Spark Streaming module for Apache Spark, or newer ones such as Apache Flink, that can be used to do that. They abstract away the streaming aspects so you can focus on just what you want to do, rather than thinking about all of this data coming through very fast.

Obviously my limited brain thinks streaming relates to video, but you're talking about data that is arriving in real time. Is that right?

Yes. Going back to the lorries: as they're driving down the motorway they may be sending out a sensor reading every minute or so, and as each reading comes back we get all the sensor readings from all the lorries arriving as a data stream.

So that's a very quick roundup of the basics of big data, and there are a lot of applications of this, obviously. Banks will have huge volumes of transaction data; they can extract patterns of value from that, see what is normal, and do fraud detection on it. There's the earlier example of fleet managers understanding what is going on. Basically any industry will now have ways of extracting value from the data it has. In the next video we're going to talk about data stream processing, and more about how we actually deal with the problems that real-time data can present us with over very, very large volumes.

This kind of computation is a lot more efficient if you can distribute it, because doing this map phase of saying, okay, this is one occurrence of the letter A, that's independent of anything else.
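As a last illustration, here is a minimal sketch (not from the video) of the instance-reduction-with-weighting idea in plain Python; the features and the rounding-based notion of "very similar" are hypothetical.

```python
# A minimal sketch of reducing redundant instances while keeping a weight that
# reflects how many originals each surviving representative stands in for.
from collections import Counter

instances = [
    (55.9, "north", "normal"),
    (56.1, "north", "normal"),
    (56.0, "north", "normal"),
    (23.4, "east", "congested"),
]

def key(instance):
    speed, heading, label = instance
    # Treat instances as "very similar" if they share a heading and label
    # and their speeds round to the same whole number.
    return (round(speed), heading, label)

groups = Counter(key(i) for i in instances)

# One representative per group, weighted by how many instances it replaces.
weighted = [(rep, weight) for rep, weight in groups.items()]
print(weighted)  # [((56, 'north', 'normal'), 3), ((23, 'east', 'congested'), 1)]
```

A learning algorithm that accepts sample weights can then treat the surviving representative of three near-identical readings as three times as important as a one-off reading.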
Info
Channel: Computerphile
Views: 202,623
Keywords: computers, computerphile, computer, science, Computer Science, University of Nottingham, Rebecca Tickle, Big Data, MapReduce, Cluster, Distributed Computing
Id: H4bf_uuMC-g
Length: 11min 52sec (712 seconds)
Published: Wed May 15 2019