Dealing With Big Data - Computerphile

Captions
Everybody's talking about big data. Many different fields, like finance, physics, biology and medicine, are all talking about big data and all the things you can do with it. But my question is: are they really doing big data? In a previous video, one of my students was talking about what big data is and trying to define it, and one of the main issues we have at the minute is that there isn't a standard definition of what big data actually is. Another issue is the many phases that big data may have. For example, you might be thinking about how to acquire the data, so installing different sensors, working out how to get that data and how to store it. A different aspect is the computational infrastructure and how to keep it secure. The area I'm more interested in is how to analyze it, how to mine it, how to extract knowledge from that big data using machine learning and data science techniques.

How much data is big data? How many megabytes, gigabytes, terabytes? It's actually really difficult, almost impossible, to answer that question. There isn't a standard definition, but there is one definition I want to read out, because in my opinion it's quite good: big data involves data with volume, diversity and complexity that requires new techniques, algorithms and analysis to extract valuable knowledge, which is typically hidden. So if you have a really big spreadsheet, let's say of COVID cases, with however many columns, that's not quite big data. If you're using a very cool machine learning algorithm from the scikit-learn library in Python, it might not be big data either. So is it 20 megabytes? Is it one terabyte? It all depends on what you need to do with that particular data set. You might be using a very simple algorithm to do a count, and for that you might have a really huge data set and still do it with your laptop, so it might not even be big data. But if you need to run a genetic algorithm to optimize the parameters of a deep learning algorithm, the data cannot be that big, because you are not going to be able to do it with one computer. So, simply put, big data in my opinion is when you need more than one computer to be able to deal with the data.

It's also very important to remember that what we call big data today might be almost nothing tomorrow. Computers are evolving, we have more and more memory every day and the ability to process and compute more, faster. You might think today that one gig of data is a lot, but tomorrow that may be almost nothing. So the definition cannot be just a precise amount of data. It's not about the volume; it's much more than that. It's about what kind of analysis you need to do and when you need the answer.

So, Sean, how much data do you think we have at the minute in the world? "In the world? I thought you meant on this SD card that I've been recording on. What comes after tera, is it peta?" And after that you get more and more and more, so I think there are going to be a lot of zeros on there. At the beginning of 2020 the digital universe was estimated to be 44 zettabytes of data. I'm not so sure how accurate that figure is, but what it means is that we are collecting data all the time.
Somehow we are a little bit under the impression that having more data, having data at all, is the new goal, and everybody wants it. But is that really something we need? Do we really need to collect data just for the sake of it? That might be something for another video: how sustainable is it to keep storing data just for the fun of it? What I want to do today is say, okay, let's say we want to store data: how do we do it, and how can we analyze that data? It's about the quality of the data, not the quantity. Sometimes the problem is we don't know where the quality is, and therefore we need to gather the data, collect it, and then use pre-processing techniques to shrink that data down to just the part that is of good quality. Some people call that smart data. "It's like panning for gold, isn't it?" Yeah, exactly.

All right, so how do we deal with big data? Let's say, Sean, that you've got a data set, you want to run some machine learning, and you've got your computer at home. You're a very lucky guy: you've got 16 gigs of RAM and a one-terabyte drive, and you're running your favorite machine learning algorithm. Let's say you're running random forest, trying to do some analysis with your data, and all of a sudden what you get is: out of memory. What do you do? "Well, you restart. Oh, sorry, I've been using Windows too long." That's the problem: if you're using Windows, that's your own mistake, so you should be using something else to make sure you don't get that message. But let's assume that's not the problem. A good computer scientist would probably say, I'm going to try to optimize the memory utilization of my algorithm, but you're probably using some library to run this random forest, so you don't really know the implementation details of that particular library. So you try to run it and say, damn it, I think I'll need a little bit more memory. So you go and buy some more: I'm going to get 64 gigabytes of RAM. Then you run it, everything goes well, happy days, everybody's happy. This is what we call scale-up, or vertical scalability.

Scale-up, vertical scalability, is really good when it works. If you can simply buy a bit more memory, and you have enough money, that's all good. But there is a limit, and you will not be able to go above that limit. What if later on that data set becomes even bigger, you try again, and it's out of memory again? Are you going to go and buy 128 or 256 gigabytes? If you're rich, maybe you want to try, but you might not be able to do it. Whenever possible, though, if you can stick to this approach, then by all means scale up; that's what we want to do, because it will also be more efficient in terms of energy consumption and everything else. So what is the solution if we cannot do it with one single computer? "Well, that's when you bring in a second, or third, or fourth." Exactly. Rather than using one single computer, you're going to use a few of them. What do we call this? As opposed to scale-up, this is going to be scale-out, or horizontal scalability: we're going to try to spread the computational cost of whatever analysis, whatever algorithm you're running, across a number of computers.
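To make the starting point concrete, here is a minimal sketch of that single-machine scenario, assuming scikit-learn and a synthetic data set (the sizes, features and labels are invented for illustration, not taken from the video). A script like this runs happily on a 16 GB machine, but grow the array by an order of magnitude and fit() is exactly where the out-of-memory error tends to appear.

```python
# A hypothetical single-machine run: fine until the data outgrows RAM.
# The data set here is synthetic and the sizes are purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend this is your big spreadsheet: 1 million rows, 50 columns (~400 MB).
X = rng.normal(size=(1_000_000, 50))
y = (X[:, 0] + rng.normal(scale=0.5, size=1_000_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# n_jobs=-1 uses every core of this one machine: scale-up in miniature.
# Make X ten times bigger and .fit() will typically die with a MemoryError.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

Nothing in this script knows about more than one machine; once you hit the memory wall you either buy more RAM (scale-up) or redesign the computation to be distributed (scale-out).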
Scale-out is basically a divide-and-conquer approach, and you might be thinking, oh, that's fantastic, so what's the problem? It's brilliant, right? Yeah, but there are a few issues you need to bear in mind. The first thing is this: when you're using one single computer, you can use your random forest from the scikit-learn library, or from whatever programming language you're using, straight away. But if you've got a few computers, you need to remember they are all independent. They all have their own hard drive or solid state drive, they have their own RAM, and they can all help to solve the analysis that you want to do, but you need to explicitly, in some way, design that distributed computation. And that is big data: whenever we need to distribute the computing, we need to think about how to use multiple computers to do it. The problem is that you've got to use a network for communication across those computers, and that is actually really expensive. Network equipment is expensive, and it also sometimes becomes the main bottleneck, because we're going to be moving data around all the time. Another issue, and this scale-out doesn't come for free, is space: you need more computers, and also energy. You might be thinking they're just standard desktop computers, but they're not; they're normally racks of computers. And you might be thinking, Isaac, you're talking about HPC, right, high performance computing? Yes, that's what I'm talking about. You can put all of them together, you have to put in the network, and it's going to be more costly, but it allows you to do quite a few things. The first thing it allows is that, if the data keeps growing, you simply add one more machine to the mix, and you're able to cope with more and more data as you go, which is good, and it's going to be cheaper than upgrading your own computer.

There is one more thing that is quite interesting, from the big data point of view, about doing a scale-out. Imagine you have a really powerful machine and you're running your deep learning algorithm there, but it takes a while because it's a lot of data. Let's say it takes 18 days to get it done, and after day 17 something goes wrong, it crashes, and you need to start again from the very beginning. If you had designed a distributed solution to run the same algorithm, then if one of these computers crashes you can still restart that operation on another one and manage to get the result faster than with one single computer. That's what we call, in big data, the ability to deal with faults: fault tolerance. It is one of the key aspects we want to keep in mind when we are designing distributed solutions for big data.
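A toy sketch of that idea in plain Python, with a process pool standing in for separate machines and a deliberately crashed worker (the chunk count, the failing chunk and the per-chunk computation are all invented for illustration): because the job is divided into independent chunks, only the failed chunk is recomputed rather than the whole run.

```python
# Toy divide-and-conquer with fault tolerance. A process pool stands in for
# separate machines; one "machine" is made to crash so we can redo only its
# chunk instead of restarting the whole job.
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk_id, chunk, fail=False):
    """Stand-in for the real per-chunk analysis (here just a sum of squares)."""
    if fail:
        raise RuntimeError(f"machine holding chunk {chunk_id} crashed")
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_200_000))
    n_chunks = 6
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]

    results = {}
    with ProcessPoolExecutor() as pool:
        # First pass: pretend the machine holding chunk 3 dies mid-run.
        futures = {cid: pool.submit(process_chunk, cid, chunk, fail=(cid == 3))
                   for cid, chunk in enumerate(chunks)}
        failed = []
        for cid, future in futures.items():
            try:
                results[cid] = future.result()
            except RuntimeError:
                failed.append(cid)

        # Fault tolerance: recompute only the chunks whose "machine" failed.
        for cid in failed:
            results[cid] = pool.submit(process_chunk, cid, chunks[cid]).result()

    print("chunks recomputed:", failed)
    print("total:", sum(results.values()))
```

Frameworks like Hadoop and Spark do this bookkeeping for you, tracking which tasks failed and re-scheduling them on healthy machines.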
All right, so you might be saying, Isaac, what you're talking about here is high performance computing, right? And that's no surprise: you're dealing with loads of data, and therefore you really need multiple computers to do it. That's very true, but traditionally in HPC you're going to have multiple computers, each one of them with their own HDD or solid state drive, their own RAM and their own operating system, typically a Linux server. They are connected through a network, and there is also one more network that gives you a central storage. Typically, what you normally had in traditional HPC is a relatively small input data set, that small square I'm drawing here. What happens is that it gets loaded into main memory on all of those computers, and then they do some sort of computation. Imagine, for example, an optimization algorithm trying to solve some big optimization problem, or a simulation, something like that. They use InfiniBand to share those computations across computers and solve the bigger problem. But what happens if, instead of this small piece of data, you've got loads of it? What do you think is going to happen, Sean? "Well, that data's going to get shifted around a lot." Yeah. The problem with big data, which I also like to call data-intensive applications, is that they're going to spend most of their time reading and writing. They will need to read from the central storage and load the data over here, then here, then here, and so on. What's going to happen is that the network will not be able to cope with the amount of data moving across it. So when you have a big data application, a traditional HPC cluster will not do the trick. We need something else. What do we need? "Bigger hard disks?" Well, we do actually have a really huge central storage, which is normally really expensive. So what is the solution? We've got HDDs, or if you're lucky solid state drives, in every machine, so what we can do is try to make better use of those. That's what turns an HPC cluster into a big data cluster.

A big data cluster looks exactly the same, but now we're going to use the hard drives I'm representing down here. We're going to have multiple computers, I'll just draw three of them, each with their own hard drive, and what you do is spread the data across the different computers. Imagine a really big data set. The hello world of big data is always counting the words in a very big file. So imagine I have a big file and I'm going to split it into, let's say, six pieces. Here I'm going to have the first chunk, the third one, the fourth, the fifth, and chunk one again. What you want is to keep all the computation happening locally on those drives, so you want that big data set spread across a number of computers rather than sitting in a central storage connected by a network; that is not a good idea in big data, you want to keep it local. And notice something I did on purpose: I have this number one here and this number one here, representing the same chunk of the same file. Why do you think I did that? "Was it for fault tolerance?" Exactly, exactly. Imagine that this machine crashes; we want to make sure this piece of the data is still available somewhere, so you can still process it. So this is basically a way to create a distributed file system, a file system that works transparently, and by that I mean you don't really need to care about the details underneath. A machine simply says, I want to access chunk one, and it accesses it and has it locally. That is what we call, in big data, the principle of data locality: try to keep the data you're going to use locally on every single machine.
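The chunk layout on the whiteboard can be mimicked with a toy placement policy: split a file into numbered chunks and store each chunk on more than one machine, so that no single failure loses data. The node names, chunk count and replication factor below are invented for illustration; real distributed file systems such as HDFS do something similar, with far more sophistication.

```python
# Toy model of a distributed file system's placement policy: every chunk of a
# big file lives on more than one machine, so any single machine can fail
# without losing data. Names, counts and the replication factor are illustrative.
from itertools import cycle

def place_chunks(n_chunks, nodes, replication=2):
    """Assign every chunk to `replication` distinct nodes, round-robin style."""
    placement = {}
    node_cycle = cycle(range(len(nodes)))
    for chunk_id in range(n_chunks):
        start = next(node_cycle)
        placement[chunk_id] = [nodes[(start + r) % len(nodes)]
                               for r in range(replication)]
    return placement

nodes = ["machine-1", "machine-2", "machine-3"]
placement = place_chunks(n_chunks=6, nodes=nodes, replication=2)
for chunk_id, owners in placement.items():
    print(f"chunk {chunk_id} stored on {owners}")

# If machine-1 dies, every chunk it held still exists on another node.
survivors = {cid: [n for n in owners if n != "machine-1"]
             for cid, owners in placement.items()}
assert all(survivors.values()), "a chunk would have been lost"
```

With a layout like this, the scheduler can send each piece of the computation to a machine that already holds the chunk it needs, which is the data-locality principle in action.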
That's what distinguishes a big data cluster from a standard HPC cluster, and it's the key thing that makes it work. It is also the main motto of the MapReduce paradigm, which Rebecca explained in another video: the idea that moving the computation is cheaper than moving the data. What you want to do is apply an operation across all of the data; how it happens, and which machine is doing it, you don't really care about. You simply want to run your analysis and get it done quickly and efficiently, and for that you want to avoid input and output data going through the network. That's what you do using big data technologies like Apache Hadoop and Apache Spark, which we will see in future videos.

...It would put this into what's called a key-value pair, so we're going to take each word as the key; for each word within this, we'll map it so that the word is the key, and then we put the number... ...function through the use of three Gaussians, giving me these three peaks in the distribution, showing...
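Since the word-count "hello world" keeps coming up, here is a single-machine simulation of the key-value idea previewed above: a map phase that emits (word, 1) pairs, a shuffle that groups pairs by word, and a reduce phase that sums them. The two text chunks are invented; on Hadoop or Spark the same three phases run in parallel across the chunks each machine holds locally.

```python
# Single-machine simulation of the MapReduce word count.
# Map emits (word, 1) key-value pairs, shuffle groups them by key,
# reduce sums the counts. Real frameworks run the map over each locally
# stored chunk and only move the small (word, count) pairs around.
from collections import defaultdict

text_chunks = [
    "big data is not just a lot of data",
    "a lot of data is not always big data",
]

# Map phase: one (word, 1) pair per word, done independently per chunk.
mapped = [(word, 1) for chunk in text_chunks for word in chunk.split()]

# Shuffle phase: group all pairs that share the same key (the word).
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: sum the counts for each key.
word_counts = {word: sum(ones) for word, ones in grouped.items()}
print(word_counts)
```

For reference, the canonical PySpark version of the same pipeline is roughly `sc.textFile(path).flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.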
Info
Channel: Computerphile
Views: 58,480
Rating: 4.9002643 out of 5
Keywords: computers, computerphile, computer, science, Computer Science, University of Nottingham, Isaac Triguero, Big Data, Mapreduce, HPC, Big Data Cluster, Apache
Id: v6NSdySahWc
Length: 16min 15sec (975 seconds)
Published: Fri Aug 13 2021