Data Valuation Explained

Captions
What you see here is one of the simplest predictive tasks from statistics and machine learning: a linear regression model. We have a set of points which we try to describe by a single linear function, the green line in this case, and we try to find the best line that minimizes the deviations between the line and the data points. Usually, the objective function for this minimization problem is the sum of squared errors. A single linear regression is just a static picture, a single snapshot, so this animation is made of a series of regression models for different sets of points: we add new points to the data set one by one, up to 50 points in total, and we watch how the linear regression model behaves as we get larger and larger data sets.
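As a companion to the animation, here is a minimal sketch of the incremental refitting loop behind it; the synthetic data, the noise level, and the true line are my own assumptions, not details taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy points around an assumed "true" line y = 2x + 1.
x_all = rng.uniform(0.0, 10.0, size=50)
y_all = 2.0 * x_all + 1.0 + rng.normal(0.0, 2.0, size=50)

# Re-fit the ordinary least-squares line each time a point is added,
# mirroring an animation that grows the data set one point at a time.
for n in range(2, 51):
    slope, intercept = np.polyfit(x_all[:n], y_all[:n], deg=1)  # minimizes the sum of squared errors
    if n % 10 == 0:
        print(f"n={n:2d}: y = {slope:.3f} x + {intercept:.3f}")
```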
Solving linear regression models is not a very interesting or novel topic in itself; there are many books and tutorials about it available on the internet. However, linear regression is a perfect example to illustrate another concept: data valuation. My name is Andre, I am a researcher studying power systems, but today I want to discuss data valuation and data marketplaces. I want to demonstrate how these concepts work and why this new, emerging field of research is impacting all other fields.

So let's start with the question of what data valuation is. The point is that when we apply some data-driven method or machine learning approach, we usually assume that we already have a lot of data and can use all of it to train our model and solve our predictive task. Unfortunately, in reality we often do not have all the data we need, and we have to buy it: we need to gather different data sets and then understand how much we can pay for them. To do that, we need to perform data valuation, that is, to understand the value of each data set. This is actually quite an unusual, outside-the-box question, because we are not only asking how we can use this data to build data-driven models; we are also asking how we can get more data, how we can incentivize different players and agents to provide high-quality data, what the value of data actually is for our specific task, and how we can calculate the payments for these data sets. Now, in 2024, this is becoming a very hot topic: there are dozens of new manuscripts uploaded to arXiv every week, there are new grant applications to explore the topic, and I have even seen startups, small IT companies, that help big companies interpret their data sets and understand how to trade their data.

Many papers have already proposed different designs for data marketplaces and mechanisms for data valuation, but most of them share the same idea: we have data buyers and data sellers, and we need to somehow estimate the value of data to calculate the payments to the data sellers. To illustrate the main idea of data marketplaces, I want to show you one picture. This is a figure from quite an impactful paper by MIT researchers published in 2019, the paper by Agarwal, Dahleh, and Sarkar titled "A Marketplace for Data: An Algorithmic Solution". In this paper the authors discuss how to design data marketplaces and what the pricing mechanism should be, and the dynamics of such a market are visualized like this. Even though many components are proposed in the paper, the main idea is simple: on the left we have buyers, who have certain prediction tasks and need to buy data to improve them; on the right we have sellers, who have data sets they would like to sell at the marketplace. To perform data valuation, there is a pricing mechanism connected to the marketplace: buyers submit their predictive tasks, sellers submit their data sets, and the marketplace estimates the payments for useful data sets.

Even though the main idea of data marketplaces is very simple (we just sell data sets, right?), it is not easy to implement them in reality. There are several fundamental problems and challenges in data valuation, and again using this paper by Agarwal, Dahleh, and Sarkar, I want to show you the major ones. The authors say that the challenges in creating such markets stem from the very nature of data as an asset. The first problem is that data is freely replicable: if you have a data set, you can sell it once, then make a copy of it, modify it a little, and sell it again and again. Data is almost infinite, so how can we price something that can be copied many times? This is a major fundamental problem of data markets and data valuation. The second challenge is that the value of data is inherently combinatorial, due to its correlation with other data. Let me explain: the value of a certain data set depends on how many other data sets we have and on how it behaves in combination with them, so we cannot understand the value of a data set before we analyze it in combination with other data sets. This is a very unusual attribute of data markets, because we do not see it in other markets or other economic problems. The last challenge the authors mention is that the value of data cannot be estimated a priori, before applying it to a certain predictive or machine learning task. Again, this is unique to data markets: in existing traditional markets we usually know the value of the products or assets we are going to sell or buy. Say we are buying food or electricity: we know that one megawatt-hour has a certain cost, a certain price, and a certain value, that we need this megawatt-hour to run some equipment, that we will earn some profit, and so on. In the case of data, we cannot say what the value of a data set is before we sell it and apply it to a machine learning task or something else; only then does its true value reveal itself. All of these challenges of data valuation are very important, and that is why economists, mathematicians, and data scientists are now working on how to perform data valuation effectively and how to develop these data marketplaces.
Before coming to examples and theory of data valuation, I want to mention that quite similar problems are already being studied in machine learning, except that some authors call this topic not data valuation but interpretable machine learning. There are many books and papers in which scientists explore how to make machine learning interpretable, and they are actually studying similar concepts, because they want to see how valuable or effective certain parts of machine learning models are, or how valuable certain data sets are for training those models. So keep in mind that different terminologies are used: some scientists call this problem data valuation, others call it interpretable machine learning, but the topics are closely related, and many useful papers have already been published in this field. I will share some useful links in the description.

As I mentioned, there are many resources about data valuation, but what I have noticed is that when people discuss data valuation and data marketplaces, they usually discuss very complex models, sometimes complex mathematics and theories, and it is difficult for many people to get involved in the topic: it seems that you have to be smart enough, at the level of those complex models, to even start studying data valuation, which is not true. Instead, in this video I want to give a gentle and very simple introduction to data valuation and show you why it is a useful and fascinating topic. I will use very simple illustrative examples that I coded myself; I believe these examples are too simple to even be published in a journal paper. Based on them, I will show you how data valuation actually works, and I hope this will be useful in your research and your projects.

All right, let's get started. First I want to discuss a very simple, intuitive solution to data valuation. We said that we have certain data sets; in the context of linear regression, we have points that we want to approximate, and we want to understand the value of certain data sets or certain points. A simple solution could be to discard one of the points, just remove it from the data set, and check what happens to our machine learning task; that change would be the impact, the value, of this data point, and we could then judge the payment for this data. This approach is actually used in the literature, where it is sometimes called leave-one-out, and many papers have been published about similar approaches. For example, a very famous paper from 2016 called ""Why Should I Trust You?": Explaining the Predictions of Any Classifier" developed a method named local interpretable model-agnostic explanations (LIME). This method is close to leave-one-out in spirit: the authors remove certain features or certain data and evaluate what the impact, the value, of this data actually is. The paper became quite popular; since 2016 it has been cited more than 15,000 times. So one might think: why do we need any other theory here? We can just use the leave-one-out approach, discard certain features or data sets, and directly calculate their value, their impact. But I want to show you that there is much more complexity to it: the leave-one-out approach may not work for data valuation, and it may not be enough for developing data marketplaces.
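Before the demonstration, here is a minimal sketch of the leave-one-out idea in the regression setting; scoring a point by the change in the sum of squared errors when the line is re-fitted without it is my own choice of value function, made for illustration.

```python
import numpy as np

def sse(x, y):
    # Sum of squared errors of the best least-squares line through (x, y).
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

def leave_one_out_values(x, y):
    # Value of point i = change in the objective when point i is removed.
    base = sse(x, y)
    idx = np.arange(len(x))
    return [base - sse(x[idx != i], y[idx != i]) for i in idx]
```

Run on a data set with a few dozen points, these values all come out close to zero, which is exactly the effect the next animation demonstrates.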
So what I did for this video is create another animation of linear regression, but this time the regression model does not see the first data point of the data set; for that model, the first point is simply discarded. Let's see if we can estimate the value of the missing data point, how valuable it is for our task. The new animation looks like this: we see data points appearing on the screen, and the green line corresponds to the previous animation at the beginning of the video; it is the linear regression that sees all the data points on the screen, the most accurate solution here. Now we also have another line, the dashed purple one, which corresponds to the linear regression that does not see the first data point. You can see that data point in the middle of the screen as an orange circle; all other points have a purple cross mark, which means they are included in the second linear regression model as well. So the difference between these two regressions is just one data point.

Can we estimate the value of this data point? It depends on how many points we already have in the data set. Once the data set has more than 15 or 20 points, the two lines, the green one and the dashed purple one, are basically the same: we cannot see much difference between them, so there is no visible impact of the missing data point, and its value is marginally small, close to zero. In this way, by removing this single data point, we showed that its value for the linear regression is extremely small: this data point is not valuable. But if we repeat this simulation for the other points, removing them from the data set one by one, we will see that their values are also close to zero. It seems that no single point in this linear regression problem is valuable for the solution. The question, then, is which points are valuable, and how can we pay the data providers if none of the points is valuable? Well, each point is actually quite valuable if we consider small combinations of data points. At the beginning of the animation, when we have just a few points on the screen, there is a huge difference between the new purple line and the original green line, which means the missing data point has a lot of impact on the outcome of the regression model. The impact can be positive or negative, and we can measure it to estimate payments for this data point.

Now let me highlight a very important principle in data valuation: some data may be very valuable when we do not have many other data sets, but at the same time, any data set becomes negligible, its impact extremely small, when we have a lot of other data sets. We will see this principle at work in all the other simulations today, and in all the papers on this topic.

Let's recap what we just saw in this animation. We tested the leave-one-out approach: we removed some data points and measured their impact on the model's performance; we can say we were measuring the contributions of these data points to the prediction. So let's keep this in mind: we have data providers, data sellers, and we can use their data to improve our prediction task, our machine learning model, and we want to estimate the value of their data sets. We just discussed that if we analyze the value of a data point or data set in combination with many other data sets, its value is marginal, close to zero, while with very few data sets, the value of a new data set can be quite significant. So to properly analyze the value of certain data, we need to analyze it in combination with the existing data sets, in all possible combinations.
This is actually a very complex combinatorial problem, because the number of possible combinations is 2^n, where n is the number of data sets we have (in the case of linear regression, the number of data points). It is quite tricky to solve: how can we analyze all possible contributions, all possible combinations of data sets? It is not clear at all how to do it.

Now let me tell you about the part of data valuation I find most fascinating. It turns out that similar problems have already been studied for over 70 years, not in machine learning or statistics, but in game theory, specifically in cooperative game theory. In game theory, you have players joining coalitions to obtain mutual benefits, and it turns out that when n players join coalitions, we have to analyze 2^n possible coalitions: the same combinatorial problem. The question asked in cooperative game theory is how to allocate the benefits of cooperation among the players, which is also a very challenging combinatorial problem. But in cooperative game theory, very powerful tools have already been developed for addressing these challenges. There are many solution concepts in cooperative game theory, but one of the most popular is the Shapley value. The Shapley value is a concept for solving coalitional games originally proposed by Lloyd Shapley, a famous American mathematician and economist who received the Nobel Prize in Economic Sciences for his work. We do not have time now to cover all the theory behind the Shapley value, but the point is that this concept is very useful because it has properties that help us analyze the value of cooperating players and fairly allocate the value of cooperation among them. For example, similar players with similar contributions are allocated equal values, useless players that bring no contribution are allocated nothing, and so on. This is why the concept became popular in coalitional games, and this is why it can now be applied to data valuation.

What happened several years ago is that machine learning researchers realized that concepts from cooperative game theory can be directly translated to machine learning problems and used to perform data valuation. If we want to perform data valuation using, say, the Shapley value, we take the following steps. We have data sellers who own data sets, and we call them players. We have combinations of these data sets, and we call these combinations coalitions. Finally, we have the impact of these data sets on the model's performance, and we call this impact contributions. Our goal is to understand the value of the data sets, so in game-theoretic terms, we want to find an allocation of the benefits, the value of the coalitions, among these data sets. If we formulate the data valuation problem like this, we can use very powerful concepts from game theory, for example the Shapley value. This is why I find the topic of data valuation fascinating: for many years, some people said that game theory is not a very useful science, that it cannot be applied to real-world problems, that it is just theory. But now we see that game theory can be used to solve machine learning problems, to perform data valuation, and to develop data marketplaces. I think it is fascinating how two quite different fields of science, machine learning and game theory, are now working together to solve these new, modern problems.
Now let me show you what the Shapley value is and how it works in simple cases. Let's come back to the linear regression example, but with very few points: suppose we have just three data sellers, and assume, as I do in this simulation, that each data seller provides two points rather than one. To make the picture easier to analyze, I connected the data points provided by the same seller, so you can see two points provided by seller one, two points provided by seller two, and two points by seller three. We solve the linear regression model for the six data points provided and find the optimal line; now we want to estimate the value of these data points and the payments to the data sellers. How can we do that?

Let's try to apply the Shapley value. First we need to analyze the different combinations of these data sets, the different coalitions. In this case we have 2^3 = 8 possible coalitions, and for each coalition we have to run the simulation again: solve the linear regression problem and record its performance for that coalition. From these runs we can compute the marginal contributions of each data set to the possible coalitions, and then the Shapley value. The Shapley value formula looks like this:

phi_i(v) = sum over S ⊆ N \ {i} of [ |S|! (n − |S| − 1)! / n! ] × ( v(S ∪ {i}) − v(S) )

where N is the set of all n players and the sum runs over every coalition S that does not contain player i. At first it can be terrifying, because there is a lot going on, but it is actually quite a simple formula: just a linear summation of marginal contributions. The bracket on the right is the marginal contribution: the value of coalition S with player i, minus the value of coalition S without player i. In our setting, this difference is the performance of the linear regression with point i minus the performance without point i, and we sum these marginal contributions over all possible coalitions. The only difficult part is the fraction with factorials, but if you do not want to dig into the theory of the Shapley value, you can simply treat it as a weight. So the whole formula is a weighted summation of all possible marginal contributions; you compute this sum for each player, and you obtain the value that this player brings to the overall performance of the model.
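Since the example has only three sellers, the formula can be evaluated exactly by brute force. Here is a minimal sketch of that computation; the value function is left abstract, and in the regression case it could be, for instance, the negative sum of squared errors of the line fitted on a coalition's points (my assumption, not a detail given in the video).

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    # Exact Shapley values by enumerating all 2^n coalitions.
    #   players: list of player ids (e.g. data sellers)
    #   value:   function mapping a frozenset of players to model performance;
    #            value(frozenset()) is the performance with no data at all
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! from the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi
```

For three sellers this touches only 8 coalitions, but the count doubles with every additional seller, which is the scalability problem we will return to later.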
So let's see what happens in the linear regression example with just three data sellers. In this figure I computed the marginal contributions of the players to the eight possible coalitions, then computed the Shapley value and visualized the results using red and blue colors. Let's discuss what is happening here. Players one and two are shown in blue, and they provide negative contributions to the linear regression. Keep in mind that in linear regression we want to minimize the sum of deviations between the line and the points, so our problem is formulated in a way where decreasing the objective value is good for us: players bringing negative contributions to this task are helping, and that is why I use blue to highlight them. Then we have player three, who makes some positive contributions, but in this linear regression setting positive contributions are bad for the overall approximation; the Shapley value sees these contributions, so we can highlight this player in red. So this is a very simple cooperative game among three data sellers: we solved it using the Shapley value, and we can quantify exactly how useful certain data sets are. In this case we should probably pay more to data sellers one and two, and less to data seller three.

The next question is: does this solution make sense? In this simple case we can probably explain why data seller three is less valuable. If we consider the lines built between each seller's pair of points, we can see that data sets one and two are quite close to the correct linear regression, quite close to the dashed green line, while a linear regression based on data set three alone leads to a noticeably displaced line with quite large squared errors. So we might say that yes, indeed, the data in data set three is less useful: it brings more error into this linear regression. In this simple case, then, we can explain why the results look like this.

In the next slide I want to show you what happens inside the Shapley value. You do not have to do this when you use it, but you can visualize all the marginal contributions for all the coalitions you consider. Here I visualize these contributions for the three players: the marginal contributions are shown as black dots, and I also like to visualize the distribution of contributions, for example with a violin plot, so you can see where the contributions are located. Indeed, in this case players one and two have quite a lot of negative contributions, which is good for our regression task, while player three has some contributions close to zero and a few positive ones, which is bad for our task. The Shapley value sees all of these contributions, since it is a weighted sum of them, and it therefore quantifies that the first two players, the first two data sets, are more valuable for our task.
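A sketch of this kind of diagnostic plot; it assumes the marginal contributions have been collected per player (for example, inside the enumeration loop above) into a dict mapping each player to a list of numbers.

```python
import matplotlib.pyplot as plt

def plot_marginal_contributions(contribs):
    # contribs: dict mapping player id -> list of marginal contributions,
    # one per coalition that player can join.
    players = sorted(contribs)
    data = [contribs[p] for p in players]
    fig, ax = plt.subplots()
    ax.violinplot(data, showextrema=False)     # distribution of contributions
    for i, vals in enumerate(data, start=1):   # individual contributions as dots
        ax.plot([i] * len(vals), vals, "k.", alpha=0.5)
    ax.axhline(0.0, color="gray", linewidth=0.8)
    ax.set_xticks(range(1, len(players) + 1))
    ax.set_xticklabels([str(p) for p in players])
    ax.set_xlabel("player")
    ax.set_ylabel("marginal contribution")
    plt.show()
```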
This was quite a simple linear regression example: we analyzed how data from three data sellers interact with each other, and we were able to explain why certain data sets might be less useful to our linear regression task than others. We might be happy with this result and say that we now understand how a data market works for this example, but that is actually not true, and I want to show you that our intuition can sometimes be very wrong. Let me show you another example: the same linear regression problem, but now with nine data sellers, so you can see nine pairs of data points. Again, I solved the linear regression model for the possible coalitions and calculated the Shapley value, and here you can see the results, which data points are more useful and which are less useful. In this case, sellers number eight and five are the least useful and should be paid less than the other data providers. As in the previous example, we can say that yes, this seems right: the linear approximations based on those points, such as data set eight, are not very accurate, and that is why they are not very valuable. But I want you to look closely at some of the other data sets, for example the one from seller number four, located in the center of the figure. You may see that the line between the points of seller four is almost vertical, which looks like a completely wrong linear regression that should lead to many errors and bad model performance. But the Shapley value considers the many possible combinations and contributions of these data points, and it concludes that, despite its looks, this data can actually be useful, so it is allocated some value. As humans, we cannot really see how data interacts, and we cannot always perceive its true value; this is why we need mechanisms as sophisticated as the Shapley value.

Here I have prepared the visualization of all marginal contributions for this example with nine data sellers; you can see the marginal contributions as dots and their distributions as violin plots. One interesting thing I want to highlight is that the distribution of marginal contributions for each player is concentrated very close to zero. We already discussed this when testing the leave-one-out approach: once we have a lot of other data sets, the contribution of each single data point becomes unimportant, which is why so many contributions sit near zero. You can notice that each player has just a few significant contributions, and we should probably focus on those contributions, those coalitions, if we want to estimate the true value of the data provided by a player, because all the other contributions are negligible.

Before coming to other examples, I want to show you one very interesting manuscript that I recently found on arXiv, a paper by researchers in the UK titled "Accelerated Shapley Value Approximations for Data Valuation". The authors do the same things we discussed today: they apply the Shapley value to different machine learning tasks and study how it performs and how it can be approximated. The interesting thing I found is the conclusion, where the authors say that an important insight from their work is that small coalitions are good estimators of data value, while large coalitions are bad estimators because of the diminishing-returns property of data. We discussed this today: once we have a lot of data sets, the value of any one of them diminishes, becomes very marginal, and cannot really be used to estimate the value of the data. The authors also say that models of a certain medium-range size are most likely to be effective; in other words, certain medium-sized coalitions are the most informative coalitions in data valuation. And again, this is confirmed by the simple cases we analyzed today.
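This is also where approximation enters in practice: exact enumeration over 2^n coalitions quickly becomes infeasible, and a standard workaround is Monte Carlo estimation over random permutations of the players. The sketch below shows that generic estimator; it is my illustration of the idea, not the specific accelerated algorithm from the paper above.

```python
import random

def shapley_monte_carlo(players, value, num_permutations=1000, seed=0):
    # Approximate Shapley values by averaging each player's marginal
    # contribution over randomly sampled orderings of the players.
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        coalition = frozenset()
        prev = value(coalition)        # value of the empty coalition
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            phi[p] += cur - prev       # marginal contribution in this ordering
            prev = cur
    return {p: total / num_permutations for p, total in phi.items()}
```

Each sampled permutation costs n evaluations of the value function, so the total budget is controlled by num_permutations rather than by 2^n.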
All right, I have prepared another data valuation example for you today: another very simple task from statistics and machine learning, called support vector machines. In the animation, I add more points to these data sets in a loop and re-run the optimization to solve the support vector machine problem. Basically, we have two data sets, a green one and an orange one, and we want to find the hyperplane dividing them such that the margin between the two data sets is maximized. The question is: how can we characterize the performance of this task, and how can we analyze the value of data? Imagine that this data is again provided by certain data sellers, and we need to understand how valuable the data points are and how much we can pay for them.

In the previous example of linear regression we were trying to minimize the sum of squared errors, so we used the objective function, the minimized error, to characterize coalitions and the contributions of the data points. In this problem there is no regression; it is a classification problem, where we want to decide whether a data point belongs to the green data set or the orange one, and therefore we can use the number of correct classifications to characterize coalitions of players and analyze their contributions. Imagine we have 100 points, we solve the support vector machine problem, and we want to analyze the value of a certain data point. We can remove this point, solve the SVM problem again, and see how the model performs: if all 99 remaining points are classified correctly, we say the value of this coalition is 99, and if there are, say, five misclassifications, we say its value is 94, decreasing because some points are now misclassified. This is one way to deal with classification problems and analyze the value of data in support vector machines.
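A minimal sketch of that value function, assuming scikit-learn's SVC with a linear kernel; evaluating the trained model on all points, including those outside the coalition, mirrors the counting scheme described above but is my simplification.

```python
import numpy as np
from sklearn.svm import SVC

def svm_coalition_value(X_all, y_all, seller_of_point, coalition):
    # Number of correctly classified points when the SVM is trained
    # only on the points owned by sellers in `coalition`.
    mask = np.isin(seller_of_point, list(coalition))
    if len(np.unique(y_all[mask])) < 2:
        return 0  # no separating hyperplane without both classes present
    model = SVC(kernel="linear").fit(X_all[mask], y_all[mask])
    return int(np.sum(model.predict(X_all) == y_all))
```

Plugged into the shapley_values sketch from earlier, this allocates the count of correct classifications among the sellers.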
But this example with 100 points is too complex; let's instead analyze a support vector machine problem with just four data sellers. Each data seller provides a pair of points, one green and one orange, and we want to solve the support vector machine problem and analyze the value of the data. Again, we need to analyze all possible coalitions, compute the players' contributions to these coalitions, and then calculate the Shapley value. I have already calculated the Shapley value for this case, so here are the results of data valuation. Remember that we have a different value function in this classification problem: we want coalitions to classify as many points correctly as possible, so in this setting the Shapley value allocates the number of correctly classified points among the data sellers. Points highlighted in green are more valuable for the support vector machine task, and points highlighted in red are less valuable. In this particular case, the two data points provided by data seller four are the most useful, the most valuable, and we can guess why: if we build a hyperplane between these two data points, we will probably get the most accurate hyperplane, with the highest number of correct classifications, whereas other data points and coalitions will probably give less accurate hyperplanes with some misclassifications. So in this very simple SVM example, the Shapley value correctly quantifies which data points are valuable and which are less valuable.

Now, in this figure I have added a few more sellers, nine in total. Again I solved the coalitional game and calculated the Shapley value, and we can see the usefulness, the value, of the data points. Surprisingly, data set number four is still the most valuable one; it still leads to the highest number of correct classifications. But it is not so easy to explain what happens with the other points. Consider point number one: it is quite far away from the remaining points, and we might guess that it would lead to an incorrect support vector machine, an incorrect hyperplane with a lot of misclassifications. But the Shapley value estimates all possible contributions, and it says no, this point is quite decent and can still be somewhat useful for the task, while there are other points that are less valuable. It is very interesting to see how data valuation and the Shapley value work in this case.

We have now analyzed two applications of the Shapley value to data valuation problems: one was linear regression and the other a support vector machine. Those two examples are quite basic; I coded them myself for this video. To give you more ideas about data valuation, I want to show you a more complex example from the literature. I found a very interesting paper published in 2022, titled "Sampling Permutations for Shapley Value Estimation", in which the authors applied the Shapley value to the problem of image classification: we have a set of images, and we want to train a machine learning model to classify them correctly. Let me show you a few interesting pictures. Here you can see a picture of an acorn and one of a space shuttle. The authors took these pictures and split them into tiles, into segments, and wanted to analyze the value of these segments, so they created a coalitional game among the tiles: we can say that these parts of the images are now the players providing data, and the point is to understand which data is more important. Let me quote what the authors say in the paper: yellow areas show image tiles that contribute positively to the predicted label, and dark purple areas correspond to areas contributing negatively to the predicted label. So by formulating this data valuation problem, the authors wanted to see which parts of the images are more important, which is a very useful and intuitive question to ask. Let's see the results, checking the acorn picture first. The acorn itself, the nut and its cup, and some branches are quite useful for the correct classification of this image as an acorn, while the rest, such as the green background, is not that useful. The Shapley value works for this problem and correctly interprets the value of the data, the value of the different segments of the picture. The second picture is a space shuttle, and the Shapley value explains that the shuttle itself, the flames, and some clouds are very useful for correct image classification, while the remaining blue background is not very useful and can sometimes even contribute negatively to the predicted label. I believe this is a very interesting example of applying the Shapley value to image classification.
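The tile experiment fits the same framework we used above: tiles are players, and a coalition's value can be taken as the model's confidence in the true label when only that coalition's tiles are visible. Below is a hedged sketch of such a value function; the masking-by-mean-color choice and the model's predict_proba interface are my assumptions, not details from the paper.

```python
import numpy as np

def tile_coalition_value(model, image, label, tile_ids, coalition):
    # Model confidence in class `label` when only the tiles in `coalition`
    # are visible and all other tiles are masked out.
    #   image:    H x W x C array
    #   tile_ids: H x W array assigning each pixel to a tile (player)
    #   label:    integer class index (assumed)
    #   model:    assumed to expose predict_proba(batch) -> class probabilities
    masked = image.copy()
    hidden = ~np.isin(tile_ids, list(coalition))
    masked[hidden] = image.mean(axis=(0, 1))  # replace hidden tiles with the mean color
    probs = model.predict_proba(masked[np.newaxis])[0]
    return float(probs[label])
```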
What I want to say at the end of this video is that data valuation arises not only in machine learning problems. The principles of data valuation and data markets can be applied not just to learning or predictive tasks but to any optimization problem. For example, I now see studies in power systems research where people apply data-driven concepts and data valuation concepts to, say, electricity markets, discussing how to apply these ideas to forecasting markets in power systems. We have a lot of distributed energy resources, wind turbines and solar panels, and they produce a lot of data, a lot of forecasts about their potential power production or consumption. There is no data marketplace for energy forecasting in power systems today, so we have to develop new solutions for how to purchase forecasts about generation and demand from different participants, how to distinguish valuable data from less valuable data, how to calculate the payments, and so on. This is a very useful topic, and I see more research papers and more research grants about incentivizing data sharing in power systems, and data sharing in general; I believe this will be a very useful and impactful tool for a long while. The only problem is that applying tools such as the Shapley value is computationally very challenging, because we may have to solve our optimization or machine learning models thousands or millions of times, which is sometimes impossible. Fortunately, there exist some tricks for dealing with these data valuation problems, and maybe I will discuss them in other videos.

So, today we discussed the principles of data valuation. We considered a few very basic examples, linear regression and a support vector machine; I showed you how the leave-one-out principle works, and then we analyzed how the Shapley value works. The next time you are working on a data-driven project, you can ask the out-of-the-box question: not only how do we develop a model and use this data, but how can we understand the value of the data, how could we create a data marketplace for this kind of data, and how would we calculate payments to data sellers? I think this is a very interesting, very exciting question. I hope today's video will be useful for you; please let me know in the comments if you have any questions or are interested in other topics. See you soon, bye!
Info
Channel: ChuScience
Views: 1,078
Id: HXIkjGKjg-o
Length: 39min 47sec (2387 seconds)
Published: Wed Jan 10 2024