Differential Privacy And The Complexity Of Simple Queries

Captions
[Session chair:] ...here today to talk about privacy and the complexity of simple queries. It's a single-talk session. Okay, thanks everyone for coming.

[Jonathan Ullman:] Thanks, everyone, for coming. I'm really excited to be here; this is my second live talk of the post-COVID era, so I'm still very happy to be able to give live talks. I'm doing a couple of things I've never really done before in a talk like this. The first is that I'm wearing a mask, so if you're having trouble hearing me, or if I'm mumbling, please let me know. The second is that, since everybody now likes to doodle on slides by hand, I'm upping the game by doing entirely handwritten slides. I've never done this before, and I haven't been this self-conscious about my handwriting since eighth grade, so please let me know if you can't read anything, or if you need me to rewrite something or zoom in. [Audience comment about how talks used to be handwritten slides on a projector.] Right, this is basically what all talks were in your day: handwritten slides, projected. My dad still has a lot of his, actually. [Laughter] I've never personally witnessed a talk given on an overhead projector; sorry, Ilias, I'll give you a recap later.

The other thing I'm doing that's a little different: this workshop is on rigorous evidence for information-computation trade-offs, and Sam emailed me and said, "Jon, you're really bad at making things efficient, so maybe you could tell us about that." I said sure, that's a great way to invite someone to give a talk. So I'm not going to talk about a specific paper or anything terribly recent. What I'm going to do is talk about one open problem in differential privacy that has been at the center of a lot of research, a problem that at one point drove a lot of work on the topic and where progress has stalled. I'm going to describe the problem, how we got to it, what we know about it, the kinds of approaches people have applied, and where they got stuck. In the process I'm not going to be able to give you much rigorous evidence for information-computation trade-offs; instead I'm going to point you to a problem where this phenomenon currently arises, and where I think the people in this room, and the people watching on Zoom, at least those who can hear me, not Ilias, can possibly help. As a result I'm basically going to be telling a story. It's not going to be a very technical talk, so please ask questions; the more interactive this is, the more everyone will learn from it, including me.

So let's begin. What's the high-level story of the problem we're trying to solve? We start with a very basic setup for statistical estimation: we have some population P, we draw a sample X, and for today you can definitely just think of i.i.d. samples, and we want to use this sample to draw some conclusions about the population. The conclusion could be essentially anything: an estimate of the mean of the population, a predictive classifier, a hypothesis test.
At this level of generality it's really not important what it is. The goal, of course, is not to learn anything about the sample per se; the goal is to learn about the population. But the sample we collect consists of real people's data. For example, some of this has come up in the context of working with the Census Bureau, where the data set consists of actual people whose doors were knocked on and who had to fill out a card, or users of some system like Google or Facebook who want their data used in a certain way. From the perspective of the person analyzing the data, the goal is just to learn about the population, but the way we do that is by using a sample of real people's data, so we have to protect the privacy of that data. We have to ensure that we learn about the population in a way that doesn't reveal too much about the people actually in the sample.

At first blush this sounds like a great situation, because what we're trying to do, draw conclusions about the population, is perfectly aligned with the privacy goal. We only want the sample as a means to understanding the population; in some sense our goal is to avoid learning anything specific to the sample and only learn things that are true about the population. The canonical example: I might want to learn whether smoking is correlated with getting lung cancer at some point in your life, but if I'm conducting a study to find that out, I don't inherently care about any one person, whether they smoke or whether they have cancer. My goal is not to learn about one person in the data set, it's to learn about the population. So you might think there isn't even a problem here: maybe a well-designed estimator or machine learning algorithm would inherently respect the privacy of the sample, because in order to do a good job of learning about the population it should not be overfitting to the sample.

But in fact most out-of-the-box statistical methods do reveal a lot of information about the individuals in the sample, in non-obvious ways. I'm not talking about methods that output, say, a representative data set which obviously contains individual samples. Even methods that reveal something like regression coefficients, the parameters of a predictive neural net, or the mean of the data set can reveal a lot of information about the samples in fairly surprising ways, and often inherently so: there are often inherent requirements that any good method for inferring something about the population must in fact reveal a lot about the sample.

This has been formalized, or observed, in many different contexts; I'll point to a couple of lines of work. The whole field of differential privacy was more or less launched by a paper on reconstruction attacks by Dinur and Nissim, a seminal paper that has been really influential in data privacy. What it showed is that for many natural estimation problems you can, to a large extent, reverse this process: go from some seemingly benign-looking output that describes the population back to the sample, or a large part of the sample, that was used to produce it.
In the context of high-dimensional statistics there's been a more recent line of work on membership inference, where you don't reconstruct the data set but you can identify whether or not a particular person was included in the sample, which can itself be a privacy violation. This began with an influential attack, usually referred to as Homer et al., on genomic data sets. In the machine learning community, people have also started to look at more blatant types of privacy violations, where large neural networks effectively memorize training data: when the predictive model is used in the right way, you can extract from it specific examples it was trained on, high-entropy, unique examples that don't reflect any population trend and really just reflect one training example in the world. That was first shown in a paper called "The Secret Sharer" by Carlini et al., but it has become an active area of research, and there's been an interesting line of work on the extent to which machine learning in certain settings essentially requires this type of memorization.

The gist up to this point is that there is a rich theory of attacks, routes for breaching privacy starting from seemingly benign statistical outputs, and the conclusion is that unless you explicitly do something to rule these types of attacks out, they'll probably be possible. The subtext of a lot of this work is that most of these papers begin with "somebody proposed a heuristic to prevent privacy violations," and lo and behold, another attack was possible. So there should be some rigorous way of arguing not only that these particular attacks don't work, but that a broad class of attacks is impossible, and for the past 15 years the criterion that has emerged is differential privacy. The paper that introduced it was by Dwork, McSherry, Nissim, and Smith in 2006, and it has been a very active area of study.

What it is, basically, is a certain kind of robustness condition that an algorithm can satisfy. The idea is: suppose we have a data set X, the top data set here, and hypothetically you are a person in this data set and you're a little worried about your privacy. We're going to run the data set through some kind of mechanism. For some reason, algorithms in this literature are called mechanisms; I tried to change this and a journal reviewer told me the ship had sailed and I needed to stop, so I've given up. So we run the data set through some mechanism and it produces some conclusion, maybe something about the population, but it really doesn't matter. You're worried that the conclusion will reveal something about you. But you're not just worried that it will reveal something about you in general; you're worried specifically that it will reveal something about you because you agreed to have your data included in this computation. This is an important idea: we're going to consider a counterfactual in which we run the mechanism on the same data set except that we've taken your data out of it, or maybe replaced it with some dummy record, so that we've
stripped anything specific to you out of the data set. We want to argue that the conclusion we reach should not depend in a significant way on which of these two cases we're in, whether we used the real data set or the data set with you stripped out of it. What that means is that no attacker should be able to learn too much about you because your data was sampled; the fact that you were one of the unlucky people actually included in the sample should not cause any attacker to learn too much about you. To get that, we formalize a condition on the mechanism saying that these two worlds are indistinguishable.

Formally, (ε, δ)-differential privacy says that for every two data sets that differ on one person's data, not just your data but any one hypothetical person's data, and for every possible event on the output space, the probability of that event when the mechanism is run with your data, on X, shouldn't be too different from the probability of that event when the mechanism is run on the hypothetical data set X′. At that level of generality this just says that the distributions M(X) and M(X′) have to be close in some sense, but for reasons I'm not going to get into, and that are important, there's a very particular notion of closeness used in (ε, δ)-differential privacy, and its form really matters. If you ignore the δ for a second, the notion of closeness is that the ratio of the two probabilities should be upper bounded by something a little bigger than one. For technical reasons we typically write it as e^ε, but ε should be fairly small, so think of this as saying that probabilities should increase by at most ten percent or so. Also for technical reasons we add the extra additive term δ, which says that for really, really unlikely events we don't care about the pointwise ratio of probabilities; you should think of δ as a very, very small quantity. But the key thing is the multiplicative ratio of probabilities. I like to think of this as a notion of distributional stability or robustness: if I change one person in the data set, say you, then the output of the mechanism shouldn't change much as a probability distribution, in this specific multiplicative sense, and this is what gives it its strong semantic guarantees.
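For reference, here is that condition written out the way it usually appears in the literature (this is just the standard definition described above, with M the mechanism and X, X′ any two data sets differing in one person's data):

```latex
% (epsilon, delta)-differential privacy [Dwork, McSherry, Nissim, Smith 2006]:
% for all neighboring data sets X, X' and all events S over the outputs,
\Pr\big[M(X) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[M(X') \in S\big] \;+\; \delta .
```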
Okay, so let me go on and describe a problem that's been at the heart of research on differential privacy. I've described the definition, and a lot has been written about why it's a good definition and what it means; the question now is what we would like the mechanism to do. A problem that's received a lot of attention is coming up with differentially private algorithms that answer statistical queries about the population. If you've heard of the statistical queries model from Kearns in 1993, what I'm asking is essentially: can we implement the statistical queries model via a differentially private algorithm? If you've never heard of it, that's fine; let me just be very concrete.

We have some distribution P over a data universe U, which for today I'll just think of as {−1, +1}^d, so d bits: everybody in your world answers d yes-or-no questions, for example, and that's their data. We have some queries we'd like to answer, and the queries have this very specific form: we just want to estimate the mean of some function over the population, where for today the function is a ±1-valued function. Some of these details are just for concreteness; any bounded function is fine. For every query q_j, I'll write q_j(P) for the expected value of the predicate q_j on a random point from the population (I like to abuse notation a little this way), and capital Q will denote the function that concatenates the answers to all k of these queries, just to keep the notation simple. Then I have a data set drawn i.i.d. from my distribution P, and the goal is: given a set of k queries, with no interaction at all, design some estimator M such that if I give it n samples from any distribution P, it approximates Q(P) fairly well. For today the notion of approximation is closeness in ℓ∞ distance: for every query I want to get within about one in a hundred. The answers range between −1 and +1, so I want some nontrivially good answer for every query. Some of the technical statements I make will be a little fast and loose about whether I want, say, ℓ∞ or ℓ2 or ℓ1; I wanted to give something concrete, but you can of course study this in lots of different metrics and regimes.

So this is a canonical problem. It seems very simple, but if I give you enough queries it lets you do all kinds of things. Let's look at a couple of quick baselines. The obvious baseline, if you don't care about privacy, is to use the empirical mean: pretend your data set is the population, look at the mean of the queries on the data set, and return that as if it were the population. Obviously a lot of the talks in this workshop are about how hard it can be to infer something about the population given samples even in the absence of privacy constraints, but for this problem the non-private version is basically trivial: the way I've set it up, a roughly logarithmic number of samples in the number of queries suffices. So the gist is that a very small number of samples suffices without privacy, and we want to know how adding a privacy constraint makes this problem harder. We're starting with a problem that's essentially trivial without privacy, both information-theoretically and computationally; there's nothing interesting going on without privacy.
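To fill in that claim, the calculation behind the non-private baseline is just Hoeffding's inequality plus a union bound over the k queries (α here is the ℓ∞ accuracy target, about 0.01 above):

```latex
% Each q_j(X) is an average of n i.i.d. terms in [-1, +1], so
\Pr\Big[\,\exists\, j \le k:\ \big|q_j(X) - q_j(P)\big| > \alpha \Big]
  \;\le\; 2k\, e^{-\alpha^2 n / 2},
% which is small as soon as  n = O(\log k / \alpha^2).
```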
There is, however, a very simple baseline mechanism that does satisfy differential privacy. It emerged out of the early line of work; it's a little hard to pin down exactly which paper did it this way, but it's essentially what falls out of day one of differential privacy. You take the empirical mean, which is a vector in k dimensions containing the answers to the k queries, and you perturb it with independent Gaussian noise on every coordinate, so spherical Gaussian noise. If you perturb it with enough noise, then obviously, as you crank the variance up to infinity, you eventually lose all information about the empirical mean, so it becomes private; the question is how much noise you have to add. The answer turns out to be that the amount of noise you have to add to make this differentially private is about the square root of the number of queries divided by n, if there are n samples. So as you add more samples you can add less noise, but as you ask more queries you have to add more noise. It's not terribly insightful to analyze this here, but the gist is that changing one person in the data set can only change the empirical mean by a small amount in, say, ℓ2 norm: because I'm averaging over n people, changing one person changes the empirical mean by only about √k / n. So this says that with about √k samples I have enough to estimate the answers to my queries privately.

That's a very big gap: even with the union bound, it's about log k samples without privacy, and suddenly I need about √k samples with privacy. But it turns out you can do much, much better than this. Again, I'm not really going to go into the algorithms exactly, although I'll give a taste of how they work in the next part of the talk. If you imagine you're in a setting where you have way more than d queries, so your data has dimension d but you're asking, say, d^10 or even exponentially many queries in d, you can do much better. What emerged out of the second wave of differentially private algorithms, starting with a really influential paper by Blum, Ligett, and Roth, with many subsequent algorithms that refined their ideas and added very beautiful new ones, is this: for any set of k queries, there is a differentially private algorithm that accurately answers all of them given a sample of size about the square root of the dimension. It's not a direct improvement on what we had before, since now it depends on the dimension, but when the number of queries is much bigger than the dimension it says the sample complexity can be almost independent of k (again, I'm hiding a log factor). So we can go from the square root of the number of queries to the square root of the dimension, and answer even an exponential number of queries in the dimension without really paying for it in sample complexity.

Perhaps unsurprisingly, given the preface to this talk, the big downside of this second wave of algorithms is that they're not computationally efficient. If you think about the simple baseline I described, Gaussian noise addition, the bottleneck is just evaluating all the queries: you start by writing down the vector of answers to all k queries, and then you perturb them all with noise, so basically the bottleneck is how fast you can write down the queries. Now, the way I've stated the problem, the queries themselves could take time exponential in d to evaluate; I just said you have some arbitrary function mapping d bits to one bit, so nothing says that even without privacy I can do this in polynomial time.
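Before going on, here is a minimal sketch of that Gaussian-noise baseline, assuming ±1-valued query predicates as above. The function name is mine, and the noise scale is the textbook Gaussian-mechanism calibration applied to the ℓ2 sensitivity 2√k/n of the vector of empirical means (the √k/n scaling mentioned above), so treat the constants as illustrative rather than as exactly what is on the slides:

```python
import numpy as np

def gaussian_mechanism_answers(X, queries, eps, delta, rng=None):
    """Answer k statistical queries on a data set X (n rows in {-1,+1}^d)
    by adding spherical Gaussian noise to the vector of empirical means.

    `queries` is a list of functions mapping one row to a value in [-1, +1].
    Changing one row moves each empirical mean by at most 2/n, so the l2
    sensitivity of the whole answer vector is 2*sqrt(k)/n; the noise scale
    below is the usual Gaussian-mechanism calibration for (eps, delta)-DP.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    k = len(queries)
    empirical = np.array([np.mean([q(x) for x in X]) for q in queries])
    l2_sensitivity = 2.0 * np.sqrt(k) / n
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return empirical + rng.normal(scale=sigma, size=k)

# Toy usage: 2-way marginal-style queries q_{ij}(x) = x_i * x_j on random data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5000, 20
    X = rng.choice([-1, 1], size=(n, d))
    queries = [(lambda x, i=i, j=j: x[i] * x[j])
               for i in range(d) for j in range(i + 1, d)]
    noisy = gaussian_mechanism_answers(X, queries, eps=1.0, delta=1e-6, rng=rng)
    print(noisy[:5])
```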
But the right way to think about the running time of this Gaussian mechanism is that once you're done writing down the queries, which in most natural cases is very efficient, the additional time needed to make the answers private is just linear in the number of queries: add a Gaussian to every query. On the other hand, for these other algorithms the running time is always exponential in the dimension. Even if the queries are extremely simple to evaluate, even if there are just d² of them, getting private answers this way requires time exponential in d.

Once this picture crystallized, it became clear that there's some interesting computational-versus-information-theoretic phenomenon going on, and people of course started to ask whether you can do better: is there a third wave of algorithms that is efficient and works for all queries, that is, for worst-case queries, anything you could possibly imagine asking? What emerged is that the answer is no. If you look at my table, it basically says that if you want a polynomial-time algorithm, the number of samples has to grow like the square root of the number of queries, while if you're willing to run in exponential time it only has to grow like the square root of the dimension, even when the number of queries is huge, and people were able to show that this is essentially best possible. On the information-theoretic side, the question was whether you can get rid of the √d completely and use only O(1) samples; in a paper with Mark Bun and Salil Vadhan we showed that's not possible: if you want to answer a lot of queries, then even information-theoretically you need about √d samples once the number of queries is bigger than d. On the computational side we showed a pretty big gap: in work with Lucas Kowalczyk, Tal Malkin, and Mark Zhandry, we showed that if I give you k arbitrary queries, with the only constraint being that the queries themselves can be evaluated in polynomial time (I don't want the problem to be hard just because the queries are hard to evaluate), then for every k there is a set of k polynomial-time-evaluatable queries for which the number of samples you need is polynomial in k. I don't know how to get exactly k^{1/2}, but polynomial in k; we showed something like k^{1/7}, and there's nothing special or interesting about the seven.

So this picture says that, for the most part, we understand the worst-case complexity of this problem, where worst case means worst case over populations and over queries. There are definitely interesting open questions here, for example understanding for which types of queries you can get around the √d information-theoretic lower bound.
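For orientation, here is a rough reconstruction of the table on the slides summarizing this worst-case picture (constants and log factors suppressed; "queries" means arbitrary efficiently evaluatable statistical queries over {−1,+1}^d):

    Approach / bound                        Samples needed          Running time
    Gaussian noise addition                 about sqrt(k)           about the cost of evaluating the k queries
    Second-wave algorithms (BLR and after)  about sqrt(d)           exponential in d
    Lower bound, any algorithm              at least about sqrt(d)  information-theoretic, once k > d
    Lower bound, efficient algorithms       at least k^{Omega(1)}   polynomial time, under crypto assumptions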
But the research thrust that I think has been most interesting, and where there are a lot of unresolved questions, is when you can improve the computational complexity: when can you eliminate these information-computation trade-offs, or show that they're inherent, by assuming more structure in the queries? That is, suppose you're not answering arbitrary queries defined by some arbitrary predicate on the domain, but something very simple, with structure. By far the case that's received the most attention in this computational picture is what are called m-way marginals; you could also think of them as the moments of the distribution, or call them m-way parity queries. Moments is probably the nicest way to think about it, but I'll stick with m-way marginals because that's what people have been saying. These queries have been around since basically the beginning of differential privacy; they were motivated by applications in the census, but they have an extremely simple description for the TCS community. I have some parameter m, think of m as a constant, and my domain has size 2^d, so a data point is d bits. For every subset S of at most m coordinates, I want to know the expected value, over a random point from the population, of the product of the coordinates in S. For example, if m = 2, then for all pairs (i, j) I want to know the expectation of x_i times x_j. So this is really just asking for the order-m moments of the distribution.

[Audience: Why does this come up in the census?] Okay, so what does the census release? They collect a relatively small number of attributes from people: which census block you live in, which is a way of encoding your address, some representation of your sex and your age, and race and ethnicity. The types of things they release are things like "in census block 1402, how many people are non-white and of voting age." [Audience: So these are intersections of a small constant number of attributes.] Exactly. The things they release are these marginal tables, where you're basically asking for a description of the marginal distribution over, say, voting age versus non-voting age. The main difference is that there you typically don't have binary features, but this is the fundamental problem at the core of what they're trying to do. They're also often not interested in releasing every single one of these; they're interested in some particular subsets, not every set S of size up to m. But as stylized as this problem looks, I think it's extremely fundamental and really does capture the difficulty in a lot of the work they do in practice, so I do think it's quite connected to the practice of differential privacy. And of course it also captures things like, if you relax the binary-features assumption, "what is the covariance matrix of the distribution?" It's hard to have a more natural problem than that; even with binary features, the m = 2 case is a special case of asking for the covariance of the distribution.

So this is basically the problem that I claim is open: how do you answer these queries in a differentially private way?
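To make the query class concrete, here is a small sketch that writes down the empirical answers to all m-way marginal queries of a ±1 data set, i.e., the empirical order-m moments just described (the function name is mine, purely for illustration; the private versions add noise, as in the earlier sketch):

```python
import numpy as np
from itertools import combinations

def m_way_marginals(X, m):
    """Empirical m-way marginals of a data set X with rows in {-1,+1}^d.

    For every subset S of exactly m coordinates, returns the empirical mean
    of the product of the coordinates in S, i.e., an entry of the order-m
    moment tensor (1/n) * sum_i x_i tensored with itself m times.  There are
    roughly d^m / m! such queries, which is the d^m query count used above.
    """
    n, d = X.shape
    return {S: float(np.mean(np.prod(X[:, list(S)], axis=1)))
            for S in combinations(range(d), m)}

# Toy usage: all 3-way marginals of a small random data set.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.choice([-1, 1], size=(1000, 8))
    answers = m_way_marginals(X, 3)
    print(len(answers), "queries;", list(answers.items())[:3])
```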
Just to see what the actual question is, let's look at what our two types of algorithms give. If I want to do Gaussian noise addition, the number of queries here is about d^m, so that algorithm uses about d^{m/2} samples, something exponential in m; even for constant m this can be a large polynomial in d. As for running time, as long as m is reasonably small and we're willing to just write down, say, all the three-way marginals, this is a very efficient algorithm (you could of course ask for something even more efficient than that, but let's not go there). The more advanced mechanisms say it doesn't matter what m is; there's maybe a linear-in-m factor hiding, but for any constant m I can get away with about d^{1/2} samples, except that the running time is now exponential in d, and I think I'll have enough time to show you specifically where that comes from.

So this is the state of the world, and the question that I propose people study, and that I would love to chat about, is: are there computationally efficient algorithms for privately releasing m-way marginals, say for any constant m, with optimal sample complexity, or even just much better than d^{m/2}? There's been a little progress on this, but not too much. The only time I've ever offered a reward for an open problem, the reward was that I would go out with you and Thomas Steinke for all-you-can-eat sushi at a particular restaurant in Boston, and two people simultaneously solved that problem. They haven't claimed the prize because of COVID and so on, but I figure I have a good track record, so I will go out for sushi with you. We go together, though; you can't just take the sushi with someone else. [Audience: Does Thomas Steinke still live in Boston? It might be a long flight for him.] No, a lot has changed since we proposed that reward; it could be really hard to give out. The rewards seemed much easier to give out back when we actually gave them, but they do seem to prompt progress on problems, and I get to eat sushi.

Okay, so this is the problem. Let me describe a little bit of the kinds of things people have looked at. People have taken very different approaches, and I want to translate some of the algorithms into things that feel like active areas of study for the people in this room. There's a line of work that I think started from a paper of mine with Anupam Gupta, Moritz Hardt, and Aaron Roth, where we tried to resolve this problem using algorithms from learning theory, from PAC learning theory. There's a simple idea at the core of it, which is to flip the problem around a little. We're going to think of a function that takes as input a query, so in this case the description of, say, a three-way marginal, i.e., three coordinates, and returns the answer to that query. That's a perfectly well-defined function, and we're going to try to apply private learning algorithms to that function. Why is that useful?
Well, what does a learning algorithm do? It gets to see points of the function, which in this case means random queries: a point of the function is a query. So I'll sample some random queries and show you the answers to that random subset of queries. The thing that makes learning algorithms good is that they need many fewer examples than the input size of the function, so even though there are lots of queries, a good learning algorithm for this problem could take the answers to a small number of queries and extract from them the answers to all the queries. Now, I need to give you private answers to the queries in the sample, and I can do that using the Gaussian mechanism as long as the number of sampled queries is small. (I shouldn't overuse the word "samples," so let's call them example queries; those are the points you get to see.) I give you an estimate of the value of the function on each example query, computed from my sample data set by adding Gaussian noise to the answers, and since there's only a small number of example queries I don't have to add much noise. Then you try to figure out the answers to the rest of the queries: in the standard learning paradigm, you fit a function that agrees with all the queries you've seen so far and hopefully also generalizes to unseen queries. And as you know, there's a trade-off between the function class and the sample and computational complexity.

In fact, you can basically recover optimal algorithms this way, just by noting that there is an information-theoretic way to do this using only about d example queries for the class of queries I described, although the learning problem wouldn't be solved efficiently. But we were able to give non-trivial efficient algorithms this way: for example, using techniques based on Chebyshev polynomials, we showed that you can improve the sample complexity to something like d^{O(√m)}, sub-exponential in m, with about the same running time; the running time is still about d^m to write down the queries, but the overhead on top of that is smaller. The point is not so much this particular result; this is a paradigm for how you might try to solve the problem. You cast it as a private learning problem, where you come up with an efficient way of seeing the answers to a small number of these marginal queries and somehow infer accurate answers to the rest of the marginal queries.

That approach stalled out, and with the benefit of hindsight, I think the reason is that it's essentially an instance of the noisy tensor completion problem. Think about what this function is for marginal queries: I give you m coordinates, and you tell me the average, over elements of the population, of the product of those m coordinates. So what this function really is, is the function that tells you a particular entry of a convex combination of rank-one order-m tensors. That's a problem that's been studied, of course, and I'm sure there are people in this room who know more about it than I do. The punchline is that, information-theoretically, if I give you about d entries of this tensor, there is a way to recover the missing entries; you can complete the tensor.
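Writing that observation out (my notation, with X = (x_1, ..., x_n) the sample and S a set of m coordinates): the function from queries to answers is exactly the entry map of the empirical moment tensor, which is a convex combination of n rank-one tensors:

```latex
f_X(S) \;=\; \frac{1}{n}\sum_{i=1}^{n}\;\prod_{j\in S} x_{i,j}
        \;=\; T_S\,,
\qquad\text{where}\qquad
T \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i^{\otimes m}.
```

Seeing noisy values of f_X on a few random example queries and predicting the rest is then exactly noisy completion of the low-rank tensor T, which is where the barrier described next comes from.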
But at least from what I know about this problem, there's a paper by Barak and Moitra showing that, under certain refutation assumptions, any computationally efficient algorithm for tensor completion, even for a third-order tensor, has to see about d^{3/2} entries of the tensor. That translates into saying that if you're working in this framework, any private algorithm would have to have a sub-optimal sample complexity of at least about d^{3/4}. [Audience: But if you could get d^{3/4} for m = 3, that's already better than the naive thing, right?] It is; the naive thing would have been d^{3/2}, yes, so you can do better. And there's a converse: you can basically take an algorithm for tensor completion and turn it into a private algorithm. [Audience: So this is an upper bound as well in general, right?] Yes, in general something like d^{m/4}. As I said, this is the state of the problem; there are some non-trivial upper bounds, one of which I stated on the last slide and one of which you basically just figured out. But basically this says you can't do better than d^{Ω(m)} using this approach, at least for constant m. So this is one place where there's a barrier: this approach seems to run into the noisy tensor completion barrier. There is rigorous evidence for information-computation trade-offs in tensor completion, but it only captures this one style of algorithm.

So let me describe, I've got about eight minutes left, one other style of algorithm, very briefly. (The point is, I've been doodling slides for the last 18 months with no one to show them to, so I could be talking forever if no one cuts me off.) This approach comes from a really elegant paper by Nikolov, Talwar, and Zhang. It's spiritually quite related, but it has some slightly different features that make me think it's interesting to study. Their mechanism is called the projection mechanism. They don't cast the problem as a learning problem. They say: let's just start with the trivial algorithm, write down all d^m answers to the queries, and use the Gaussian mechanism, adding so much noise that it's private no matter what. When the number of samples is small, we have to add so much noise that the answers don't even live between −1 and +1 anymore; they're totally outside the realm of plausibility. But then we say: fine, our answers are not consistent with any actual distribution. These cannot be the order-m moments of any distribution; they've completely lost their tensor structure. So let's just project back onto the set of answers we could have gotten that are actually consistent with some distribution. You can define this polytope K: it's the set of possible answers you would get to all the moment queries if they were the true answers on some population, the convex hull of 2^d vertices, one defined by each vector in {−1, +1}^d.
So I started with my true answer here, the blue dot y; I added some crazy amount of noise to make it private, so I have this point ỹ that doesn't even live in K anymore; and I just project it back into the polytope, in the obvious sense, in ℓ2. One nice property of differential privacy is that once I have the noisy point, I can do anything I want to it and it remains differentially private; differential privacy is closed under post-processing, so the privacy analysis is basically trivial. The question is whether the projected point ŷ is actually closer to y than where I started, by a non-trivial amount, and the answer is yes if the polytope K has low Gaussian width. What you can show is that the distance between the true point and the projected point is controlled by the Gaussian width of the polytope. If you haven't seen Gaussian width: you draw a standard normal g, look at the largest inner product of g with anything in the polytope, in absolute value, and take the expectation over g of that quantity; it's like a relaxation of having bounded diameter. As long as K has small Gaussian width, this will be very accurate, in fact information-theoretically basically optimal. What's really surprising is that I'm taking a mechanism that adds such a crazy amount of noise that the answers look useless, but there's actually enough information there, information-theoretically, to recover very good answers, in fact the best answers we could get with privacy.

Now, obviously this projection is not efficient, because it's projecting onto the convex hull of order-m tensors defined by binary vectors. But we can try some relaxation: the goal is to find an efficient relaxation of this set K that also has low Gaussian width and admits efficient projection. In a follow-up paper, Dwork, Nikolov, and Talwar found such a relaxation based on SDPs. It doesn't quite give d^{m/4}; it has a parity issue and gives something like d^{⌈m/2⌉/2}, but you can use fancier ideas to get it down to roughly d^{m/4}, so morally m/4. One particular consequence is that this does give an optimal algorithm for m = 2, the second-moment case, but it's still not optimal for m = 3 and higher.

So these are basically the algorithms we know, and I think progress here runs into trouble in a similar place to where progress on noisy tensor completion runs into trouble, but it's less clear: the connection to refutation, I don't actually know that it goes through. So one nice question is: could you show, for example, that anything in this framework cannot be efficient and optimal, say under these refutation assumptions? I don't know how to do that, but it seems at least closer to something we could study.
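Here is a toy sketch of the projection idea for the m = 2 case, where the noisy second-moment matrix is pushed back toward something that looks like a moment matrix. I use a few rounds of alternating projections onto the PSD cone and the unit-diagonal constraint purely as a crude, easy-to-code stand-in for the carefully chosen SDP relaxation in the Dwork-Nikolov-Talwar paper; it illustrates the post-processing step, not their actual guarantees, and the noise scale sigma is illustrative rather than calibrated to a specific (ε, δ).

```python
import numpy as np

def noisy_second_moments(X, sigma, rng):
    """Empirical second-moment matrix of X (rows in {-1,+1}^d) plus symmetric
    Gaussian noise of scale sigma per entry (the 'add a crazy amount of noise'
    step; calibrating sigma is the same Gaussian-mechanism calculation as before)."""
    n, d = X.shape
    M = X.T @ X / n
    noise = rng.normal(scale=sigma, size=(d, d))
    return M + (noise + noise.T) / np.sqrt(2.0)

def project_toward_moment_matrix(Y, iters=50):
    """Alternating projections: nearest PSD matrix (clip negative eigenvalues),
    then reset the diagonal to 1, repeated.  The true set K is the convex hull
    of x x^T for x in {-1,+1}^d; PSD matrices with unit diagonal are a (loose)
    convex superset of K, so each step cannot increase the Frobenius distance
    to the true moment matrix."""
    Z = (Y + Y.T) / 2.0
    for _ in range(iters):
        w, V = np.linalg.eigh(Z)
        Z = (V * np.clip(w, 0.0, None)) @ V.T   # projection onto the PSD cone
        np.fill_diagonal(Z, 1.0)                # projection onto unit diagonal
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.choice([-1, 1], size=(200, 30))
    truth = X.T @ X / X.shape[0]
    noisy = noisy_second_moments(X, sigma=0.5, rng=rng)
    projected = project_toward_moment_matrix(noisy)
    print("Frobenius error before projection:", np.linalg.norm(noisy - truth))
    print("Frobenius error after projection: ", np.linalg.norm(projected - truth))
```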
So I will wrap up there with that question, and let me just make one observation about why I think we're stuck on these problems, where the conceptual bottlenecks are.

The first one is a little flip, but it's basically that I think we used up all the cryptography in proving the worst-case lower bounds. What I mean is, by analogy to learning theory: if you look at the history of proving hardness results for PAC learning, where similar types of problems arise, very shortly after PAC learning was defined it was shown that if you believe in cryptography, if pseudorandom functions exist, then there is some learning problem that you can solve information-theoretically but not efficiently. Once you know what a pseudorandom function is, that's a very easy reduction; it uses the minimal amount of crypto you can imagine, minimal in the sense that you need about one week of a cryptography class to get there. Then, if you want to show that it's hard to learn, say, intersections of halfspaces, you start introducing fancier cryptographic constructions: public-key encryption schemes with nice properties, pseudorandom functions that can be evaluated in very low complexity. But you have a lot of room for that, because the base worst-case hardness, for some worst-case concept class, used only the simplest crypto. In differential privacy, even to show the worst-case hardness results I described, you basically need all the crypto: indistinguishability obfuscation is the current state of the art, which is the uber crypto assumption. So you have to use so much cryptography just to understand the worst-case complexity of these problems, worst case over queries, that there aren't a lot of tools left to use. That's in some sense more a sociological statement than a technical statement, but I think it's meaningful that it's very hard to even figure out what types of assumptions or primitives from crypto would be useful for showing that these problems are hard.

The second answer, which is almost tautological, is that we just don't really understand much about how privacy restricts the types of algorithms you can use. In all of the frameworks I described, what we're doing is tying our hands in some way. We imagine we just have some simple oracle: we get to see the answers to a small number of marginal queries, or we get to see very, very noisy answers to all the queries. We tie our hands in such a way that we can implement this oracle access to P privately using the very simple private algorithms we know, really just the simple idea of adding Gaussian noise, and then we ask what we can do with that limited access to the population. But in reality we actually have the sample; it's staring us in the face, we have the whole thing, and we're in a regime where the sample and the population are basically interchangeable; it's not really hard to infer things about the population from the sample.
So the only thing stopping us is that we have to use the sample in a private way, and we just don't have a good handle on how we can access the data in a privacy-friendly way. We know there are things we can do, but we don't have a good characterization of how privacy limits you. So it's really hard to come up with reductions to, say, hard learning problems, or hard robust statistics problems, or any kind of hard inference problem, because as far as a reduction is concerned, the sample is just this thing you hold: in principle you can do most things with it, you just have to do private things, and we don't really have a sense of what makes an algorithm private. Of course, in some sense this is a tautology: we don't know how to prove lower bounds because we don't know what the set of algorithms is. But I think there's something conceptual here about why it's very hard to even come up with a candidate reduction from some of these problems to, say, planted problems or refutation problems. So I think that's where we're stuck, and hopefully I've inspired some of you to think about this. I'll stop there and take questions.

Q: In the definition of this private SQ model, you only allow oblivious queries. If I want to make adaptive queries, where the second query depends on the first one, does that totally change the picture? What happens?

A: Great question; that's one of the slides I cut. The way I defined it, we just have a fixed set of queries: I want the marginals. You can of course imagine treating the data set as something you can access interactively. We know a little more about that. That problem is in some sense harder; it's easier to prove lower bounds for. But it's also in some sense easier, in the sense that maybe I have this huge set of marginals, d^100 of them, but I don't want to answer all of them, I just want to let you ask me for some subset of them, so you can get into a realm where you could imagine it's easier. The short answer is that it's just a different model. I would say that when you start thinking about these questions about nice queries, it's a little less natural: if I'm going to let you ask any query you want, and it's a marginal query, maybe I should have just answered all the marginal queries in the first place. But it's definitely something you can study. We do know that in the worst case it's very hard: if I'm going to let you ask k arbitrary queries adaptively, I need about √k samples to do that.

Q: Is that true even without differential privacy?

A: Ah, good question. That's actually one of the reasons I didn't talk about it: that problem is much less well understood even without privacy. The short answer is that if the dimension is unbounded, you might actually need about √k samples even without privacy, so the lower bounds are in some sense less striking. We have similar lower bounds, but it's less clear that you should have been expecting to solve that problem effectively in the first place.
Q: In the census situation you described, you do want to answer marginal queries, but actually it's only for certain subsets; you don't actually need all the m-way marginals. Suppose there's a relatively small number of marginals you actually want to know; can you take advantage of that?

A: Well, if the total number you want to know is smaller than d, then you're back in the case where you just have a small set of queries; they might as well be worst-case queries. But even then, the thing you still want to do in practice is enforce these consistency constraints. Even in the regime where you have fewer than d queries, if the queries overlap in some way you should be getting some benefit out of these consistency constraints, and there, even with a relatively small number of marginals, it may still be hard to find a consistent set of answers; it can essentially be thought of as inferring a sparse graphical model or something like that. So it's a problem where there's more hope, but there's still some hardness.

Q: Also, relatedly, back to the census motivation: it seems like some of the attributes have a really large number of possible values and some have many fewer. There are a lot of census blocks, while many other attributes have fewer possible answers, although maybe some of them should have more than you're allowed to give on the census; the census has male and female, and maybe that should be more. For race and ethnicity there are something like 63 possible ways of answering the race question, which is still way less than the number of census blocks. Can these methods take advantage of there being fewer values, and what happens when an attribute has a really large number of possible values? It seems very different from binary answers.

A: For the blocks, really what they want is more like a histogram: some set of queries answered on every individual block. There's not a lot of interest in aggregating arbitrary collections of blocks, because that's not a geographically meaningful thing. But within a block, you're right, attributes get grouped in various ways. You don't really care about ages in years; you mostly care about, for example, coarse age brackets, things like voting age versus non-voting age. And you can do a lot to exploit that. Even if the only thing you care about is representing the age distribution, say you just want the cumulative distribution function of ages, there's a lot of very non-trivial stuff you can do; that's a problem that's been studied a lot. Then you can imagine studying the problem of, within each racial category, what is the CDF of ages, and now you have something like a marginal where one of the features is not binary. I'm not going to describe the more complicated problems explicitly, but you can definitely get into interesting territory where you do very interesting things with univariate attributes, and then with marginals of univariate attributes.
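As a side note, the most basic version of the "private CDF of ages" idea mentioned above is just a noisy histogram followed by a cumulative sum; the sketch below uses the Laplace mechanism on fixed, disjoint bins. This is only the naive baseline, not the non-trivial algorithms alluded to in the answer, and the bin edges and data are made up for illustration.

```python
import numpy as np

def dp_histogram_cdf(ages, bin_edges, eps, rng=None):
    """Differentially private histogram + CDF over fixed, disjoint bins.

    Adding or removing one person changes exactly one bin count by 1, so the
    histogram has sensitivity 1 and Laplace(1/eps) noise per bin gives eps-DP.
    The CDF is obtained by post-processing (clipping, cumulative sum,
    normalization), which preserves privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts, _ = np.histogram(ages, bins=bin_edges)
    noisy = counts + rng.laplace(scale=1.0 / eps, size=len(counts))
    noisy = np.clip(noisy, 0.0, None)            # post-processing cleanup
    cdf = np.cumsum(noisy) / max(noisy.sum(), 1.0)
    return noisy, cdf

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ages = rng.integers(0, 100, size=10000)      # toy age data
    edges = np.arange(0, 105, 5)                 # five-year bins, for illustration
    hist, cdf = dp_histogram_cdf(ages, edges, eps=0.5, rng=rng)
    print(np.round(cdf, 3))
```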
Q: John, can I ask a question?

A: Sure, mystery person.

Q: I'm Pravesh; sorry, I'm joining on Zoom. This is about the Dwork-Nikolov-Talwar result, this even-versus-odd issue you were pointing out. If I understand correctly, is the following understanding of the problem right? I have d^{1.5} log d of the triples, the three-wise marginals, and I have independent Gaussians, and I would like to certify a good upper bound, via some efficient relaxation, on the Gaussian linear combination of these d^{1.5} log d marginals, and they're arbitrary marginals. Is that correct? If I could get a relaxation that gives me a good value for this, then it would imply that the Gaussian width of the polytope you were defining is certifiably small. Would that mean we resolve this?

A: I think what's wanted is just some relaxation of the set of valid moments that has small Gaussian width; it's refuting random Gaussian polynomials.

Q: Right, refuting random Gaussian polynomials. But just to confirm, the monomials in your case are not random, right? So there's a semi-random part: the monomials are arbitrary, the coefficients are random.

A: In the version I was describing, you get all of the roughly d³ entries, with a ton of noise.

Q: I see, so it's not even semi-random: you have all the monomials, but with random coefficients. I see.

A: And it's not quite the even-odd thing: if the Gaussian width is whatever gives you a d^{1.5}-type algorithm, then we already know how to get that via a reduction to tensor completion. What I want is something like d^{1.4}.

Q: I see, okay. Thanks.
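For reference, the certification problem being discussed can be written as follows for degree 3 (this is my transcription of the question, not a formal statement from the talk): given i.i.d. Gaussian coefficients g_S, efficiently certify an upper bound on

```latex
% Refuting random Gaussian polynomials over the cube:
% for g_S ~ N(0,1) i.i.d., certify an upper bound on
\max_{x \in \{-1,+1\}^d} \;\; \sum_{|S| = 3} g_S \prod_{j \in S} x_j .
% The expectation of this maximum is (up to the absolute value and scaling)
% the Gaussian width of the polytope K of valid degree-3 moment vectors, so a
% relaxation that certifies a small value certifies small Gaussian width.
```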
Q: But I'm confused about the connection with tensor completion now. As you said, if I look at the set of all order-m moments, and let's say I just look at the empirical means over a population of size... what do we call the size?

A: It's just a population, an infinite population; the sample size is n. But I'm saying that without loss of generality the sample size can be fairly small.

Q: Okay, so whatever that effective sample size is, this set of all order-m moments is a tensor whose rank is that sample size, but then we add Gaussian noise to it. When you said tensor completion, I thought that meant you're going to tell me the values, or noisy values, of a certain number of these moments, and from that I'm going to reconstruct the rest.

A: Yes, so there were two frameworks. One of them is: I give you a subset of entries of the tensor with a small amount of noise; that's a tensor completion problem. But what I'm saying is that tensor completion like that isn't forced on us here. I hold the data set, so I could give you the full tensor, but to do it privately I'd have to add a ton of noise to it. So there's a spectrum: I could either give you a small subsample of the entries with a little noise, or I could give you all the entries with a lot of noise, and I only know the connection to tensor completion if you decide to give me a small subsample. In principle you could get a private algorithm out of the other thing you described, which is not tensor completion, and that's sort of what Pravesh is asking about, obviously.

Q: I'm curious about this effective sample size, because that's like getting a low-rank approximation to the true correlation tensor, or whatever we call it, the moment tensor.

A: Yeah. Since I'm interested in entrywise error, you can just always subsample: if you want, say, error α for every coordinate, you can sample about 1/α² elements from your population, times a log of the number of moments for a union bound, and that will be accurate. So it says there exists a tensor of rank about (log of the number of moments)/α² that approximates the true moment tensor entrywise. That's specific to the fact that I want an entrywise approximation; if I wanted an approximation in, say, the matrix setting in Frobenius norm, then... I guess those are great questions. Awesome, all right, thanks.
Info
Channel: Simons Institute
Views: 542
Rating: 5 out of 5
Keywords: Simons Institute, theoretical computer science, UC Berkeley, Computer Science, Theory of Computation, Theory of Computing, Rigorous Evidence for Information-Computation Trade-offs, Jonathan Ullman
Id: JHtMn9C2cc0
Length: 59min 40sec (3580 seconds)
Published: Fri Sep 17 2021