CAROLINE UHLER:
Welcome back, everyone, to the panel From Data to
Inference and Machine Learning. It's a pleasure
for me to introduce the panel for this morning. Unfortunately, I should announce that Emily Fox, who was supposed to be on the panel, is not able to be here due to a family emergency. But I think we have a
really wonderful panel here together today. So we'll have four short talks. So first we'll start
with Guy Bresler who is an associate
professor here at MIT in the Department of Electrical
Engineering and Computer Science, and in particular
a member of LIDS. His research is in high
dimensional problems, in particular related
to or in the context of graphical models. Then we'll have
Constantine Caramanis who is in the Department
of Electrical and Computer Engineering at the University
of Texas at Austin. He does a lot of different
kinds of research, in particular on problems
around learning and computation and very large scale networks. Then we'll have Suvrit Sra,
who is also faculty here at MIT, associate
professor in the Department of Electrical Engineering
and Computer Science, also a core member of LIDS. His research is mainly at the
intersection of optimization and machine learning. And finally, I don't have to
introduce Ahmed Tewfik again. He will also give a very short overview of what all of us think about the past, the present, and the future in the area of going from data to inference and machine learning. So with that, Guy, I
let you take the podium. And since we have one speaker fewer, we'll actually not be so pressed for time. So we'll have about 15 minutes per speaker. And as in the last sessions, let's just have all the questions at the end, since we'll actually have a whole half hour just for discussion.

GUY BRESLER: Great. Thanks, Caroline. Good morning, everybody. So I'm going to talk a little
bit about a certain question, a set of questions at the
interface of computation and statistics. And I'll describe a
particular set of questions that I've been obsessed
with for the last few years that I believe is a real
goldmine for information theorists and probabilists and
really LIDS type of people. So before I jump right in,
the high level objective is to try to simplify
things a little bit, to try to get at the essence
of some of these problems and to simplify the landscape
that frankly, to me, is a little bit of an
overwhelming landscape. So it's about high dimensional
statistical estimation. There are many, many,
many estimation problems that you might be interested in. I'll just start
with one that has been of interest to many folks
in the scientific community, which is trying to
distinguish whether there is such a thing as the
Higgs boson or not. So you do a bunch of experiments
in this kind of universe. And then you collect your data: maybe you tally counts of energy versus the number of collisions at that energy, and so forth. And then somebody somewhere
has to decide, yes, this is indicative of
having a Higgs boson or hm, no, probably not. It's just some random garbage
and this whole giant collider thing was just a waste
of billions and billions of dollars. So one wants to decide
this sort of thing. And somehow the point is that
the signal is quite weak. And so it's a challenging
sort of estimation problem. But this is just
one estimation problem. Everybody has their own
favorite estimation problem. And simplifying at
the very high level, you could think of all of them
as roughly signal plus noise. And your task is to estimate
the signal and the noise. Now of course, the
details of the problem change from setting to setting. Each noise has a different
distribution, maybe different characteristics. The signal has different
characteristics, different
combinatorial structure that you might assume on it
that allows you to carry out the estimation and so forth. But at the high
level, this is maybe how one might think about
statistical estimation. Let me describe another concrete
mathematical formulation of a statistical
estimation problem. And I'll describe a
detection version of this. The goal here is just to
detect, is there even a signal. Or is it pure noise. And everything I say
in the next 10 minutes will be about the sort
of detection problem. But the same ideas will
apply, sort of verbatim, to estimation problems. Let's see if I've got a laser. Excellent. So here's the problem. This is called sparse PCA. You've probably all heard of it. This particular
model for sparse PCA is a probabilistic model
for understanding algorithms for sparse PCA. And it's called the
spiked covariance model of Johnstone and Lu. So the idea here is that there
are two possible versions of reality. One is the pure noise. And you observe n samples
from an isotropic Gaussian. The alternative is that there is a signal, and you observe n samples from a multivariate Gaussian where now the covariance matrix is no longer the identity: there's this rank 1 spike here. So theta is a signal strength parameter; it's telling you how strong the signal is. u here is the direction of the spike. And in the sparse PCA problem, the spike is sparse: there are k non-zero entries in the spike.
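Written out, with u a unit vector whose k non-zero entries form the unknown sparse direction, the two hypotheses are just:

```latex
H_0:\ x_1,\dots,x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\, I_d)
\qquad \text{versus} \qquad
H_1:\ x_1,\dots,x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\big(0,\, I_d + \theta\, u u^{\top}\big),
\quad \|u\|_2 = 1,\ \ \|u\|_0 = k.
```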
Now your goal is to distinguish which of these is the case. And we'll say that you've
succeeded if, asymptotically, as the parameters (the dimension, the number of samples, and so forth) scale together, the probability of error goes to 0. Now, there's this "average case" that appeared in the title. And I'll say average case
again and again and again. All average case means is
that this is a problem defined over probability distributions. So the idea is you
have some data set that you're trying to measure. And the data is not
adversarially generated. It's just generated by some
process that you're observing. And so, average case
refers to the problem input being described by a
probability distribution. So we're information theorists
and control theorists and stuff. And we want to
understand, what are the fundamental limits of
estimation or detection of this sort of problem. And so one can try to
plot the feasibility of whether one can do this. And so here we have on the
vertical axis the signal strength: n to the minus beta is how it's parameterized, so as beta gets larger, the signal strength gets weaker. On the horizontal axis, we have the sparsity k, parameterized as n to the alpha, so as alpha gets bigger, the number of non-zero entries in the spike gets bigger. And this is just
specifically for sparse PCA, the problem I just described. And I'm going to
have d equal to n, so the dimension that we're living in is the same as the number of samples. And that's just so we can
plot this in two dimensions. And this problem is
extremely well studied. And one of the landmark
papers about this problem is that of Berthet and Rigollet, Rigollet being our
colleague in LIDS and in the math department. And they asked, well,
when can we solve this. And they came up with the
following phase transition. If theta sits above here,
then the problem's impossible. There's just not
enough information to solve this problem. If theta is below
this or beta, I guess, is below this, then theta
is big and one can estimate. This is good. This answers the question from
the information side of things. And the next question is,
well, what about algorithms? And they analyzed some algorithms: semi-definite programming and others. And this is what was achieved by the algorithms, and it requires a bigger signal strength. There's a k squared here instead of the k that's information-theoretically necessary.
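Roughly, and up to constants and logarithmic factors as I recall the Berthet-Rigollet results, the two thresholds look like this:

```latex
\text{detection possible (information-theoretically):}\quad \theta \;\gtrsim\; \sqrt{\frac{k \log d}{n}},
\qquad\qquad
\text{known polynomial-time tests:}\quad \theta \;\gtrsim\; \sqrt{\frac{k^2 \log d}{n}}.
```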
So at this point there are two options-- either try harder and come
up with a better algorithm or give up. And it turns out that the right
answer here is to give up. Not usually what's recommended. But here you definitely
want to give up because you're not going
to find a better algorithm. And they showed this by reducing
from a conjecturally hard problem technically
to sparse PCA. And they derived
this hard triangle. And what this hard
triangle tells you is that, well, at
least for this corner point here, there is no
hope of improving the signal-to-noise ratio or
the number of samples required by efficient algorithms
under this hardness conjecture
that's widely believed. So subsequently--
or not subsequently, some of them before. But there were
dozens of algorithms analyzed for this problem. There are many, many researchers really focused on it. They analyzed all sorts of algorithms in different regimes, and depending on the regime, different algorithms are optimal or seemingly optimal. And then that's
really the question, is what happens in
these blank regions. And I jumped into the fray
with my student Matt Brennan and filled in a little
bit of these hard regions. And at this point, you could
say, well, this is great. We've fully characterized the
feasibility of this problem in all parameter regimes. But the dissatisfying
part of this and the kind of troubling part
of this is that this is a huge amount of work. And this is quite daunting
to try to then think about how in the world
are we going to do this for every other high
dimensional statistics problem. And there are many, many
such problems out there. For instance, graph clustering
problems, tensor estimation problems, various regression
problems, cryo EM, all sorts of scientifically
motivated problems, et cetera. So the sort of
motivating question here is, do we have to go
through this whole thing with dozens and dozens
of people working really hard for many years to
try to understand each of these problems separately. And when I say
understand, I mean exactly this trade off between
computational complexity and statistical complexity. So maybe a hope, a
sort of suggestion from our friends at the
other tower in Stata is maybe we can aim to simplify
the landscape a little bit. And so this is the complexity
zoo, as it's called. You can go, there's a
wiki page like this. And the complexity
zoo, what it does is it classifies
computational problems into different classes. And currently there
are apparently 544 classes and counting. So maybe there will be
a new complexity class to be discovered at some point. And the point is that
within a complexity class, all of the problems are
strongly and formally shown to be equivalent. So if you understand how one
of those problems behaves, then you understand all of them. And that's a beautiful thing. Because, well, I can't
store all of this complexity in my head, all of these dozens
of high dimensional statistics problems of interest. And this gives a much simpler landscape than one might have expected. So how does one reason about
equivalence of problems? Well, the bread and butter
of complexity theory is arguments that are
done by reduction. So reduction just means you're
taking one problem of interest, for instance SAT
and you transform it to another problem of interest,
for instance independent set in a graph. Of course, the mapping that's
doing this transformation has to be done in
polynomial time. Otherwise you're not able
to draw any conclusions about the complexity of
one versus the other. And this is really the approach
put forth in Karp's landmark "Reducibility Among Combinatorial Problems" paper that really got this field going. Now if one zooms in and looks
at how is that reduction that I just mentioned
actually done, well, it takes a
3-SAT formula and it produces a
very particular graph where there are three
nodes for every clause. And these clauses
are linked together in a very specific way. And people generically refer
to this kind of construction as gadget-based construction. And if you think
about it for a moment, you'll realize that
even if you started with a good or natural distribution over SAT formulas, for instance uniform over 3-SAT, what you end up with as output is
from a garbage distribution and has nothing to do with
any distribution over graphs that might make sense to
somebody studying graphs. Because these gadgets
are there, one has sort of a structured
output distribution. So these are reductions that
are tailored for worst case complexity. Because they're trying to reason
about those sorts of problems. And unfortunately,
they don't work for the sort of
average case problems that we're interested
in in statistics. I will say there is this
landmark work of Levin that puts forth average case
complexity as a field, really. But that whole theory
is really tailored towards completeness
results for NP problems, for distributional NP problems. And so it's ambitious
in what it aims to do. And for that reason,
things haven't really progressed as much
as one would hope. So there have been
many, many approaches in the literature
over the last decade or so to try to understand
the interface or the interplay between computation
and statistics for these sorts of high
dimensional problems that I'm talking about. I won't really describe these
in the interest of time. But I will say that there
have been a few papers on average case complexity. But average case reductions, ones that preserve the distribution from the input to the output so that you actually land on the target problem with the correct distribution that you're trying to map to, are challenging to get. So these other approaches have flourished, because they're in some ways easier routes to the predictions, these feasibility diagrams, that one hopes for. Nevertheless, there are
some huge advantages for average case reductions. Primarily, that they really
connect the problems that one is mapping, one to the other, directly. And so they simplify
the landscape. Instead of studying each
problem in isolation and repeating the
whole machinery of modern algorithmic analysis
for every class of algorithms for the new problem that
you wish to understand, the dream is that you
can just say, well, this new problem that I
don't really understand is intimately connected
to this problem that I fully understand. And you can then transfer
that understanding. So it was at this point that
I call on all of the audience and people from LIDS heritage
and LIDS-related heritage to say that this is really an
information theory problem. One is transforming a
probability distribution, essentially, into a different
probability distribution. And the way that
one does that is via an average case reduction,
which is really just a channel. And so you can think
of it as designing a channel that has as its
output the thing that you wish to have out there. And of course
computing the channel has to be done in polynomial
time and so forth. So there's a little bit of
an algorithmic flavor to it. But it's really an information theory problem: relating problems of this form.
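Schematically, and this is only the shape of the requirement (with total variation distance as one natural choice of metric), one wants a polynomial-time, possibly randomized map A such that

```latex
X \sim P_0 \;\Longrightarrow\; d_{\mathrm{TV}}\!\big(\mathcal{L}(\mathcal{A}(X)),\, Q_0\big) \to 0,
\qquad
X \sim P_1 \;\Longrightarrow\; d_{\mathrm{TV}}\!\big(\mathcal{L}(\mathcal{A}(X)),\, Q_1\big) \to 0,
```

where (P_0, P_1) are the null and planted distributions of the problem you start from and (Q_0, Q_1) are those of the problem you are mapping to.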
Now, another point, and this is maybe an aside, is that average case complexity
for these combinatorial problems, as I said, is
notoriously challenging. And there's another
thing that we have going for us here which
is that all of these problems have an SNR parameter. And when we're carrying out
a reduction from one problem to the other, we
can allow ourselves a little bit of loss
in SNR and still get interesting and
meaningful conclusions about the relationships
between the two problems if one is losing a
negligible amount of SNR. So that continuous parameter gives us a freedom that I
think makes a big difference. So building on landmark papers
of [INAUDIBLE] and [INAUDIBLE]. These are some of the reductions
that my student Matt Brennan and I have obtained. And the point here is
just to say that this is a proof of concept. The average case reductions
really are a fruitful way to move forward. And really, this is just
kind of the first tiny-- not the first, but a first-- a tiny step in the direction
that I'm advocating for just to say that there is hope, that
it is a useful way of thinking about these problems. Let me just conclude by saying
that we're hugely optimistic but there is a lot left to do. And among the many things
that one might dream about are, firstly, understanding
some of this zoo of problems that we haven't accessed, which
is the vast majority of them. One would like to understand
equivalences between problems in a very strong sense: these 25 problems are really
the same problem at their core. And ideally a more general
theory to understand, are there different
classes of problems and how do they relate-- and I'll end there.

[APPLAUSE]

CAROLINE UHLER: Thank
you very much, Guy. So maybe we'll take all
the questions at the end.

CONSTANTINE CARAMANIS:
Good morning, everyone. It's really great to be here. I want to thank Caroline for
organizing the session and John for everything and inviting me. It's 22 years ago this spring
I was fortunate to take 6.262 with Bob Gallager. That was my first-- my first contact with MIT. And then later that
year, I was able to work with Dimitri and John actually
on making some probability problems for the undergrad book. I was just an undergrad then. And then when I finally-- when I finally joined MIT
as a graduate student, Sanjoy really took me under
his wing when I joined LIDS. And I'm extremely
grateful for all of that. So where's Alan? So Alan, I didn't really get
to interact too much with him. But there's a good
reason for that. And it also left me with
one of the important lessons that I carry with me in
my life, which is that-- and it really shows
how caring LIDS is. And it's a one-stop
shop for everything. You come here to learn. But you also come
here for life advice. I have stayed in shape, tried
to stay in shape because of Alan because I never once
saw him not running. In the old building, in
the metal processing lab, whenever I saw--
passed him in the hall, he was just always
running, running, running. So LIDS has been extremely
important for me. And it's really wonderful
to be back here, such a supportive,
such a friendly place. And on the topic of
friendly, I have to say, Guy, that was a wonderful talk. I've never learned so
much in three slides in my entire life, which was
what our instructions were. So that said, let
me figure this out. Which way is forward? So I want to talk about
LIDS and my experience and in particular, machine
learning before it was cool. And I think after,
it will be cool. I think many of us are
eagerly awaiting that moment, even if we're
working in that area. So I want to mention one of the
topics that, to me, has really clarified my thinking
about a lot of problems, particularly in the area of high dimensional statistics. And the contributions
have indeed come-- I didn't put any citations
because, like I said, I have friends and I
want to keep friends. So I just left it like this. Those of you that are
familiar with the area are going to recognize
who's been responsible for a lot of these
contributions and a lot have come from LIDS
and LIDS alumni. So my particular
view of the world is always colored by
thinking of uncertainty and starting with modeling
uncertainty, which has been core to what
we've learned in LIDS. And when I think about one of
the key questions in machine learning, we can think
about it in those terms. So machine learning
is really: how do we think about, model, and deal with, from an algorithmic perspective, the uncertainty in the future distribution (what samples we're going to see), given the empirical distribution, the data that we have versus the data that we'll see. And one of the areas
where this story, I think, has at least brought
some light in my mind is the role of
convexity as we see that in high dimensional
M estimation, high dimensional statistics. So the problem is that we
have had a huge success in this community. And at this point are
considered classical, things like compressed
sensing, matrix completion, and all of those problems
in that surrounding space that many people here have spent
a lot of time thinking about. So what is the role of
convexity and how does this relate to uncertainty? So one of the key lessons that
high dimensional models force us to think about is that
the empirical distribution for a high dimensional model
where you have many fewer samples than the
dimensions can actually be very, very different than
what you're going to see next. Yet somehow, there's
some stability, which allows us to do
inference and to solve problems in this space. And the work that I think
started in compressed sensing but was really sharpened later on with concepts like restricted strong convexity and so on, illustrates that what convexity buys us in this case is that it allows us to control and to bound the important ways in which the empirical distribution differs from the population distribution.
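One common way this gets formalized, roughly in the style of the restricted strong convexity literature, is a condition on the empirical loss of the form

```latex
\big\langle \nabla \mathcal{L}_n(\theta^\star + \Delta) - \nabla \mathcal{L}_n(\theta^\star),\ \Delta \big\rangle
\;\ge\; \kappa\,\|\Delta\|_2^2 \;-\; \tau_n\,\|\Delta\|_1^2
\qquad \text{for all } \Delta,
```

which can hold with high probability even when the number of samples is far smaller than the dimension, and which is exactly the kind of control over "empirical versus population" that drives the recovery guarantees.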
And in my mind, it's these results that have given a way forward
to a lot of these problems. So the high level lesson and,
I think, for me, one of these recent successes that's come out of LIDS and its alumni and related community, is this high level lesson that convexity gives us this connection and provides stability that helps us control this uncertainty. So looking forward to what are
the present challenges now. And what do I think is
exciting along the same lines? Of course, the whole
world is exciting. So I'm just choosing
a very small slice. And I want to continue with this
with the same story of trying to understand the
uncertainty between the data that we have and we
can see, and then what is going to come next, which is the central problem in
any prediction problem. And as Peter's talk this
morning illustrated, one of the main
challenges is in coming up with new representations that are going to allow us to understand what new structure
we need to exploit. So problems that I've
been really interested in, and also many others
here, are problems that have to do with
heavy tails (Peter also talked about that this morning), but also problems where some of
your data may be corrupted. So this problem has been
around for a long time in the statistics
literature, of trying to deal with corrupted data. Everything needs a good
new name once in a while. That's the case now that we
think about neural networks. So corrupted data
at test time, these are called adversarial examples. Probably many of you have
seen this illustration of how fragile state-of-the-art neural
networks are even for image recognition where small,
almost imperceptible changes in the image can completely flip the classification.
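Just to make the mechanism concrete, here is a minimal sketch of a one-step gradient-sign perturbation in the spirit of the classic FGSM attack; the names model, x, y, and eps are my placeholders, not anything from a specific paper:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """One-step gradient-sign attack: nudge each pixel by +/- eps in the
    direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()   # small, nearly imperceptible change
    return x_adv.clamp(0.0, 1.0).detach()
```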
And then there are also attacks at training time, which is now called data poisoning. So in the context of
modeling uncertainty, I think one of the
challenges here is that we no longer
have this tool that-- we were in search
of different ways to bound and control
this difference between the data that we
have and the data that we're going to see to model
this uncertainty. I think this is one of
the main challenges. And I want to point out that
even in problems that we think are simple, when you add
something like heavy tails or add a few corruptions,
even things like solving linear equations
with heavy tails in a high dimensional setting,
is a challenging problem, even though this is so
close to problems that we've been working
on for such a long time. So I think that understanding
of this perspective and working on robustness is one
of the exciting things that's going on right now. So this is some
present challenges, we were asked to speak about
some present challenges. But aside from
technical challenges, I think that this community
faces other present challenges as well. And I want to just
devote one slide to that. I think that, as anybody
who's served on a recruiting committee, faculty
hiring committee, graduate student
committee knows, the problems of inclusion
and broadening participation in our area are extremely
challenging and difficult to address. But I think that there's
also a huge opportunity that the frenzy on neural
networks and machine learning presents for us, and I just want
to mention a few of my thoughts on that, with respect to
undergraduate research. So LIDS has been, as [INAUDIBLE]
mentioned, focused on rigor. It's very mathematical. And many of-- it's
very difficult to start on research until you're a first
year graduate student or later. It's difficult-- I found
it difficult to involve undergraduates in my work in
some kind of meaningful way. And so the problem is if
we can't involve students in a meaningful way in
research until they've already passed all the hurdles that
are hurdles to inclusion and broadening participation
then this is a problem. So how do we maintain
that rigor and do the work that we want, but also be able to meaningfully involve undergraduates? So the reason that I think that
the current hype, let's say, on machine learning
is exciting is that there's so much
work that's empirical that we can involve undergraduates who haven't yet gotten to the levels of math that we need for our work. And I think that this is a
really great possibility. And one thing that I
would like to encourage, those that haven't
thought about this before, is to let undergraduates
play around with empirical problems
in neural networks. So much of what's
exciting is basically empirical in this area, even
if you don't consider it as mission critical in what
you're going to publish. So this is something
that I'm trying to do. For those of you
that have done it, I'm really eager to hear what your experiences and lessons have been. I think this is an
opportunity, for this community in particular, to stay
focused on what we're doing but also really broaden
the population of people that we can get interested
in our problems. Something else that I think is
an important present challenge has to do with
publishing culture. And as I was typing this
out, I was like, gosh, this is like an old guy
rant that I'm going on. And it's changed dramatically
since the good old days when a LIDS graduate
could get a faculty job without any publications. But now I'm sure many
of you feel this. And it's very difficult
to change going from cycle to cycle, submission
cycle to submission cycle because you need to do
it for your students, if you're not doing
it for yourself. So there's something--
there's a momentum that I'm unclear on how to fight. On the other hand, I
have to say that I'm extremely thankful to
several anonymous reviewers for consistently rejecting one
particular paper because now I actually think that we
were onto something. But, you know, I say that
jokingly but part of me thinks like, then why did I
submit it if it wasn't ready, especially in all of these
venues that are the final word. We're not submitting
journals after this. So anyway, I think that
this is a main challenge. Another challenge that I think
is really important for us to think about is who's
setting the educational agenda in an environment where industry
is pushing ahead so fast and also students are
voting with their feet in a way that impacts all of us. So LIDS traditionally
sits on the EE side of EECS. And I'm in an ECE department. And as we all see,
just the evidence says who's getting hired. Where are our students going? They're going to
computer science. And things change. But we're all very
influenced by this. Just look at how many
places have tried to rebrand themselves, how many departments have rushed to add
machine learning courses or rushed to put data science
somewhere in their name. And so I think this is
an interesting question. Looking forward, I
think that academics are generally conservative in
terms of how much we speculate. But we actually are speculating
without realizing it every time we teach a course. We're speculating in
the most risky way because, whatever we teach, you can't go back and find students
that graduated two years ago and change your mind because
you've rethought what you're going to teach them. So we're basically making bets
right now that we can't-- we can't change our position. So it's very interesting. And I think we need to grapple
with that to understand it. So I have to say that
when I think about LIDS, when I was here
there wasn't really such an emphasis on machine learning, but we learned
about control and feedback and partial feedback
and optimization, distributed parallel
computation, dynamic programming,
approximate dynamic programming. And magically, those seem to
be exactly the right things. So I'm in awe of all
of the LIDS faculty that designed that curriculum
and had this foresight. And I think it's going to be--
it's a main challenge for us to think about how
we're going to do that-- how we're going to
do that again and be as successful as they were. Thanks very much to everyone who's had such a great influence
on me and everyone else. I look forward to hearing
the rest of the panel. Thank you.

[APPLAUSE]

CAROLINE UHLER: Thank you very much, Constantine.

SUVRIT SRA: Hi. My name is Suvrit. I'm a LIDS member. And let me begin by actually
narrating a little bit in honor of the three honorees that
I have interacted with. So when I came to MIT I was
next door neighbor to Alan. And I recall how
welcoming Alan was and how many times I just
walked into his office and he shared his
valuable wisdom. But even beyond wisdom, the
energy and the enthusiasm that he radiated, I found
that really inspirational. And on the same floor,
I got to know Sanjoy. And Sanjoy is really impressive
in so many dimensions. He is a true scholar. He has interests
broadly in science. He's not a narrow person at all. And I loved the
tremendous breadth. And I recall pretty much
any mathematical topic that I mentioned to
Sanjoy, that these days I'm interested in geometric
measure theory, whatever, any topic, Sanjoy says,
look at this book. And Sanjoy always
had a connection or a point of reference
which broadened my exposure of
knowledge to anything I talked about with him. And then I changed
offices recently. And I am next door
neighbor to Dimitri. And in fact, my journey
with optimization, a lot of my research
lies in optimization, begins with actually
learning from Dimitri's book on non-linear programming. And actually, I knew
Dimitri long before, personally, before
I came to MIT. Because I organized
at that time the NIPS, these days NeurIPS, conference Workshop on Optimization for Machine Learning. And at the very first edition of that workshop, Dimitri was a plenary speaker. And then at the 10th anniversary
edition, he was again there. And so my association
with him goes a long way back. And I just find that, well, Dimitri, you write books faster
than I can ever read them. It's just amazing. But probably that has
happened to many of us. So that was honoring them. So now in light
of Alan's session, let me just comment a few
things about some past stuff, past-past stuff, and then
ultra-biased recent past and present, ultra-biased as
in bias to my own research interests. So I'm not going to read
through the laundry list. But I just wanted to mention-- I actually call it LIDS Related. Most of it is Alan-related
and Alan-plus-Pablo-related, which is up there. Because I just wanted to mention
that stuff in this session, really impressive work which
you have seen references to on graphical models and
inference from Erik Sudderth and Martin Wainwright. This is still biased. I didn't manage to comb through
the massive list of alumni that Alan has on his web page. And foundational work on
sparsity with both statistics and optimization flavor
and lots of other stuff I just associated
some names with that. And within that context,
within that broad context of statistics, signal
processing, machine learning, and optimization, I
kind of feel that I fit kind of right in there as a
joint member of Alan's, Sanjoy's, and Dimitri's groups. And so let me mention a
little bit about stuff from my recent past now. Because I am proud that my first
three PhD students graduated earlier this year. So I've been looking at stuff in large scale non-convex optimization, actually, since even before large scale non-convex optimization
became the thing, thanks to deep neural networks. And we have some
interesting results in there regarding stochastic
gradient methods which probably many of you have heard of. But if you want to
have methods that are provably and empirically
better than that method, you want to save on computation. How do you go about
designing that? And a different topic,
which is, again, intimately motivated
by challenges and practical problems. Consider the following
simple problem. You have n items. And you wish to pick
a small number of them to recommend to a user. So if you have n items, the
number of possible subsets is two to the n. So you have an
exponentially large number of subsets
to pick from. And you now want to pick out
of that a diverse subset. You don't want to recommend
the same thing repeatedly, like Amazon and
Twitter still do, ruining life for many of us. How do you go about that? So it turns out that this
harmless sounding question which underlies pretty
much any recommender system on the internet
leads to this really cool probabilistic modeling question. I have a set of n discrete items. I want to place a
probability measure on them so that if I sample according to
this measure, somehow the items that it shows me are diverse. And this leads to fascinating
areas of mathematics connecting to what is known as the theory
of real stable polynomials, which have been very influential
in the past few years. And it turns out
that by building on a variety of deep
mathematical connections, you can actually sample from
this exponentially large space in polynomial time. And not only polynomial time,
but in essentially linear time. It's kind of remarkable.
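To give a feel for the object, here is a toy sketch: think of a kernel matrix L whose entries measure similarity between items, and of diversity as preferring subsets S with large det L[S, S]. The greedy selection below is only a simplified, illustrative surrogate, not the fast sampling algorithms from the actual work, and features is a hypothetical item-embedding matrix:

```python
import numpy as np

def greedy_diverse_subset(features, k):
    """Greedily pick k items approximately maximizing log det of L[S, S],
    where L = features @ features.T is a similarity kernel. This is a
    MAP-style surrogate for sampling from a determinantal point process."""
    L = features @ features.T
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            _, logdet = np.linalg.slogdet(L[np.ix_(S, S)] + 1e-9 * np.eye(len(S)))
            if logdet > best_gain:
                best_item, best_gain = i, logdet
        selected.append(best_item)
    return selected
```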
So that was the bulk of the work of two of my PhD students
on the next slide, actually. But let me actually
draw one picture. This is a lecture
room after all. So in the context of
non-convex optimization, I am forever
interested in thinking of what could be
global structures in our mathematical
models that we could identify so that despite
non-convexity we could still get tractable optimization. Of course, there are
many piecemeal structures that do exist like in
combinatorial optimization there is stuff based
on matroids, et cetera. Or in non-convex optimization,
for people in control systems, they are deeply familiar with
so-called S-lemma structure, which does the trick for you. But more broadly, if you
ask this abstract question at this level, if you have a
non-convex problem for which any local optimum
is a global optimum, is this really a
non-convex problem? Should it be solvable tractably? Or should it not be
solvable tractably? And it turns out it's
not so easy to answer this question, by the way. But it turns out that if it is-- it has this property
that any local optimum is a global optimum, modulo
some details, forget those, then there exists-- this is just the math part-- then there exists a
reparameterization of this problem so that
it starts looking convex. But not convex in
the usual sense. So I'm just going to draw
one picture to tell you, so that you-- take that picture with you. For those of you who
have heard me speak, you may be familiar
with this picture. For those of you
who are not, I think this is a valuable
picture to take with you. It broadens how you
think about convexity. So this is x. This is y. This is a point, 1 minus
t x plus t times y. Pardon my world
famous handwriting. So convex functions
satisfy this inequality. This is essentially the
definition and consequence of definition of
convexity, that the value of f at any point along
that line joining x and y is upper bounded by the corresponding weighted average of the values at the endpoints. Well, I drew a line--

AUDIENCE: [INAUDIBLE]

SUVRIT SRA: Yeah. Thank you. Thank you. Thank you. Thank you. Typo. So the cool thing now is
what if I joined x and y not by a straight line, but by a
curve which goes from x to y, parameterized by t. Suppose this curve
happens to be the shortest path on a curved
space like a manifold or some other nonlinear space. And you, again, have the
same interpolating property: at any point on the curve, the value of your function is upper bounded by the corresponding weighted average of the values at the endpoints. Well, then you get what is called the theory of geodesic convexity, which has been deeply influential in mathematics.
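In symbols, the only change from the usual definition is that the straight line is replaced by a geodesic gamma with gamma(0) = x and gamma(1) = y:

```latex
\text{convexity:}\qquad f\big((1-t)x + ty\big) \;\le\; (1-t) f(x) + t f(y), \qquad t \in [0,1],
\\[4pt]
\text{geodesic convexity:}\qquad f\big(\gamma(t)\big) \;\le\; (1-t) f\big(\gamma(0)\big) + t f\big(\gamma(1)\big), \qquad t \in [0,1].
```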
But for optimization, the point is that there is a full family, a very
rich family, of cost functions that are not convex under
the usual Euclidean lens. But they happen to have
a curved convexity. And if they happen to have
this curved convexity, can we build a theory of
polynomial time tractable optimization for them? Because if you go back to the
history of convex optimization, one of the most significant
achievements in that theory is, at the cocktail-speech level,
that convex optimization is polynomial time
tractable, et cetera. Can we make such a statement
for this much richer class of problems? That's an open question. But first results
on that direction come out in the
global complexity results for Riemannian optimization in my PhD student's thesis. And for those of you who have
been following along stuff in optimal transport,
many of you may be familiar with
optimal transport, you may have heard of it. The last Fields Medal went to an optimal transport guy, and before that also it went
to an optimal transport guy at some point. One of the most important
results in there is what they call
displacement convexity. Displacement convexity
is a special instance of such type of convexity. So just saying that this
concept actually has been around in math, but for optimization
we are only beginning to explore its ramifications. So let's see how
the looking-forward part goes. So I kind of actually already
told you about the present. Let me just mention one
or two concrete results as additional takeaways in case
you don't like these pictures. Maybe you like those ones. This is actually joint work with
Ali, who's sitting here, and our students, studying
theory of deep learning. For instance, a result
that I'd like you to-- one of my favorite recent
results on this kind of theory is we're trying to understand
how good at overfitting are neural networks from
a formal perspective, like coming back to,
say, Peter's talk. And it turns out, a very
interesting property, if you have n training
data points, regardless, in some finite dimensional space, then there is a two-hidden-layer neural network, using these [INAUDIBLE] nonlinearities that Peter also mentioned, that can perfectly
memorize your data. That means you can
get training error 0. And this is a tight
result. So there's a necessary and
sufficient condition on the size of the network. This is not saying, OK, send
the network size to infinity, and you have a universal approximator. It's a very concrete
result. This is a practical sized
neural network, and it has the power to memorize your data perfectly.
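Here is a minimal empirical sketch of the phenomenon, using plain gradient descent rather than the explicit construction in the proof; the sizes n, d, and width are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn as nn

n, d, width = 200, 10, 64          # hypothetical sizes, for illustration only
X = torch.randn(n, d)              # n random training points
y = torch.randn(n, 1)              # arbitrary labels to be memorized

net = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),   # two hidden layers
    nn.Linear(width, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(5000):
    opt.zero_grad()
    loss = ((net(X) - y) ** 2).mean()      # training error on the n points
    loss.backward()
    opt.step()

print(loss.item())  # with enough width, this can be driven to (near) zero
```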
And we're trying to understand, somehow,
limits of these networks and while we're in this regime
of saying, oh, yes, overfitting is not necessarily bad, but to
formally understand when is it guaranteed that
overfitting is possible. And how much beyond the
theoretical lower bound on overfitting do we need to
make it big so that we can also make statistical statements,
the kind that Peter hinted at, and so on. So I'm not going
to go through this. Let me just make a quick comment
on the future challenges. So fortunately, the first panel
on future challenges, Peter already talked about those. But let me take a step
back and say that given how machine learning, data science,
and practical statistics, how widely they have expanded,
some much bigger topics should really be our
focus of the future. So of course, progress comes
by working on fundamental stuff block by block. But it's very
important, in my view, to keep in mind these
bigger goals like can we now translate our ideas
into important applications in science? Broadly-- physics, chemistry,
biology for instance, medicine, and so on. Crucial questions, kind
of hinted a little bit by Constantine. But more broadly, what are the
ethical aspects of the research we are doing? You're building these machines
with decision capabilities, influencing the lives of people in an uncontrollable manner. What are the implications
on ethics and discrimination and fairness, et cetera? And I'm not going to go
through the whole list. But two more very
important things in there. As we translate the progress
from all these learning and optimization methods into a
wider spectrum of problems, we do need to come
back to the questions that robust control
has grappled with. Can we rely on these systems? How to deal with the
robustness, safety, adversarial, all sorts of important concerns. And finally, I'll
stop by saying one of the ways I believe
by which we can tackle the tremendous difficulties
in sample complexity is by incorporating knowledge
from causally informed models to actually help us
pick better models, rather than just trusting
everything to the machine. So with that, I'll stop. Thank you.

[APPLAUSE]

AHMED TEWFIK: For something
a little different, and perhaps some heresy, but
I've run my slide by Caroline and she said it was fine. So what I'd like to talk about
is something a little bit different and
basically trying to go from just thinking
of machines to let's think of machines plus the
human being as one entity. And let's see what
we can do with that. And in particular, there is all
of this talk about AI machine learning putting out of jobs. And what I'd like to say here
is that perhaps AI machine learning would render us more
creative and more innovative as opposed to put
us out of jobs. So I'd like to run a few things
that we've been working on or that we've been exposed to
and use that as a motivation. And there will be
almost no equations. So this is the
heresy in this talk. But I can guarantee
you that there is some interesting results
to be obtained there. So a few years ago, we were
funded by British Petroleum to look at, sort of, the aftermath of the Macondo event. You remember what happened: there was a series of mistakes that ultimately led to this
explosion and loss of life. And if you look at
that particular event, it's not like people were given
some tough problems to solve. So they weren't given the
type of qualifier problems that we used to get
as graduate students. But what happened was a
series of simple mistakes that we all do all
the time-- you know, Alan pointed out to me, I
was looking at my title slide there. It had a typo and
I never saw it. And you're driving and
you see the red light and you go through
the red light. And so that's what happened. So there was a series of events. Some of them are mistakes
made by a single individual. Some were mistakes made
by a group of individuals. But none of it was complicated. And the final mistake
when the person realized that the driller
made that mistake, it was a little bit too late. I mean, they had a few
minutes to live essentially at that point. And so the point is,
wouldn't it be cool if, say, my Apple
watch in some sense could tell me that
I am experiencing some serious cognitive
biases at this point. And then somehow, with
augmented reality perhaps, there's a way of presenting
me with the right information in the right order so that
I make the right decisions. So another example
is some of you flew long distance to come here. And you were wearing your
noise canceling earphones. And the flight attendant comes
and starts to speak to you. And of course, you
can't hear what the flight attendant is saying. And then you start fumbling
with all of these buttons or you try to remove
your headphone. Wouldn't it be cool if your
headphone would realize that you're no longer
listening to the music, you're attempting to
listen to someone else, and then automatically
cut the music. Or wouldn't it be cool
if I was talking to Siri, let's say, and I
say Siri, what's the weather like in Cambridge. And I actually pause and I
say, tomorrow, wouldn't it be cool if Siri understood
that I was going to continue. And of course,
the answer, what's the weather like in Cambridge,
which implicitly would be today versus tomorrow, could
be very, very different as we've experienced in
the last couple of days. So in applications like
this, the point is-- and I'm trying to think
of man and machine as one. In the first instance, as
I'll say a few words later on, I can try to solve this
problem without explicitly trying to sense the human being. Meaning if I know what
kind of information that human being
was exposed to, that can give me enough hints
for me to decide what to do and how to interact
with the human being. And in these
applications, I'm not trying to come up with a generic
model for the human being. I'm very interested
in this specific human being at this particular
instance in time. And in this latter application, I really have to sense the human being. So there is some interaction,
in this case trying to sense our, quote, brainwaves
but without implanting electrodes in your brain. I'm trying to do it using
sort of the hardware that you're using,
meaning embedded in this noise canceling
earphones or your AirPods. Things can get more complicated. And I'm not going to
get into this because, for example, if, god forbid, one of us were to experience a spinal cord injury, then the communication
between the brain and what's below that spinal
cord injury is completely lost. And that's kind of a problem
because for some of our organs, we tend to have
multiple control loops. And so for example,
for my bladder, there is a local control
loop that we're born with. So if the bladder fills up
to a certain point, we void. But there is another control
loop that comes from the brain. And that control
loop is what allows us to behave the way we
behave now as adults. But on the other hand,
if that control loop is severed because they
had a spinal cord injury, then this no longer works. And in particular, as a result,
the voiding is not complete and so you can have urinary
tract infections and the like. So in this case,
what I'd like to do is not only get the
signals from the brain and then send them
back to the bladder, but I also need to send
the controls signals or the sensing signals from
the bladder back to the brain. Similarly, if I replace my-- if I lose, again, god
forbid, my limbs and then I have artificial limbs, when
we walk the sensing that we're walking on a flat
surface versus perhaps some gravel is what helps
us maintain our equilibrium. So all of these
problems are problems in which you need to think
of man and machine as one. And as has been pointed out
in a number of talks before, people really have thought
about these things long, long time ago. So this Licklider was actually
a psychologist and a computer scientist here at MIT. I'm not sure whether he was in
the predecessor of LIDS or not. But he is someone who
looked at these problems and then later on moved to BBN. And he articulated this
vision of let's start to think of man
and machine as one and what can we do with this. And again, this
is not a question of the human being
losing control. It's just a question of
augmenting our abilities even if we're, quote, normal, to
the extent that any of us is normal. So there is evidence
that actually this is quite powerful. So this is taken
from the introduction to a chess book written by Garry
Kasparov around the mid-2000s. And this is a
particular competition that happened around 2005. And it was a strange
competition, not strange. I mean, it was a
different competition in that it was an online
chess competition. And we didn't ask you-- you know if it was John
Tsitsiklis playing we didn't make sure that it was John. It could be John, it could
be a group of people, or it could be a group
of people plus machines. And some of the things that
came out of this particular game were not a surprise because it
had been established by then that the best machine actually
beat the best master chess player. But the surprising
result was the team that won was not a
team of master chess players with the best
machines out there. They were actually
a team of amateurs with some average machines. But the only difference
is that in their case, the machine was
adapted to the person. So they had worked on the
interface between the machine and the person so that
the machine could present to the person its intuition-- or not its intuition,
its analysis, I guess. And the people could then embed their intuition into the machine. And there are many, many
results along those lines so there are studies
that were performed at a couple of
hospitals in the DC area in which you can show that
the best machine learning algorithms actually beat
your best pathologists and radiologists at detecting
certain types of cancer. And I vaguely remember
the error rates for the humans being in the 4% to 5% range, and for the machine reading in the 3% to 4% range. But then if you combine
the two in the right way, you can bring these error rates
further down to 2% or less. So going back just to
illustrate how this might work, so going back to the BP example. We were in an oil rig. And the oil rig is
heavily instrumented. So everything gets measured. And beyond the oil rig, you
also have all sorts of data because you have information
about the currents in the Gulf of Mexico, you
have atmospheric information, et cetera. So all of this
information is going to then go into a
set of algorithms, so your best machine
learning algorithms. And they're going to
make some predictions. They're not going to make
some decisions for you. They're going to make
some predictions for you. And then all of this information
then is fed to a person-- or the person that
we were studying was called the driller. The driller, essentially,
is the person in charge of actually
drilling that particular well. But that driller also controls
a group of other human beings who are doing various
things on the oil rig. And there is a
ton of information that is sent through this
particular driller, all the time. So there are big screens and we
all sorts of information there. And there are groups
on the oil rig who also are looking
at the information. And there is another
group onshore that's also looking
at the information, obviously with the delay because
there's a delay in transmitting this information. And so the idea behind
this project was, we're going to take
all of this information and we're going to let
the machine make decisions based on the data that
it can understand. Not decisions, but
again, predictions. And we're going to show
that to the driller. And then we're going to
make certain decisions on what exactly to show
the driller in what order. So for those of you who
have cars that have heads up displays, you know that if
you're driving down the highway the car is just going to
show you, say, your speed. That's the only information
that's really pertinent to you at this particular
point in time. We don't want to clutter you. We're not good at dealing
with lots of data. Then as you approach and
you're at certain points and you're going to go
through certain maneuvers, then it's going to
start to show you the minimum amount
of information that you need in order
to perform the maneuver. Here it's a similar story. We're going to take the data. And then we're going to not-- the data continues to be
displayed because for liability reasons, OK, we didn't
really block anything from the driller. But we're going to show
the right information at the right time. And so that's
essentially how it works. Now how do you go
about doing this? Well, so philosophically,
going back to the lesson that I learned from Alan
as a graduate student, the idea is we would
like to come up with models for the cognitive
biases that we all have. And these models
are not necessarily models of how our brain works. But they are models
that are good enough for us to engineer the
interventions that you want to engineer. And that's not something that's
sort of strange to engineers in the sense that,
as a community, that's how we approach the
problems of audio, image, and video coding back in the
late '70s, '80s, and '90s. Basically, we didn't
try to understand exactly how the brain
works or how our eyes work or how our ears work. However, we understood from
the psychoacoustic literature, from the psychovision literature
that there are phenomena called masking which if I play one
tone and I play another tone, and if I play the other
tone at a certain magnitude, then you're not going to be
able to detect the first tone. And then using that, we then
determined by taking a signal and analyzing it
using these models, determining what information
is really important and where to spend my bits
versus other information that's not important and where
I don't have to spend my bits. So the idea here
is the same thing. I would like to take the
cognitive biases that we all experience, so we all think
we're rational thinkers and we're great. But that's not true. Even if I tell you that
we're going to test you and you're going
to still be prone to these cognitive biases. So as an example of
a cognitive bias, if I ask you how
much money would you like to pay for a bottle
of wine from Alan's cellar. And before you gave
me your answer, I ask you to add the digits
in your cell phone number, which has nothing to do with wine. If you end up with
a large number, on average, you're
going to bet a-- you're going to give
me a larger sum. And this is a very
well known phenomenon called the anchoring bias. And we are all prone to it. So next time that
you look for a job and they ask you how much
you expect as a salary, make sure to throw in a large
number that's realistic. Because on average, you're
going to get a larger salary. So the other problem
that you have here is that as human beings,
the order in which we see data affects our
decision, because it affects the cognitive biases
that we're going to see. And again, as an
example of that-- well, I'll come
back to that later. So that's the only slide
that has equations. Because the cognitive biases
were defined by the psychology literature as departure
from the Bayesian approach to decision making, we
have a good starting point in which we can take the
mathematical sophistication that we have and
then try to embed the empirical knowledge into
something on which we can act. So going back, again, to
the first two or three lectures of the detection
and estimation theory course that Alan
was teaching, we were exposed to this binary
hypothesis testing problem. And just to simplify things, if
all of my data is independent, that [INAUDIBLE] statistics
just by adding these up. And I compare with a threshold. If it's larger, I declare H1. If it's smaller, I declare H0. And the theory
tells me exactly how to form my sufficient
statistic and it also tells me, if I pick my threshold
in a particular way, what it means, whether it's cost
wise or my false alarm rate or whatever. Now you can take that and you can start to modify
it to incorporate the effects of the cognitive
biases on a human being. And one way of doing
it is to say, well, every time the human being
gets a piece of information, then that's going to be
weighted by some weight, and that weight is a function of all of the data that the human being has been exposed to up to that particular point. And then the threshold, it turns out, is also a function of what the human being was exposed to.
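Schematically, and this is only the shape of the model (the specific weight and threshold functions would come from fitting the empirical psychology literature, not from anything asserted here), the standard log-likelihood ratio test

```latex
\sum_{i=1}^{n} \log \frac{p_1(y_i)}{p_0(y_i)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau
```

gets replaced by a history-dependent version along the lines of

```latex
\sum_{i=1}^{n} w_i(y_1,\dots,y_{i-1})\, \log \frac{p_1(y_i)}{p_0(y_i)}
\;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau(y_1,\dots,y_{n}),
```

where the weights w_i and the threshold depend on what the human has already seen, and in what order.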
And if I had more time, I could take it further--
that you can establish. You can prove some bounds
on the kind of performance that you can get. There are some interesting
links that you can make to-- some basic computer
science type of algorithms like the approximate
subset problem. Because you have to
modify it because now the order in which you see
that data is different. And there are some core
problems that you can solve. So for example in
fraud detection, or the problems we looked at, you can then start to ask yourself
the question, there's a lot of data. What should the machine
do or the machine-- let's say. Let me back up. So in fraud detection, I
can design machine learning algorithms to detect fraud. They're going to be as good
as the data on which they were trained. So throw all of this
data at it, train it, and then it's going
to look at the data and make some decision. But if the person committing
fraud is smart enough, then that person is going to
try to come up with something that you haven't seen
before and may even beat this machine
learning algorithm. At that point the
question becomes, what should the machine show you. What subset of the data should the machine show you, as a human being, in order to be able to solve the problem, so that you, in combination with the machine, can get, perhaps, to the 90%, 95%, 99% range.
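To make the human-machine split concrete, here is a minimal sketch of that subset-selection step, in which the machine scores every transaction and surfaces only the k cases it finds most suspicious for human review; the scoring rule and the top-k selection are assumptions made purely for illustration, not the method from the talk.

```python
import numpy as np

def select_for_human_review(transactions, score_fn, k=10):
    """Machine side of the teaming problem: score everything, but show the
    human only the k highest-scoring (most suspicious) cases."""
    scores = np.array([score_fn(t) for t in transactions])
    top_k = np.argsort(scores)[-k:][::-1]        # indices of the k largest scores
    return [transactions[i] for i in top_k]

# Illustrative usage with a made-up score: larger amounts look more suspicious.
transactions = [{"id": i, "amount": a}
                for i, a in enumerate(np.random.lognormal(3.0, 1.0, 500))]
flagged = select_for_human_review(transactions, lambda t: t["amount"], k=5)
```

The hard part the talk points to is choosing the scoring function and the subset size so that the human plus the machine outperform either one alone.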
Now these problems are-- yes, I'm going to just take three seconds. These problems are tough. So for example again, back
to the 737 MAX problem, I'm using one of
these systems, I don't know who
designed the system. And I don't know what's
embedded in the system. And furthermore, the
system doesn't know me. So again, back to the planes. The plane lands and there's a
different crew that comes in. Yeah, we can think of it as: I log in and my profile comes up. But these are tough problems. And a lot of the theory
that was developed here over a period of
time can be extended to solve some of these problems. And I'm at least excited
about some of this work. Thank you. [APPLAUSE] CAROLINE UHLER: Thanks. So I would like to thank
all of the panel speakers for providing this very
diverse set of views on the current, the future, and
some of it also about the past. So I would like to open it up
to questions from any of you. And then I can also ask
some questions as well. AUDIENCE: Thank
you for the talks. How might this landscape of computational complexity problems change if we approach it with [INAUDIBLE] computing, neurocomputing, [INAUDIBLE] quantum computing, and even integrating [INAUDIBLE] systems for humans, to empower artificial intelligence systems? AHMED TEWFIK: So I didn't
quite get the question. AUDIENCE: The
complexity-- how might this landscape change if we
use [INAUDIBLE] computing. AHMED TEWFIK: So all of-- I mean, as we walked in
the building this morning I told you that you ask
difficult questions. But I'll try to answer
to the best of-- so I think that in
all of these things, you need a real
time intervention. It doesn't help me if I'm
analyzing this information forever. So as computing technology
continues to evolve, that's really what is
enabling us to start to think about these things. But I think what this group of people is good at, as we develop algorithms or methodologies for addressing these problems, is giving you some fundamental bounds on what we can or cannot achieve. And I think that that's
what's most helpful. Because then technology,
as it evolves, will get us closer and
closer to these bounds. CAROLINE UHLER: Any
other questions? Maybe? Oh, yeah. You want to-- PETER BARTLETT:
Just to comment on-- I guess I wanted to draw
attention to something that Constantine
said about the risks that we take in what we teach
our graduate students, that was a really interesting thing. It's kind of fun to think about the decision problem of what it is that we're going to work on in research or, taking this longer term view, what it is that we're going to teach our graduate students. And in some sense, we
really should be risk seeking in those activities. We're more like a
venture capital-- it's more like venture
capital than a mutual fund. We want the sort of very
occasional high impact things over the
humdrum small advances. A lot like the session yesterday when Ben gave his talk about this Thompson sampling view. Actually if you
look at the paper that Thompson wrote
in round about 1930, it was actually
motivated by this-- you can imagine
reading the paper that he's sitting
back philosophizing about how am I going to
decide what to work on. It's very much the same type
of perspective-- blue sky-- what should we-- what
might we hope to-- even if it's very unlikely. AUDIENCE: Motivated by what Suvrit said, this relationship between
causality and causal models on one side and machine
learning on the other, could you elaborate a little more, or do the panelists have a little bit more thought on that? I've sat here
yesterday and today. I'm not sure that I've seen
one single causal model, the equations, like
the ones that we use in dynamical systems. It's a really big
disconnect for people who work with applications. So any thoughts on that? SUVRIT SRA: So I'll
make one comment and then probably
the causal experts may have something to say. Because there's multiple
views on thinking about causal models. But where I was coming from is the following. Often, when entering a new domain, you say, hey, I'm just going to solve everything using data, I don't need your models. That's a Silicon
Valley style way of thinking about
machine learning. Whereas if you
enter a new domain where you don't have an unlimited amount of data, which is quite common, and by that I mean an unlimited amount of labeled data, then it really pays to have a better understanding and better formulation of the task you're trying to solve. So for instance, you could take,
I guess something in science. You want to help somebody
control their quantum computer better. It's a pretty complex
physical setup which requires some deep
knowledge from physics about how signals
are being read out, how they can be controlled. And that body of knowledge about how quantities interact with each other is what I broadly meant by causal models. And if you don't take
those into account, you're kind of making-- it's a wasted opportunity. AUDIENCE: Actually my question
was, what is the state of the art on that? Is anybody working
on that connection? That's my question. SUVRIT SRA: So I guess people have brought this up-- several people I've talked to in machine learning care about it, saying, OK, I care about mechanistic models so that I can reduce sample complexity. But I've heard that
from many people but only seen few examples
of putting these two things together. One recent example I can mention to you: I ran into a reinforcement learning based control problem where, trying to reduce sample complexity, somebody actually endowed their mathematical model with an approximate differential equation based model of the dynamics. And they could reduce their sample complexity on that toy task by hundreds of times.
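The exact scheme wasn't spelled out, but one common way to use an approximate dynamics model like that is Dyna-style data augmentation: pad the scarce real transitions with short rollouts generated by the model. The sketch below is only an illustration under that assumption; the function names and the Euler-step model are hypothetical.

```python
import numpy as np

def augment_with_model_rollouts(real_transitions, approx_dynamics, policy,
                                n_rollouts=100, horizon=5):
    """Pad scarce real data with rollouts from an approximate dynamics model.

    real_transitions: list of (state, action, next_state) from the real system.
    approx_dynamics:  callable (state, action) -> predicted next_state,
                      e.g., one Euler step of an approximate differential equation.
    policy:           callable state -> action.
    """
    synthetic = []
    for _ in range(n_rollouts):
        # start each synthetic rollout from a state that was actually visited
        state, _, _ = real_transitions[np.random.randint(len(real_transitions))]
        for _ in range(horizon):
            action = policy(state)
            next_state = approx_dynamics(state, action)  # model stands in for the real system
            synthetic.append((state, action, next_state))
            state = next_state
    return real_transitions + synthetic

# Illustrative usage: a crude Euler step of dx/dt = -x + u as the approximate model.
dynamics = lambda x, u: x + 0.1 * (-x + u)
policy = lambda x: -0.5 * x
real = [(1.0, -0.5, 0.85), (0.85, -0.43, 0.72)]
data = augment_with_model_rollouts(real, dynamics, policy, n_rollouts=10, horizon=3)
```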
But that's just a toy example. Something at a much bigger scale
for general learning and safety and everything, I
think people need to pay more attention
to that, which is why I put it in the future
must-think-about category. AHMED TEWFIK: I thought that
Ben's talk yesterday pretty much did it. I mean, he had the causal
part and the neural networks in the right place. CAROLINE UHLER: I
mean maybe just a word about this before we hand over. In terms of causal
modeling, I think machine learning has been
very, very successful in going just straight
to predictive modeling. But what it is really built around is that we just have observational data. And so the whole framework is designed with just observational data in mind. But I think what is
particularly exciting right now in thinking about causality
and how machine learning can maybe-- how we can actually bridge the gap from predictive to causal modeling, is that in many different fields we're actually getting access to interventional data. So that's what I find
very exciting in genomics is that we have perturbations. Or even if you
think in ads, like you have all these
perturbations. You're getting to see all
this interventional data. And this is, I think,
what will let us in the end build also
a causal framework and really bridge the gap
between predictive and causal modeling. But we need to be able to
interact with the system. Only then can we actually
learn the underlying causal structure. And I think here there's a lot
to do on really also bridging what people already
know in control theory, with what we're doing in
causal inference in particular, and really get machine
learning together here. AUDIENCE: Yeah, so
just building on that bridging question
or issue is, where does the theory
that many of you-- you all talked about bridge
with the practice that's being done in machine
learning now, which is much more heuristic? And when you look
at the applications where machine learning
has had a lot of impact on image recognition and voice
recognition and translation, to me these are
areas where we didn't have good mathematical models. So for example, we used hidden Markov models for voice and then solved them optimally mathematically, but it was a poor model. So is the place where we can bridge the fact that machine learning gives us insight into new models, new mathematical models for those problems, that we can then solve? Is it complexity, reducing complexity, understanding more about training, how long you need to train, how to extract the right kind of training data? Where are the key areas to bridge and bring our mathematical foundations to the machine learning problems that will impact the practitioners who are using it now? CAROLINE UHLER: Peter,
do you want to-- PETER BARTLETT: Sure. Yeah, so I think a lot of-- in terms of the settings
where we've seen deep learning methods, for instance,
be very successful, they've relied on an
enormous amount of data. And there has been this kind
of empirical observation that in these settings,
the more data you have, the better a model
you can build. And so I think they're not-- it's not like the case of
building a physical theory where you can have some
very precise model that's a really good reflection of reality. There are always nuances. There are always more and more subtle things that you could
include in a model, if only you had more data
to reliably estimate it. And that seems to be the
kind of part of the landscape where these methods have been
really, really successful. That's very much
non-parametric statistics. It's the domain of that area. I think it is really,
really interesting problem to understand, if we are in a
setting like a robotic setting where you have, for
some part of that, you have very precise
physical models maybe with a bit of uncertainty. For some other part, like what
this robot is interacting with, all bets are off. But perhaps you
could gather a lot of data about that [INAUDIBLE]
I think bringing those two together [INAUDIBLE] direction. SUVRIT SRA: So I would
add two comments to that. One being if we actually
managed to understand why these models
are successful, I guess then we'll be ready
to actually say, OK, how to now build
simpler better models with the other
properties that we want. But we're not quite there yet. So once-- place
where some of that's LIDS style research does
directly connect is, fine, people care about these models. But right now, say,
to train such a model, you burn so much
energy how could you reduce that energy footprint? So that's a direct
question for engineering, for hardware, as
well as optimization. And I think like one of
the places where theory is contributing
already now is to, say, OK, I want to
understand this model. I want to understand the kind
of data that it works with. So what other characteristics
in the data and what are the characteristics of the
specific deep neural network architecture that I should
abstract away and use those to guide how I design
a training algorithm? Not just use the blind
method that everybody uses to greatly reduce the cost. So that's kind of
doable right now. The other things are a
little bit further away. So I think by making
these small steps, maybe we enhance
our understanding of why these models work and
then come up with better ones. Because I think somebody else
also commented on this that-- or maybe even Peter
said, I kind of-- at least a few people
have mentioned that these models seem to be
working but somehow a priori there is no reason why this
should be the only model class worth thinking about. But we're not
there yet, clearly. AHMED TEWFIK: I don't see machine learning as the magic bullet that solves everything, and we may not understand exactly how these models work. But we understand, at least for some classes, some characterizations of them. So for example, again, you can view them as being sort of a generalization of the sparse representations that we've designed before. And then once we have that understanding, now we can start to come up with perhaps a wider class for sparse reconstruction. These are some things that seem to be extremely helpful for trying to understand the partitions and then bringing it back to detection and estimation theory. And then when you come to adversarial machine learning, then you have some fundamentals for maybe what kind of robustness you can impose on these networks, in particular with all of the redundancies they have. So at least that's the
way I'm looking at it. AUDIENCE: Yeah. Thank you for
insightful comments. I want to go back to
Constantine's issue that he raised, which I
think is really critical, and maybe hear the rest of
the panel comment on it. So another way to
state it is that we all want to address these complex, complicated problems that require a certain level of mathematical sophistication, an understanding of where the problems are impactful, and so forth. And yet we are inside this wave of very fast publications and maybe posturing in the field. And somehow, does the
community have the patience to educate the students
to get them to that level, have them work on
these hard problems? Maybe have them publish one paper at the end of the five years and then have them be ready for the job market. How would you address
these kind of questions? CAROLINE UHLER: Can I
maybe add to this question before I give it back? Something that you also said touches on the empirical side, because many of the papers actually have empirical insights. When I work with undergraduates on empirical questions, or also theoretical questions, I feel one thing is really lacking in the computer science education, and we see it also when we read machine learning papers: we're not trained to do careful experiments. So many of these papers just have very heavy claims based on really, really weak experiments. So how do we go about that? And how do we go about
actually pushing forward the field when we don't really
know what is actually true and what is not from
the experiments we have? CONSTANTINE CARAMANIS:
I'm supposed to have an answer to that? I think that we are
seeing a lot of what the students are demanding get
commoditized pretty quickly. If you think about
what-- if I teach a hands-on machine
learning class, if I want to think back
on what was good enough to get a student, an undergrad
student, a job 5 years ago, if you could run
cross-validation, use scikit-learn, basic Python,
you were in great shape. Then after that, you know how to set up TensorFlow, you know some cloud computing, the basics of PyTorch, and that's good enough. But already, things that were exciting final projects for the class 3 years ago are now so easy to do in a couple of lines of code that that's actually not important anymore.
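To make the "couple of lines of code" point concrete, here is a generic illustration (not a project from the class): a full 5-fold cross-validation run with scikit-learn.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Mean accuracy of a logistic regression classifier under 5-fold cross-validation.
X, y = load_digits(return_X_y=True)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```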
So I'm actually-- my hope is that things are moving fast enough that
the core will be revealed. But I think it's something
that we have to-- that we'll have to
deal with somehow. AHMED TEWFIK:
Basically, as technology is automating a lot of the things that we're doing, I think the fundamentals and the core become more and more important. And in some sense, even the students-- they get excited by putting things together quickly and getting some cool results. But in a competition to
get to the next level, they quickly realize
that they really need to understand
the fundamentals and they ask for it. So I'm hopeful that we actually
will get more students. Because we don't
know what job they're going to hold next year, let
alone in 20 years or 30 years. And so the fundamentals become
more and more important. And I think the students
are realizing it. SUVRIT SRA: But
I'll add to that. So, OK-- what training should undergraduates get, that's, I guess, always a moving, hard target. And it's easy to say fundamentals, but sometimes the
incentives are misaligned because as Constantine
hinted, the companies, they just want some completely
different skills from them. But a bigger challenge,
I think, already for the field of
machine learning is at the stage of
graduate students. Because they are under
tremendous pressure to get the next arXiv preprint out. And under that intense pressure, the only way to address it is by, say, being weak in experiments but having strong claims. And it's not just them. It's a broader thing
because of, let's say, hype. But somehow we can do our share
to at least, for our students, give them an environment,
which LIDS always does, to value the fundamentals. That it's OK, you know, to not succumb to this intense pressure, which they do feel from their peer world. And with at least a core of people who do care about this, that culture will eventually live longer, as I said. 20 years down the line, they
will be actually thankful that they spent that effort. CONSTANTINE
CARAMANIS: It's tough though because the
publication numbers are very different than I think what
they were even a few years ago. SUVRIT SRA: I mean,
now the undergrads who apply to MIT typically have
two, three papers in NeurIPS. GUY BRESLER: Yeah. But I think the flipside to that
is that people are overloaded and nobody has time to read
all these papers anyway. And so what you get is a kind
of self-correcting effect where people then
really appreciate that you've spent the
extra time to write this paper in a way that
would be pleasurable to read. PETER BARTLETT: But we have
a real role to play in that. In the last two or three
years, I've found myself on a bunch of
qualifying committees, qualifying exam
committees, where I've been saying to students,
you need to publish less. I think we have a
responsibility to enforce that. It's not just that it's OK. It's actually good. Getting people not doing
weak experimental science, getting them to spend a bit
of time thinking about-- spend more time thinking
about hard problems and-- AHMED TEWFIK: Also
the reality is that companies aren't recruiting
students from top universities like MIT to write a
couple of lines of code. I mean, they're hiring them
because of the deep knowledge and the creativity,
the value that they're going to offer over many years. So hopefully we
don't change that. CAROLINE UHLER: I would like
to just maybe have one more question before
we have to break, since you're waiting
for a long time. AUDIENCE: So this is
more like an observation, maybe, than a question. So over the past few days, we've
heard this phrase, LIDS type or LIDS style research. So let's take machine learning. Probably the most perfect example of LIDS style work is [INAUDIBLE] invented uniform convergence, work that was decades ahead of its time. And we have the support vector machine, which has a beautiful theory worked out. And so it went from
theory to practice. But where we are right
now is in an area where we're going from
practice-- somebody like Yann LeCun persisted
for many years trying to make these things work. And so we're more
like trying to explain experiments and observations
more like physicists than the older
style of research. Just an observation. CAROLINE UHLER: And
this is exactly what I think is actually missing. Because if we're going in the direction of a physicist, we should also be trained like one. They are trained in performing very careful experiments and we're not. And I think this is
really one of the dangers in this particular area. SUVRIT SRA: I think it's a
great thing, actually, in fact. Because if you think--
if you look back at the history of
all of mathematics, a lot of the questions
there were invented to answer physics problems. You look back at
Fourier analysis, it comes from the heat
equation, et cetera. That's just a physics problem. Pretty much all the developments-- you look back at constructions in differential geometry, and then the kind of general relativity theory that builds on top of them-- so a lot of the math has always been there to answer things that people were trying to understand in the physics world. And now, OK, we are not
looking at physical models. We are looking at
computational models. And that can inspire a
brand new statistical and mathematical thinking. So I think it's actually a great
thing that this is happening. GUY BRESLER: I think at the
same time, the understanding of fundamental limits
and the insights that one gets from that,
we can hope that that can lead to better performance. And also in some situations,
like in Ben Van Roy's talk, people genuinely don't
really have good approaches. So we need the theory to
give insight into that. So there's some of both. CAROLINE UHLER: Great. I think on this note, we
can continue our discussions over lunch. So thanks to the
panelists very much. [APPLAUSE]