CAROLINE UHLER:
Welcome back, everyone, to the panel From Data to
Inference and Machine Learning. It's a pleasure
for me to introduce the panel for this morning. Unfortunately, I should announce that Emily Fox, who was supposed to be on the panel, is not able to be here due to a family emergency. But I think we have a
really wonderful panel here together today. So we'll have four short talks. So first we'll start
with Guy Bresler who is an associate
professor here at MIT in the Department of Electrical
Engineering and Computer Science, and in particular
a member of LIDS. His research is in high
dimensional problems, in particular related
to or in the context of graphical models. Then we'll have
Constantine Caramanis who is in the Department
of Electrical and Computer Engineering at the University
of Texas at Austin. He does a lot of different
kinds of research, in particular on problems
around learning and computation and very large scale networks. Then we'll have Suvrit Sra,
who is also faculty here at MIT, associate
professor in the Department of Electrical Engineering
and Computer Science, also a core member of LIDS. His research is mainly at the
intersection of optimization and machine learning. And finally, I don't have to
introduce Ahmed Tewfik again. He will also give a very short overview of what all of us think about the past, the present, and the future in the area of going from data to inference and machine learning. So with that, Guy, I
let you take the podium. And since we have one speaker fewer, we'll actually not be so pressed for time. So we'll have about 15 minutes per speaker. And as in the last sessions, let's just have all the questions at the end, since we'll actually have a whole half hour just for discussion.

GUY BRESLER: Great. Thanks, Caroline. Good morning, everybody. So I'm going to talk a little
bit about a certain question, a set of questions at the
interface of computation and statistics. And I'll describe a
particular set of questions that I've been obsessed
with for the last few years that I believe is a real
goldmine for information theorists and probabilists and
really LIDS type of people. So before I jump right in,
the high level objective is to try to simplify
things a little bit, to try to get at the essence
of some of these problems and to simplify the landscape
that frankly, to me, is a little bit of an
overwhelming landscape. So it's about high dimensional
statistical estimation. There are many, many,
many estimation problems that you might be interested in. I'll just start
with one that has been of interest to many folks
in the scientific community, which is trying to
distinguish whether there is such a thing as the
Higgs boson or not. So you do a bunch of experiments
in this kind of universe. And then you collect your data: maybe you tally counts of energy versus the number of collisions at that energy, and so forth. And then somebody somewhere
has to decide, yes, this is indicative of
having a Higgs boson or hm, no, probably not. It's just some random garbage
and this whole giant collider thing was just a waste
of billions and billions of dollars. So one wants to decide
this sort of thing. And somehow the point is that
the signal is quite weak. And so it's a challenging
sort of estimation problem. But this is just
one estimation problem. Everybody has their own
favorite estimation problem. And simplifying at
the very high level, you could think of all of them
as roughly signal plus noise. And your task is to estimate
the signal and the noise. Now of course, the
details of the problem change from setting to setting. Each noise has a different
distribution, maybe different characteristics. The signal has different
characteristics, different
combinatorial structure that you might assume on it
that allows you to carry out the estimation and so forth. But at the high
level, this is maybe how one might think about
statistical estimation. Let me describe another concrete
mathematical formulation of a statistical
estimation problem. And I'll describe a
detection version of this. The goal here is just to
detect, is there even a signal. Or is it pure noise. And everything I say
in the next 10 minutes will be about the sort
of detection problem. But the same ideas will
apply, sort of verbatim, to estimation problems. Let's see if I've got a laser. Excellent. So here's the problem. This is called sparse PCA. You've probably all heard of it. This particular
model for sparse PCA is a probabilistic model
for understanding algorithms for sparse PCA. And it's called the
spiked covariance model of Johnstone and Lu. So the idea here is that there
are two possible versions of reality. One is the pure noise. And you observe n samples
from an isotropic Gaussian. The alternative is that there is a signal, and you observe n samples from a multivariate Gaussian where now the covariance matrix is no longer the identity: there's this rank 1 spike here. So theta is a signal strength parameter; it's telling you how strong the signal is. u here is the direction of the spike. And in the sparse PCA problem, the spike is sparse: there are k non-zero entries in the spike.
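Written out, with u a unit vector whose k non-zero entries form the unknown sparse direction, the two hypotheses are just:

```latex
H_0:\ x_1,\dots,x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\, I_d)
\qquad \text{versus} \qquad
H_1:\ x_1,\dots,x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\big(0,\, I_d + \theta\, u u^{\top}\big),
\quad \|u\|_2 = 1,\ \ \|u\|_0 = k.
```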
Now your goal is to distinguish which of these is the case. And we'll say that you've
succeeded if, asymptotically, as the parameters (the dimension, the number of samples, and so forth) scale together, the probability of error goes to 0. Now, there's this "average case" that appeared in the title. And I'll say average case
again and again and again. All average case means is
that this is a problem defined over probability distributions. So the idea is you
have some data set that you're trying to measure. And the data is not
adversarially generated. It's just generated by some
process that you're observing. And so, average case
refers to the problem input being described by a
probability distribution. So we're information theorists
and control theorists and stuff. And we want to
understand, what are the fundamental limits of
estimation or detection of this sort of problem. And so one can try to
plot the feasibility of whether one can do this. And so here we have on the
vertical axis the signal strength: n to the minus beta is how it's parameterized, so as beta gets larger, the signal strength gets weaker. On the horizontal axis, we have the sparsity k, parameterized as n to the alpha, so as alpha gets bigger, the number of non-zero entries in the spike gets bigger. And this is just
specifically for sparse PCA, the problem I just described. And I'm going to
have d equal to n, so the dimension that we're living in is the same as the number of samples. And that's just so we can
plot this in two dimensions. And this problem is
extremely well studied. And one of the landmark
papers about this problem is that of Berthet and Rigollet, Rigollet being our
colleague in LIDS and in the math department. And they asked, well,
when can we solve this. And they came up with the
following phase transition. If theta sits above here,
then the problem's impossible. There's just not
enough information to solve this problem. If theta is below
this or beta, I guess, is below this, then theta
is big and one can estimate. This is good. This answers the question from
the information side of things. And the next question is,
well, what about algorithms? And they analyzed some algorithms: semi-definite programming and others. And this is what was achieved by the algorithms, and it requires a bigger signal strength. There's a k squared here instead of the k that's information-theoretically necessary.
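Roughly, and up to constants and logarithmic factors as I recall the Berthet-Rigollet results, the two thresholds look like this:

```latex
\text{detection possible (information-theoretically):}\quad \theta \;\gtrsim\; \sqrt{\frac{k \log d}{n}},
\qquad\qquad
\text{known polynomial-time tests:}\quad \theta \;\gtrsim\; \sqrt{\frac{k^2 \log d}{n}}.
```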
So at this point there are two options-- either try harder and come
up with a better algorithm or give up. And it turns out that the right
answer here is to give up. Not usually what's recommended. But here you definitely
want to give up because you're not going
to find a better algorithm. And they showed this by reducing
from a conjecturally hard problem technically
to sparse PCA. And they derived
this hard triangle. And what this hard
triangle tells you is that, well, at
least for this corner point here, there is no
hope of improving the signal-to-noise ratio or
the number of samples required by efficient algorithms
under this hardness conjecture
that's widely believed. So subsequently--
or not subsequently, some of them before. But there were
dozens of algorithms analyzed for this problem. There are many, many researchers really focused on it. They analyzed all sorts of algorithms in different regimes, and depending on the regime, different algorithms are optimal or seemingly optimal. And then that's
really the question, is what happens in
these blank regions. And I jumped into the fray
with my student Matt Brennan and filled in a little
bit of these hard regions. And at this point, you could
say, well, this is great. We've fully characterized the
feasibility of this problem in all parameter regimes. But the dissatisfying
part of this and the kind of troubling part
of this is that this is a huge amount of work. And this is quite daunting
to try to then think about how in the world
are we going to do this for every other high
dimensional statistics problem. And there are many, many
such problems out there. For instance, graph clustering
problems, tensor estimation problems, various regression
problems, cryo EM, all sorts of scientifically
motivated problems, et cetera. So the sort of
motivating question here is, do we have to go
through this whole thing with dozens and dozens
of people working really hard for many years to
try to understand each of these problems separately. And when I say
understand, I mean exactly this trade off between
computational complexity and statistical complexity. So maybe a hope, a
sort of suggestion from our friends at the
other tower in Stata is maybe we can aim to simplify
the landscape a little bit. And so this is the complexity
zoo, as it's called. You can go, there's a
wiki page like this. And the complexity
zoo, what it does is it classifies
computational problems into different classes. And currently there
are apparently 544 classes and counting. So maybe there will be
a new complexity class to be discovered at some point. And the point is that
within a complexity class, all of the problems are
strongly and formally shown to be equivalent. So if you understand how one
of those problems behaves, then you understand all of them. And that's a beautiful thing. Because, well, I can't
store all of this complexity in my head, all of these dozens
of high dimensional statistics problems of interest. And this gives a much simpler landscape than one might have expected. So how does one reason about
equivalence of problems? Well, the bread and butter
of complexity theory is arguments that are
done by reduction. So reduction just means you're
taking one problem of interest, for instance SAT
and you transform it to another problem of interest,
for instance independent set in a graph. Of course, the mapping that's
doing this transformation has to be done in
polynomial time. Otherwise you're not able
to draw any conclusions about the complexity of
one versus the other. And this is really the approach
put forth in Karp's landmark "Reducibility Among Combinatorial Problems" paper that really got this field going. Now if one zooms in and looks
at how is that reduction that I just mentioned
actually done, well, it takes a
3-SAT formula and it produces a
very particular graph where there are three
nodes for every clause. And these clauses
are linked together in a very specific way. And people generically refer
to this kind of construction as gadget-based construction. And if you think
about it for a moment, you'll realize that
even if you started with a good or natural distribution over SAT formulas, for instance uniform over 3-SAT, what you end up with as output is
from a garbage distribution and has nothing to do with
any distribution over graphs that might make sense to
somebody studying graphs. Because these gadgets
are there, one has sort of a structured
output distribution. So these are reductions that
are tailored for worst case complexity. Because they're trying to reason
about those sorts of problems. And unfortunately,
they don't work for the sort of
average case problems that we're interested
in in statistics. I will say there is this
landmark work of Levin that puts forth average case
complexity as a field, really. But that whole theory
is really tailored towards completeness
results for NP problems, for distributional NP problems. And so it's ambitious
in what it aims to do. And for that reason,
things haven't really progressed as much
as one would hope. So there have been
many, many approaches in the literature
over the last decade or so to try to understand
the interface or the interplay between computation
and statistics for these sorts of high
dimensional problems that I'm talking about. I won't really describe these
in the interest of time. But I will say that there
have been a few papers on average case complexity. But average case reductions, ones that preserve the distribution from the input to the output so that you actually land on the target problem with the correct distribution that you're trying to map to, are challenging to get. So these other approaches have flourished, because they're in some ways easier routes to the predictions, these feasibility diagrams, that one hopes for. Nevertheless, there are
some huge advantages for average case reductions. Primarily, that they really
connect the problems that one is mapping, one to the other, directly. And so they simplify
the landscape. Instead of studying each
problem in isolation and repeating the
whole machinery of modern algorithmic analysis
for every class of algorithms for the new problem that
you wish to understand, the dream is that you
can just say, well, this new problem that I
don't really understand is intimately connected
to this problem that I fully understand. And you can then transfer
that understanding. So it was at this point that
I call on all of the audience and people from LIDS heritage
and LIDS-related heritage to say that this is really an
information theory problem. One is transforming a
probability distribution, essentially, into a different
probability distribution. And the way that
one does that is via an average case reduction,
which is really just a channel. And so you can think
of it as designing a channel that has as its
output the thing that you wish to have out there. And of course
computing the channel has to be done in polynomial
time and so forth. So there's a little bit of
an algorithmic flavor to it. But it's really an information theory problem: relating problems of this form.
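Schematically, and this is only the shape of the requirement (with total variation distance as one natural choice of metric), one wants a polynomial-time, possibly randomized map A such that

```latex
X \sim P_0 \;\Longrightarrow\; d_{\mathrm{TV}}\!\big(\mathcal{L}(\mathcal{A}(X)),\, Q_0\big) \to 0,
\qquad
X \sim P_1 \;\Longrightarrow\; d_{\mathrm{TV}}\!\big(\mathcal{L}(\mathcal{A}(X)),\, Q_1\big) \to 0,
```

where (P_0, P_1) are the null and planted distributions of the problem you start from and (Q_0, Q_1) are those of the problem you are mapping to.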
Now, another point, and this is maybe an aside, is that average case complexity
for these combinatorial problems, as I said, is
notoriously challenging. And there's another
thing that we have going for us here which
is that all of these problems have an SNR parameter. And when we're carrying out
a reduction from one problem to the other, we
can allow ourselves a little bit of loss
in SNR and still get interesting and
meaningful conclusions about the relationships
between the two problems if one is losing a
negligible amount of SNR. So that continuous parameter gives us a freedom that I
think makes a big difference. So building on landmark papers
of [INAUDIBLE] and [INAUDIBLE]. These are some of the reductions
that my student Matt Brennan and I have obtained. And the point here is
just to say that this is a proof of concept. The average case reductions
really are a fruitful way to move forward. And really, this is just
kind of the first tiny-- not the first, but a first-- a tiny step in the direction
that I'm advocating for just to say that there is hope, that
it is a useful way of thinking about these problems. Let me just conclude by saying
that we're hugely optimistic but there is a lot left to do. And among the many things
that one might dream about are, firstly, understanding
some of this zoo of problems that we haven't accessed, which
is the vast majority of them. One would like to understand
equivalences between problems in a very strong sense: these 25 problems are really
the same problem at their core. And ideally a more general
theory to understand, are there different
classes of problems and how do they relate-- and I'll end there.

[APPLAUSE]

CAROLINE UHLER: Thank
you very much, Guy. So maybe we'll take all
the questions at the end.

CONSTANTINE CARAMANIS:
Good morning, everyone. It's really great to be here. I want to thank Caroline for
organizing the session and John for everything and inviting me. It's 22 years ago this spring
I was fortunate to take 6.262 with Bob Gallager. That was my first-- my first contact with MIT. And then later that
year, I was able to work with Dimitri and John actually
on making some probability problems for the undergrad book. I was just an undergrad then. And then when I finally-- when I finally joined MIT
as a graduate student, Sanjoy really took me under
his wing when I joined LIDS. And I'm extremely
grateful for all of that. So where's Alan? So Alan, I didn't really get
to interact too much with him. But there's a good
reason for that. And it also left me with
one of the important lessons that I carry with me in
my life, which is that-- and it really shows
how caring LIDS is. And it's a one-stop
shop for everything. You come here to learn. But you also come
here for life advice. I have stayed in shape, tried
to stay in shape because of Alan because I never once
saw him not running. In the old building, in
the metal processing lab, whenever I saw--
passed him in the hall, he was just always
running, running, running. So LIDS has been extremely
important for me. And it's really wonderful
to be back here, such a supportive,
such a friendly place. And on the topic of
friendly, I have to say, Guy, that was a wonderful talk. I've never learned so
much in three slides in my entire life, which was
what our instructions were. So that said, let
me figure this out. Which way is forward? So I want to talk about
LIDS and my experience and in particular, machine
learning before it was cool. And I think after,
it will be cool. I think many of us are
eagerly awaiting that moment, even if we're
working in that area. So I want to mention one of the
topics that, to me, has really clarified my thinking
about a lot of problems, particularly in the area of high dimensional statistics. And the contributions
have indeed come-- I didn't put any citations
because, like I said, I have friends and I
want to keep friends. So I just left it like this. Those of you that are
familiar with the area are going to recognize
who's been responsible for a lot of these
contributions and a lot have come from LIDS
and LIDS alumni. So my particular
view of the world is always colored by
thinking of uncertainty and starting with modeling
uncertainty, which has been core to what
we've learned in LIDS. And when I think about one of
the key questions in machine learning, we can think
about it in those terms. So machine learning
is really: how do we think about, model, and deal with, from an algorithmic perspective, the uncertainty in the future distribution (what samples we're going to see), given the empirical distribution, the data that we have versus the data that we'll see. And one of the areas
where this story, I think, has at least brought
some light in my mind is the role of
convexity as we see that in high dimensional
M estimation, high dimensional statistics. So the problem is that we
have had a huge success in this community. And at this point are
considered classical, things like compressed
sensing, matrix completion, and all of those problems
in that surrounding space that many people here have spent
a lot of time thinking about. So what is the role of
convexity and how does this relate to uncertainty? So one of the key lessons that
high dimensional models force us to think about is that
the empirical distribution for a high dimensional model
where you have many fewer samples than the
dimensions can actually be very, very different than
what you're going to see next. Yet somehow, there's
some stability, which allows us to do
inference and to solve problems in this space. And the work that I think
started in compressed sensing but was really sharpened later on with concepts like restricted strong convexity and so on, illustrates that what convexity buys us in this case is that it allows us to control and to bound the important ways in which the empirical distribution differs from the population distribution.
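One common way this gets formalized, roughly in the style of the restricted strong convexity literature, is a condition on the empirical loss of the form

```latex
\big\langle \nabla \mathcal{L}_n(\theta^\star + \Delta) - \nabla \mathcal{L}_n(\theta^\star),\ \Delta \big\rangle
\;\ge\; \kappa\,\|\Delta\|_2^2 \;-\; \tau_n\,\|\Delta\|_1^2
\qquad \text{for all } \Delta,
```

which can hold with high probability even when the number of samples is far smaller than the dimension, and which is exactly the kind of control over "empirical versus population" that drives the recovery guarantees.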
And in my mind, it's these results that have given a way forward
to a lot of these problems. So the high level lesson and,
I think, for me, one of these recent successes that's come out of LIDS and its alumni and related community, is this high level lesson that convexity gives us this connection and provides stability that helps us control this uncertainty. So looking forward to what are
the present challenges now. And what do I think is
exciting along the same lines? Of course, the whole
world is exciting. So I'm just choosing
a very small slice. And I want to continue with this
with the same story of trying to understand the
uncertainty between the data that we have and we
can see, and then what is going to come next, which is the central problem in
any prediction problem. And as Peter's talk this
morning illustrated, one of the main
challenges is in coming up with new representations that are going to allow us to understand what new structure
we need to exploit. So problems that I've
been really interested in, and also many others
here, are problems that have to do with
heavy tails (Peter also talked about that this morning), but also problems where some of
your data may be corrupted. So this problem has been
around for a long time in the statistics
literature, of trying to deal with corrupted data. Everything needs a good
new name once in a while. That's the case now that we
think about neural networks. So corrupted data
at test time, these are called adversarial examples. Probably many of you have
seen this illustration of how fragile state-of-the-art neural
networks are even for image recognition where small,
almost imperceptible changes in the image can completely flip the classification.
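Just to make the mechanism concrete, here is a minimal sketch of a one-step gradient-sign perturbation in the spirit of the classic FGSM attack; the names model, x, y, and eps are my placeholders, not anything from a specific paper:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """One-step gradient-sign attack: nudge each pixel by +/- eps in the
    direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()   # small, nearly imperceptible change
    return x_adv.clamp(0.0, 1.0).detach()
```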
And then there are also attacks at training time, which is now called data poisoning. So in the context of
modeling uncertainty, I think one of the
challenges here is that we no longer
have this tool that-- we were in search
of different ways to bound and control
this difference between the data that we
have and the data that we're going to see to model
this uncertainty. I think this is one of
the main challenges. And I want to point out that
even in problems that we think are simple, when you add
something like heavy tails or add a few corruptions,
even things like solving linear equations
with heavy tails in a high dimensional setting,
is a challenging problem, even though this is so
close to problems that we've been working
on for such a long time. So I think that understanding
of this perspective and working on robustness is one
of the exciting things that's going on right now. So this is some
present challenges, we were asked to speak about
some present challenges. But aside from
technical challenges, I think that this community
faces other present challenges as well. And I want to just
devote one slide to that. I think that, as anybody
who's served on a recruiting committee, faculty
hiring committee, graduate student
committee knows, the problems of inclusion
and broadening participation in our area are extremely
challenging and difficult to address. But I think that there's
also a huge opportunity that the frenzy on neural
networks and machine learning presents for us, and I just want
to mention a few of my thoughts on that, with respect to
undergraduate research. So LIDS has been, as [INAUDIBLE]
mentioned, focused on rigor. It's very mathematical. And many of-- it's
very difficult to start on research until you're a first
year graduate student or later. It's difficult-- I found
it difficult to involve undergraduates in my work in
some kind of meaningful way. And so the problem is if
we can't involve students in a meaningful way in
research until they've already passed all the hurdles that
are hurdles to inclusion and broadening participation
then this is a problem. So how do we maintain
that rigor and do the work that we want, but also be able to meaningfully involve undergraduates? So the reason that I think that
the current hype, let's say, on machine learning
is exciting is that there's so much
work that's empirical that we can involve undergraduates who haven't yet gotten to the levels of math that we need for our work. And I think that this is a
really great possibility. And one thing that I
would like to encourage, those that haven't
thought about this before, is to let undergraduates
play around with empirical problems
in neural networks. So much of what's
exciting is basically empirical in this area, even
if you don't consider it as mission critical in what
you're going to publish. So this is something
that I'm trying to do. For those of you
that have done it, I'm really eager to hear what your experiences and lessons have been. I think this is an
opportunity, for this community in particular, to stay
focused on what we're doing but also really broaden
the population of people that we can get interested
in our problems. Something else that I think is
an important present challenge has to do with
publishing culture. And as I was typing this
out, I was like, gosh, this is like an old guy
rant that I'm going on. And it's changed dramatically
since the good old days when a LIDS graduate
could get a faculty job without any publications. But now I'm sure many
of you feel this. And it's very difficult
to change going from cycle to cycle, submission
cycle to submission cycle because you need to do
it for your students, if you're not doing
it for yourself. So there's something--
there's a momentum that I'm unclear on how to fight. On the other hand, I
have to say that I'm extremely thankful to
several anonymous reviewers for consistently rejecting one
particular paper because now I actually think that we
were onto something. But, you know, I say that
jokingly but part of me thinks like, then why did I
submit it if it wasn't ready, especially in all of these
venues that are the final word. We're not submitting
journals after this. So anyway, I think that
this is a main challenge. Another challenge that I think
is really important for us to think about is who's
setting the educational agenda in an environment where industry
is pushing ahead so fast and also students are
voting with their feet in a way that impacts all of us. So LIDS traditionally
sits on the EE side of EECS. And I'm in an ECE department. And as we all see,
just the evidence says who's getting hired. Where are our students going? They're going to
computer science. And things change. But we're all very
influenced by this. Just look at how many
places have tried to rebrand themselves, how many departments have rushed to add
machine learning courses or rushed to put data science
somewhere in their name. And so I think this is
an interesting question. Looking forward, I
think that academics are generally conservative in
terms of how much we speculate. But we actually are speculating
without realizing it every time we teach a course. We're speculating in
the most risky way because, whatever we teach, you can't go back and find students
that graduated two years ago and change your mind because
you've rethought what you're going to teach them. So we're basically making bets
right now that we can't-- we can't change our position. So it's very interesting. And I think we need to grapple
with that to understand it. So I have to say that
when I think about LIDS, when I was here
there wasn't really such an emphasis on machine learning, but we learned
about control and feedback and partial feedback
and optimization, distributed parallel
computation, dynamic programming,
approximate dynamic programming. And magically, those seem to
be exactly the right things. So I'm in awe of all
of the LIDS faculty that designed that curriculum
and had this foresight. And I think it's going to be--
it's a main challenge for us to think about how
we're going to do that-- how we're going to
do that again and be as successful as they were. Thanks very much to everyone who's had such a great influence
on me and everyone else. I look forward to hearing
the rest of the panel. Thank you.

[APPLAUSE]

CAROLINE UHLER: Thank you very much, Constantine.

SUVRIT SRA: Hi. My name is Suvrit. I'm a LIDS member. And let me begin by actually
narrating a little bit in honor of the three honorees that
I have interacted with. So when I came to MIT I was
next door neighbor to Alan. And I recall how
welcoming Alan was and how many times I just
walked into his office and he shared his
valuable wisdom. But even beyond wisdom, the
energy and the enthusiasm that he radiated, I found
that really inspirational. And on the same floor,
I got to know Sanjoy. And Sanjoy is really impressive
in so many dimensions. He is a true scholar. He has interests
broadly in science. He's not a narrow person at all. And I loved the
tremendous breadth. And I recall pretty much
any mathematical topic that I mentioned to
Sanjoy, that these days I'm interested in geometric
measure theory, whatever, any topic, Sanjoy says,
look at this book. And Sanjoy always
had a connection or a point of reference
which broadened my exposure of
knowledge to anything I talked about with him. And then I changed
offices recently. And I am next door
neighbor to Dimitri. And in fact, my journey
with optimization, a lot of my research
lies in optimization, begins with actually
learning from Dimitri's book on non-linear programming. And actually, I knew
Dimitri long before, personally, before
I came to MIT. Because I organized
at that time the NIPS, these days NeurIPS, conference Workshop on Optimization for Machine Learning. And at the very first edition of that workshop, Dimitri was a plenary speaker. And then at the 10th anniversary
edition, he was again there. And so my association
with him goes a long way back. And I just find that, well, Dimitri, you write books faster
than I can ever read them. It's just amazing. But probably that has
happened to many of us. So that was honoring them. So now in light
of Alan's session, let me just comment a few
things about some past stuff, past-past stuff, and then
ultra-biased recent past and present, ultra-biased as
in bias to my own research interests. So I'm not going to read
through the laundry list. But I just wanted to mention-- I actually call it LIDS Related. Most of it is Alan-related
and Alan-plus-Pablo-related, which is up there. Because I just wanted to mention
that stuff in this session, really impressive work which
you have seen references to on graphical models and
inference from Erik Sudderth and Martin Wainwright. This is still biased. I didn't manage to comb through
the massive list of alumni that Alan has on his web page. And foundational work on
sparsity with both statistics and optimization flavor
and lots of other stuff I just associated
some names with that. And within that context,
within that broad context of statistics, signal
processing, machine learning, and optimization, I
kind of feel that I fit kind of right in there as a
joint member of Alan's, Sanjoy's, and Dimitri's groups. And so let me mention a
little bit about stuff from my recent past now. Because I am proud that my first
three PhD students graduated earlier this year. So I've been looking at stuff in large scale non-convex optimization, actually, since even before large scale non-convex optimization
became the thing, thanks to deep neural networks. And we have some
interesting results in there regarding stochastic
gradient methods which probably many of you have heard of. But if you want to
have methods that are provably and empirically
better than that method, you want to save on computation. How do you go about
designing that? And a different topic,
which is, again, intimately motivated
by challenges and practical problems. Consider the following
simple problem. You have n items. And you wish to pick
a small number of them to recommend to a user. So if you have n items, the
number of possible subsets is two to the n. So you have an
exponentially large number of subsets
to pick from. And you now want to pick out
of that a diverse subset. You don't want to recommend
the same thing repeatedly, like Amazon and
Twitter still do, ruining life for many of us. How do you go about that? So it turns out that this
harmless sounding question which underlies pretty
much any recommender system on the internet
leads to this really cool probabilistic modeling question. I have a set of n discrete items. I want to place a
probability measure on them so that if I sample according to
this measure, somehow the items that it shows me are diverse. And this leads to fascinating
areas of mathematics connecting to what is known as the theory
of real stable polynomials, which have been very influential
in the past few years. And it turns out
that by building on a variety of deep
mathematical connections, you can actually sample from
this exponentially large space in polynomial time. And not only polynomial time,
but in essentially linear time. It's kind of remarkable.
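To give a feel for the object, here is a toy sketch: think of a kernel matrix L whose entries measure similarity between items, and of diversity as preferring subsets S with large det L[S, S]. The greedy selection below is only a simplified, illustrative surrogate, not the fast sampling algorithms from the actual work, and features is a hypothetical item-embedding matrix:

```python
import numpy as np

def greedy_diverse_subset(features, k):
    """Greedily pick k items approximately maximizing log det of L[S, S],
    where L = features @ features.T is a similarity kernel. This is a
    MAP-style surrogate for sampling from a determinantal point process."""
    L = features @ features.T
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            _, logdet = np.linalg.slogdet(L[np.ix_(S, S)] + 1e-9 * np.eye(len(S)))
            if logdet > best_gain:
                best_item, best_gain = i, logdet
        selected.append(best_item)
    return selected
```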
So that was the bulk of the work of two of my PhD students
on the next slide, actually. But let me actually
draw one picture. This is a lecture
room after all. So in the context of
non-convex optimization, I am forever
interested in thinking of what could be
global structures in our mathematical
models that we could identify so that despite
non-convexity we could still get tractable optimization. Of course, there are
many piecemeal structures that do exist like in
combinatorial optimization there is stuff based
on matroids, et cetera. Or in non-convex optimization,
for people in control systems, they are deeply familiar with
so-called S-lemma structure, which does the trick for you. But more broadly, if you
ask this abstract question at this level, if you have a
non-convex problem for which any local optimum
is a global optimum, is this really a
non-convex problem? Should it be solvable tractably? Or should it not be
solvable tractably? And it turns out it's
not so easy to answer this question, by the way. But it turns out that if it is-- it has this property
that any local optimum is a global optimum, modulo
some details, forget those, then there exists-- this is just the math part-- then there exists a
reparameterization of this problem so that
it starts looking convex. But not convex in
the usual sense. So I'm just going to draw
one picture to tell you, so that you-- take that picture with you. For those of you who
have heard me speak, you may be familiar
with this picture. For those of you
who are not, I think this is a valuable
picture to take with you. It broadens how you
think about convexity. So this is x. This is y. This is a point, 1 minus
t x plus t times y. Pardon my world
famous handwriting. So convex functions
satisfy this inequality. This is essentially the
definition and consequence of definition of
convexity, that the value of f at any point along
that line joining x and y is upper bounded by the corresponding weighted average of the values at the endpoints. Well, I drew a line--

AUDIENCE: [INAUDIBLE]

SUVRIT SRA: Yeah. Thank you. Thank you. Thank you. Thank you. Typo. So the cool thing now is
what if I joined x and y not by a straight line, but by a
curve which goes from x to y, parameterized by t. Suppose this curve
happens to be the shortest path on a curved
space like a manifold or some other nonlinear space. And you, again, have the
same interpolating property: at any point on the curve, the value of your function is upper bounded by the corresponding weighted average of the values at the endpoints. Well, then you get what is called the theory of geodesic convexity, which has been deeply influential in mathematics.
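In symbols, the only change from the usual definition is that the straight line is replaced by a geodesic gamma with gamma(0) = x and gamma(1) = y:

```latex
\text{convexity:}\qquad f\big((1-t)x + ty\big) \;\le\; (1-t) f(x) + t f(y), \qquad t \in [0,1],
\\[4pt]
\text{geodesic convexity:}\qquad f\big(\gamma(t)\big) \;\le\; (1-t) f\big(\gamma(0)\big) + t f\big(\gamma(1)\big), \qquad t \in [0,1].
```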
But for optimization, the point is that there is a full family, a very
rich family, of cost functions that are not convex under
the usual Euclidean lens. But they happen to have
a curved convexity. And if they happen to have
this curved convexity, can we build a theory of
polynomial time tractable optimization for them? Because if you go back to the
history of convex optimization, one of the most significant
achievements in that theory is, at the cocktail-speech level,
that convex optimization is polynomial time
tractable, et cetera. Can we make such a statement
for this much richer class of problems? That's an open question. But first results
on that direction come out in the
global complexity results for Riemannian optimization in my PhD student's thesis. And for those of you who have
been following along stuff in optimal transport,
many of you may be familiar with
optimal transport, you may have heard of it. The last Fields Medal went to an optimal transport guy, and before that also it went
to an optimal transport guy at some point. One of the most important
results in there is what they call
displacement convexity. Displacement convexity
is a special instance of such type of convexity. So just saying that this
concept actually has been around in math, but for optimization
we are only beginning to explore its ramifications. So let's see how
the looking-forward part goes. So I kind of actually already
told you about the present. Let me just mention one
or two concrete results as additional takeaways in case
you don't like these pictures. Maybe you like those ones. This is actually joint work with
Ali, who's sitting here, and our students, studying
theory of deep learning. For instance, a result
that I'd like you to-- one of my favorite recent
results on this kind of theory is we're trying to understand
how good at overfitting are neural networks from
a formal perspective, like coming back to,
say, Peter's talk. And it turns out, a very
interesting property, if you have n training
data points, regardless, in some finite dimensional space, then there is a two-hidden-layer neural network, using these [INAUDIBLE] nonlinearities that Peter also mentioned, that can perfectly
memorize your data. That means you can
get training error 0. And this is a tight
result. So there's a necessary and
sufficient condition on the size of the network. This is not saying, OK, send
the network size to infinity, and you have a universal approximator. It's a very concrete
result. This is a practical sized
neural network, and it has the power to memorize your data perfectly.
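Here is a minimal empirical sketch of the phenomenon, using plain gradient descent rather than the explicit construction in the proof; the sizes n, d, and width are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn as nn

n, d, width = 200, 10, 64          # hypothetical sizes, for illustration only
X = torch.randn(n, d)              # n random training points
y = torch.randn(n, 1)              # arbitrary labels to be memorized

net = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),   # two hidden layers
    nn.Linear(width, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(5000):
    opt.zero_grad()
    loss = ((net(X) - y) ** 2).mean()      # training error on the n points
    loss.backward()
    opt.step()

print(loss.item())  # with enough width, this can be driven to (near) zero
```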
And we're trying to understand, somehow,
limits of these networks and while we're in this regime
of saying, oh, yes, overfitting is not necessarily bad, but to
formally understand when is it guaranteed that
overfitting is possible. And how much beyond the
theoretical lower bound on overfitting do we need to
make it big so that we can also make statistical statements,
the kind that Peter hinted at, and so on. So I'm not going
to go through this. Let me just make a quick comment
on the future challenges. So fortunately, the first panel
on future challenges, Peter already talked about those. But let me take a step
back and say that given how machine learning, data science,
and practical statistics, how widely they have expanded,
some much bigger topics should really be our
focus of the future. So of course, progress comes
by working on fundamental stuff block by block. But it's very
important, in my view, to keep in mind these
bigger goals like can we now translate our ideas
into important applications in science? Broadly-- physics, chemistry,
biology for instance, medicine, and so on. Crucial questions, kind
of hinted a little bit by Constantine. But more broadly, what are the
ethical aspects of the research we are doing? You're building these machines
with decision capabilities, influencing the lives of people in an uncontrollable manner. What are the implications
on ethics and discrimination and fairness, et cetera? And I'm not going to go
through the whole list. But two more very
important things in there. As we translate the progress
from all these learning and optimization methods into a
wider spectrum of problems, we do need to come
back to the questions that robust control
has grappled with. Can we rely on these systems? How to deal with the
robustness, safety, adversarial, all sorts of important concerns. And finally, I'll
stop by saying one of the ways I believe
by which we can tackle the tremendous difficulties
in sample complexity is by incorporating knowledge
from causally informed models to actually help us
pick better models, rather than just trusting
everything to the machine. So with that, I'll stop. Thank you.

[APPLAUSE]

AHMED TEWFIK: For something
a little different, and perhaps some heresy, but
I've run my slide by Caroline and she said it was fine. So what I'd like to talk about
is something a little bit different and
basically trying to go from just thinking
of machines to let's think of machines plus the
human being as one entity. And let's see what
we can do with that. And in particular, there is all
of this talk about AI machine learning putting out of jobs. And what I'd like to say here
is that perhaps AI machine learning would render us more
creative and more innovative as opposed to put
us out of jobs. So I'd like to run a few things
that we've been working on or that we've been exposed to
and use that as a motivation. And there will be
almost no equations. So this is the
heresy in this talk. But I can guarantee
you that there is some interesting results
to be obtained there. So a few years ago, we were
funded by British Petroleum to look at, sort of, the aftermath of the Macondo event. You remember what happened: there was a series of mistakes that ultimately led to this
explosion and loss of life. And if you look at
that particular event, it's not like people were given
some tough problems to solve. So they weren't given the
type of qualifier problems that we used to get
as graduate students. But what happened was a
series of simple mistakes that we all do all
the time-- you know, Alan pointed out to me, I
was looking at my title slide there. It had a typo and
I never saw it. And you're driving and
you see the red light and you go through
the red light. And so that's what happened. So there was a series of events. Some of them are mistakes
made by a single individual. Some were mistakes made
by a group of individuals. But none of it was complicated. And the final mistake
when the person realized that the driller
made that mistake, it was a little bit too late. I mean, they had a few
minutes to live essentially at that point. And so the point is,
wouldn't it be cool if, say, my Apple
watch in some sense could tell me that
I am experiencing some serious cognitive
biases at this point. And then somehow, with
augmented reality perhaps, there's a way of presenting
me with the right information in the right order so that
I make the right decisions. So another example
is some of you flew long distance to come here. And you were wearing your
noise canceling earphones. And the flight attendant comes
and starts to speak to you. And of course, you
can't hear what the flight attendant is saying. And then you start fumbling
with all of these buttons or you try to remove
your headphone. Wouldn't it be cool if your
headphone would realize that you're no longer
listening to the music, you're attempting to
listen to someone else, and then automatically
cut the music. Or wouldn't it be cool
if I was talking to Siri, let's say, and I
say Siri, what's the weather like in Cambridge. And I actually pause and I
say, tomorrow, wouldn't it be cool if Siri understood
that I was going to continue. And of course,
the answer, what's the weather like in Cambridge,
which implicitly would be today versus tomorrow, could
be very, very different as we've experienced in
the last couple of days. So in applications like
this, the point is-- and I'm trying to think
of man and machine as one. In the first instance, as
I'll say a few words later on, I can try to solve this
problem without explicitly trying to sense the human being. Meaning if I know what
kind of information that human being
was exposed to, that can give me enough hints
for me to decide what to do and how to interact
with the human being. And in these
applications, I'm not trying to come up with a generic
model for the human being. I'm very interested
in this specific human being at this particular
instance in time. And in this latter application, I really have to sense the human being. So there is some interaction,
in this case trying to sense our, quote, brainwaves
but without implanting electrodes in your brain. I'm trying to do it using
sort of the hardware that you're using,
meaning embedded in this noise canceling
earphones or your AirPods. Things can get more complicated. And I'm not going to
get into this because, for example, if, god forbid, one of us were to experience a spinal cord injury, then the communication
between the brain and what's below that spinal
cord injury is completely lost. And that's kind of a problem
because for some of our organs, we tend to have
multiple control loops. And so for example,
for my bladder, there is a local control
loop that we're born with. So if the bladder fills up
to a certain point, we void. But there is another control
loop that comes from the brain. And that control
loop is what allows us to behave the way we
behave now as adults. But on the other hand,
if that control loop is severed because they
had a spinal cord injury, then this no longer works. And in particular, as a result,
the voiding is not complete and so you can have urinary
tract infections and the like. So in this case,
what I'd like to do is not only get the
signals from the brain and then send them
back to the bladder, but I also need to send
the controls signals or the sensing signals from
the bladder back to the brain. Similarly, if I replace my-- if I lose, again, god
forbid, my limbs and then I have artificial limbs, when
we walk the sensing that we're walking on a flat
surface versus perhaps some gravel is what helps
us maintain our equilibrium. So all of these
problems are problems in which you need to think
of man and machine as one. And as has been pointed out
in a number of talks before, people really have thought
about these things long, long time ago. So this Licklider was actually
a psychologist and a computer scientist here at MIT. I'm not sure whether he was in
the predecessor of LIDS or not. But he is someone who
looked at these problems and then later on moved to BBN. And he articulated this
vision of let's start to think of man
and machine as one and what can we do with this. And again, this
is not a question of the human being
losing control. It's just a question of
augmenting our abilities even if we're, quote, normal, to
the extent that any of us is normal. So there is evidence
that actually this is quite powerful. So this is taken
from the introduction to a chess book written by Garry
Kasparov around the mid-2000s. And this is a
particular competition that happened around 2005. And it was a strange
competition, not strange. I mean, it was a
different competition in that it was an online
chess competition. And we didn't ask you-- you know if it was John
Tsitsiklis playing we didn't make sure that it was John. It could be John, it could
be a group of people, or it could be a group
of people plus machines. And some of the things that
came out of this particular game were not a surprise because it
had been established by then that the best machine actually
beat the best master chess player. But the surprising
result was the team that won was not a
team of master chess players with the best
machines out there. They were actually
a team of amateurs with some average machines. But the only difference
is that in their case, the machine was
adapted to the person. So they had worked on the
interface between the machine and the person so that
the machine could present to the person its intuition-- or not its intuition,
its analysis, I guess. And the people could then embed their intuition into the machine. And there are many, many
results along those lines so there are studies
that were performed at a couple of
hospitals in the DC area in which you can show that
the best machine learning algorithms actually beat
your best pathologists and radiologists at detecting
certain types of cancer. And I vaguely remember
the error rates for the humans being in the 4% to 5% range, and for the machine reading in the 3% to 4% range. But then if you combine
the two in the right way, you can bring these error rates
further down to 2% or less. So going back just to
illustrate how this might work, so going back to the BP example. We were in an oil rig. And the oil rig is
heavily instrumented. So everything gets measured. And beyond the oil rig, you
also have all sorts of data because you have information
about the currents in the Gulf of Mexico, you
have atmospheric information, et cetera. So all of this
information is going to then go into a
set of algorithms, so your best machine
learning algorithms. And they're going to
make some predictions. They're not going to make
some decisions for you. They're going to make
some predictions for you. And then all of this information
then is fed to a person-- or the person that
we were studying was called the driller. The driller, essentially,
is the person in charge of actually
drilling that particular well. But that driller also controls
a group of other human beings who are doing various
things on the oil rig. And there is a
ton of information that is sent through this
particular driller, all the time. So there are big screens and we
all sorts of information there. And there are groups
on the oil rig who also are looking
at the information. And there is another
group onshore that's also looking
at the information, obviously with the delay because
there's a delay in transmitting this information. And so the idea behind
this project was, we're going to take
all of this information and we're going to let
the machine make decisions based on the data that
it can understand. Not decisions, but
again, predictions. And we're going to show
that to the driller. And then we're going to
make certain decisions on what exactly to show
the driller in what order. So for those of you who
have cars that have heads up displays, you know that if
you're driving down the highway the car is just going to
show you, say, your speed. That's the only information
that's really pertinent to you at this particular
point in time. We don't want to clutter you. We're not good at dealing
with lots of data. Then as you approach and
you're at certain points and you're going to go
through certain maneuvers, then it's going to
start to show you the minimum amount
of information that you need in order
to perform the maneuver. Here it's a similar story. We're going to take the data. And then we're going to not-- the data continues to be
displayed because for liability reasons, OK, we didn't
really block anything from the driller. But we're going to show
the right information at the right time. And so that's
essentially how it works. Now how do you go
about doing this? Well, so philosophically,
going back to the lesson that I learned from Alan
as a graduate student, the idea is we would
like to come up with models for the cognitive
biases that we all have. And these models
are not necessarily models of how our brain works. But they are models
that are good enough for us to engineer the
interventions that you want to engineer. And that's not something that's
sort of strange to engineers in the sense that,
as a community, that's how we approach the
problems of audio, image, and video coding back in the
late '70s, '80s, and '90s. Basically, we didn't
try to understand exactly how the brain
works or how our eyes work or how our ears work. However, we understood from
the psychoacoustic literature, from the psychovision literature
that there are phenomena called masking which if I play one
tone and I play another tone, and if I play the other
tone at a certain magnitude, then you're not going to be
able to detect the first tone. And then using that, we then
determined by taking a signal and analyzing it
using these models, determining what information
is really important and where to spend my bits
versus other information that's not important and where
I don't have to spend my bits. So the idea here
is the same thing. I would like to take the
cognitive biases that we all experience, so we all think
we're rational thinkers and we're great. But that's not true. Even if I tell you that
we're going to test you and you're going
to still be prone to these cognitive biases. So as an example of
a cognitive bias, if I ask you how
much money would you like to pay for a bottle
of wine from Alan's cellar. And before you gave
me your answer, I ask you to add the digits
in your cell phone number, which has nothing to do with wine. If you end up with
a large number, on average, you're
going to bet a-- you're going to give
me a larger sum. And this is a very
well known phenomenon called the anchoring bias. And we are all prone to it. So next time that
you look for a job and they ask you how much
you expect as a salary, make sure to throw in a large
number that's realistic. Because on average, you're
going to get a larger salary. So the other problem
that you have here is that as human beings,
the order in which we see data affects our
decision, because it affects the cognitive biases
that we're going to see. And again, as an
example of that-- well, I'll come
back to that later. So that's the only slide
that has equations. Because the cognitive biases
were defined by the psychology literature as departure
from the Bayesian approach to decision making, we
have a good starting point in which we can take the
mathematical sophistication that we have and
then try to embed the empirical knowledge into
something on which we can act. So going back, again, to
the first two or three lectures of the detection
and estimation theory course that Alan
was teaching, we were exposed to this binary
hypothesis testing problem. And just to simplify things, if
all of my data is independent, that [INAUDIBLE] statistics
just by adding these up. And I compare with a threshold. If it's larger, I declare H1. If it's smaller, I declare H0. And the theory
tells me exactly how to form my sufficient
statistic and it also tells me, if I pick my threshold
in a particular way, what it means, whether it's cost
wise or my false alarm rate or whatever. Now you can take that and you can start to modify
it to incorporate the effects of the cognitive
biases on a human being. And one way of doing
it is to say, well, every time the human being
gets a piece of information, then that's going to be
weighted by some weight, and that weight is a function of all of the data that the human being has been exposed to up to that particular point. And then the threshold, it turns out, is also a function of what the human being was exposed to.
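Schematically, and this is only the shape of the model (the specific weight and threshold functions would come from fitting the empirical psychology literature, not from anything asserted here), the standard log-likelihood ratio test

```latex
\sum_{i=1}^{n} \log \frac{p_1(y_i)}{p_0(y_i)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau
```

gets replaced by a history-dependent version along the lines of

```latex
\sum_{i=1}^{n} w_i(y_1,\dots,y_{i-1})\, \log \frac{p_1(y_i)}{p_0(y_i)}
\;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau(y_1,\dots,y_{n}),
```

where the weights w_i and the threshold depend on what the human has already seen, and in what order.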
And if I had more time, I could take it further--
that you can establish. You can prove some bounds
on the kind of performance that you can get. There are some interesting
links that you can make to-- some basic computer
science type of algorithms like the approximate
subset problem. Because you have to
modify it because now the order in which you see
that data is different. And there are some core
problems that you can solve. So for example in
fraud detection, or the problems we looked at, you can then start to ask yourself
the question, there's a lot of data. What should the machine
do or the machine-- let's say. Let me back up. So in fraud detection, I
can design machine learning algorithms to detect fraud. They're going to be as good
as the data on which they were trained. So throw all of this
data at it, train it, and then it's going
to look at the data and make some decision. But if the person committing
fraud is smart enough, then that person is going to
try to come up with something that you haven't seen
before and may even beat this machine
learning algorithm. At that point the
question becomes, what should the machine show you. What subset of the data should the machine show you, as a human being, in order to be able to solve the problem, so that you, in combination with the machine, can get, perhaps, to the 90%, 95%, 99% range.
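To make the human-machine split concrete, here is a minimal sketch of that subset-selection step, in which the machine scores every transaction and surfaces only the k cases it finds most suspicious for human review; the scoring rule and the top-k selection are assumptions made purely for illustration, not the method from the talk.

```python
import numpy as np

def select_for_human_review(transactions, score_fn, k=10):
    """Machine side of the teaming problem: score everything, but show the
    human only the k highest-scoring (most suspicious) cases."""
    scores = np.array([score_fn(t) for t in transactions])
    top_k = np.argsort(scores)[-k:][::-1]        # indices of the k largest scores
    return [transactions[i] for i in top_k]

# Illustrative usage with a made-up score: larger amounts look more suspicious.
transactions = [{"id": i, "amount": a}
                for i, a in enumerate(np.random.lognormal(3.0, 1.0, 500))]
flagged = select_for_human_review(transactions, lambda t: t["amount"], k=5)
```

The hard part the talk points to is choosing the scoring function and the subset size so that the human plus the machine outperform either one alone.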
Now these problems are-- yes, I'm going to just take three seconds. These problems are tough. So for example again, back
to the 737 MAX problem, I'm using one of
these systems, I don't know who
designed the system. And I don't know what's
embedded in the system. And furthermore, the
system doesn't know me. So again, back to the planes. The plane lands and there's a
different crew that comes in. Yeah, we can think of it as: I log in and my profile comes up. But these are tough problems. And a lot of the theory
that was developed here over a period of
time can be extended to solve some of these problems. And I'm at least excited
about some of this work. Thank you. [APPLAUSE] CAROLINE UHLER: Thanks. So I would like to thank
all of the panel speakers for providing this very
diverse set of views on the current, the future, and
some of it also about the past. So I would like to open it up
to questions from any of you. And then I can also ask
some questions as well. AUDIENCE: Thank
you for the talks. How might this landscape of computational complexity problems change if we approach it with [INAUDIBLE] computing, neurocomputing, [INAUDIBLE] quantum computing, and even integrating [INAUDIBLE] systems for humans, to empower artificial intelligence systems? AHMED TEWFIK: So I didn't
quite get the question. AUDIENCE: The
complexity-- how might this landscape change if we
use [INAUDIBLE] computing. AHMED TEWFIK: So all of-- I mean, as we walked in
the building this morning I told you that you ask
difficult questions. But I'll try to answer
to the best of-- so I think that in
all of these things, you need a real
time intervention. It doesn't help me if I'm
analyzing this information forever. So as computing technology
continues to evolve, that's really what is
enabling us to start to think about these things. But I think what this group of people is good at, as we develop algorithms or methodologies for addressing these problems, is giving you some fundamental bounds on what we can or cannot achieve. And I think that that's
what's most helpful. Because then technology,
as it evolves, will get us closer and
closer to these bounds. CAROLINE UHLER: Any
other questions? Maybe? Oh, yeah. You want to-- PETER BARTLETT:
Just to comment on-- I guess I wanted to draw
attention to something that Constantine
said about the risks that we take in what we teach
our graduate students, that was a really interesting thing. It's kind of fun to think about the decision problem of what it is that we're going to work on in research or, taking this longer term view, what it is that we're going to teach our graduate students. And in some sense, we
really should be risk seeking in those activities. We're more like a
venture capital-- it's more like venture
capital than a mutual fund. We want the sort of very
occasional high impact things over the
humdrum small advances. A lot like the session yesterday when Ben gave his talk about this Thompson sampling view. Actually if you
look at the paper that Thompson wrote
in round about 1930, it was actually
motivated by this-- you can imagine
reading the paper that he's sitting
back philosophizing about how am I going to
decide what to work on. It's very much the same type
of perspective-- blue sky-- what should we-- what
might we hope to-- even if it's very unlikely. AUDIENCE: Motivated by what Suvrit said, this relationship between
causality and causal models on one side and machine
learning on the other, could you elaborate a little more, or do the panelists have a little bit more thought on that? I've sat here
yesterday and today. I'm not sure that I've seen
one single causal model, the equations, like
the ones that we use in dynamical systems. It's a really big
disconnect for people who work with applications. So any thoughts on that? SUVRIT SRA: So I'll
make one comment and then probably
the causal experts may have something to say. Because there's multiple
views on thinking about causal models. But where I was coming from is the following. Often, when entering a new domain, you say, hey, I'm just going to solve everything using data, I don't need your models. That's a Silicon
Valley style way of thinking about
machine learning. Whereas if you
enter a new domain where you don't have an unlimited amount of data, which is quite common, and by that I mean an unlimited amount of labeled data, then it really pays to have a better understanding and better formulation of the task you're trying to solve. So for instance, you could take,
I guess something in science. You want to help somebody
control their quantum computer better. It's a pretty complex
physical setup which requires some deep
knowledge from physics about how signals
are being read out, how they can be controlled. And that body of knowledge about how quantities interact with each other is what I broadly meant by causal models. And if you don't take
those into account, you're kind of making-- it's a wasted opportunity. AUDIENCE: Actually my question
was, what is the state of the art on that? Is anybody working
on that connection? That's my question. SUVRIT SRA: So I guess people have brought this up-- several people I've talked to in machine learning care about it, saying, OK, I care about mechanistic models so that I can reduce sample complexity. But I've heard that
from many people but only seen few examples
of putting these two things together. One recent example I can mention to you: I ran into a reinforcement learning based control problem where, trying to reduce sample complexity, somebody actually endowed their mathematical model with an approximate differential equation based model of the dynamics. And they could reduce their sample complexity on that toy task by hundreds of times.
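The exact scheme wasn't spelled out, but one common way to use an approximate dynamics model like that is Dyna-style data augmentation: pad the scarce real transitions with short rollouts generated by the model. The sketch below is only an illustration under that assumption; the function names and the Euler-step model are hypothetical.

```python
import numpy as np

def augment_with_model_rollouts(real_transitions, approx_dynamics, policy,
                                n_rollouts=100, horizon=5):
    """Pad scarce real data with rollouts from an approximate dynamics model.

    real_transitions: list of (state, action, next_state) from the real system.
    approx_dynamics:  callable (state, action) -> predicted next_state,
                      e.g., one Euler step of an approximate differential equation.
    policy:           callable state -> action.
    """
    synthetic = []
    for _ in range(n_rollouts):
        # start each synthetic rollout from a state that was actually visited
        state, _, _ = real_transitions[np.random.randint(len(real_transitions))]
        for _ in range(horizon):
            action = policy(state)
            next_state = approx_dynamics(state, action)  # model stands in for the real system
            synthetic.append((state, action, next_state))
            state = next_state
    return real_transitions + synthetic

# Illustrative usage: a crude Euler step of dx/dt = -x + u as the approximate model.
dynamics = lambda x, u: x + 0.1 * (-x + u)
policy = lambda x: -0.5 * x
real = [(1.0, -0.5, 0.85), (0.85, -0.43, 0.72)]
data = augment_with_model_rollouts(real, dynamics, policy, n_rollouts=10, horizon=3)
```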
But that's just a toy example. Something at a much bigger scale
for general learning and safety and everything, I
think people need to pay more attention
to that, which is why I put it in the future
must-think-about category. AHMED TEWFIK: I thought that
Ben's talk yesterday pretty much did it. I mean, he had the causal
part and the neural networks in the right place. CAROLINE UHLER: I
mean maybe just a word about this before we hand over. In terms of causal
modeling, I think machine learning has been
very, very successful in going just straight
to predictive modeling. But what it is really built around is that we just have observational data. And so the whole framework is designed with just observational data in mind. But I think what is
particularly exciting right now in thinking about causality
and how machine learning can maybe-- how we can actually bridge the gap from predictive to causal modeling, is that in many different fields we're actually getting access to interventional data. So that's what I find
very exciting in genomics is that we have perturbations. Or even if you
think in ads, like you have all these
perturbations. You're getting to see all
this interventional data. And this is, I think,
what will let us in the end build also
a causal framework and really bridge the gap
between predictive and causal modeling. But we need to be able to
interact with the system. Only then can we actually
learn the underlying causal structure. And I think here there's a lot
to do on really also bridging what people already
know in control theory, with what we're doing in
causal inference in particular, and really get machine
learning together here. AUDIENCE: Yeah, so
just building on that bridging question
or issue is, where does the theory
that many of you-- you all talked about bridge
with the practice that's being done in machine
learning now, which is much more heuristic? And when you look
at the applications where machine learning
has had a lot of impact on image recognition and voice
recognition and translation, to me these are
areas where we didn't have good mathematical models. So for example, we used hidden Markov models for voice and then solved them optimally mathematically, but it was a poor model. So is the place where we can bridge the fact that machine learning gives us insight into new models, new mathematical models for those problems, that we can then solve? Is it complexity, reducing complexity, understanding more about training, how long you need to train, how to extract the right kind of training data? Where are the key areas to bridge and bring our mathematical foundations to the machine learning problems that will impact the practitioners who are using it now? CAROLINE UHLER: Peter,
do you want to-- PETER BARTLETT: Sure. Yeah, so I think a lot of-- in terms of the settings
where we've seen deep learning methods, for instance,
be very successful, they've relied on an
enormous amount of data. And there has been this kind
of empirical observation that in these settings,
the more data you have, the better a model
you can build. And so I think they're not-- it's not like the case of
building a physical theory where you can have some
very precise model that's a really good reflection of reality. There are always nuances. There are always more and more subtle things that you could
include in a model, if only you had more data
to reliably estimate it. And that seems to be the
kind of part of the landscape where these methods have been
really, really successful. That's very much
non-parametric statistics. It's the domain of that area. I think it is really,
really interesting problem to understand, if we are in a
setting like a robotic setting where you have, for
some part of that, you have very precise
physical models maybe with a bit of uncertainty. For some other part, like what
this robot is interacting with, all bets are off. But perhaps you
could gather a lot of data about that [INAUDIBLE]
I think bringing those two together [INAUDIBLE] direction. SUVRIT SRA: So I would
add two comments to that. One being if we actually
managed to understand why these models
are successful, I guess then we'll be ready
to actually say, OK, how to now build
simpler better models with the other
properties that we want. But we're not quite there yet. So once-- place
where some of that's LIDS style research does
directly connect is, fine, people care about these models. But right now, say,
to train such a model, you burn so much
energy how could you reduce that energy footprint? So that's a direct
question for engineering, for hardware, as
well as optimization. And I think like one of
the places where theory is contributing
already now is to, say, OK, I want to
understand this model. I want to understand the kind
of data that it works with. So what other characteristics
in the data and what are the characteristics of the
specific deep neural network architecture that I should
abstract away and use those to guide how I design
a training algorithm? Not just use the blind
method that everybody uses to greatly reduce the cost. So that's kind of
doable right now. The other things are a
little bit further away. So I think by making
these small steps, maybe we enhance
our understanding of why these models work and
then come up with better ones. Because I think somebody else
also commented on this that-- or maybe even Peter
said, I kind of-- at least a few people
have mentioned that these models seem to be
working but somehow a priori there is no reason why this
should be the only model class worth thinking about. But we're not
there yet, clearly. AHMED TEWFIK: I don't see machine learning as the magic bullet that solves everything, and we may not understand exactly how these models work. But we understand, at least for some classes, some characterizations of them. So for example, again, you can view them as being sort of a generalization of the sparse representations that we've designed before. And then once we have that understanding, now we can start to come up with perhaps a wider class for sparse reconstruction. These are some things that seem to be extremely helpful for trying to understand the partitions and then bringing it back to detection and estimation theory. And then when you come to adversarial machine learning, then you have some fundamentals for maybe what kind of robustness you can impose on these networks, in particular with all of the redundancies they have. So at least that's the
way I'm looking at it. AUDIENCE: Yeah. Thank you for
insightful comments. I want to go back to
Constantine's issue that he raised, which I
think is really critical, and maybe hear the rest of
the panel comment on it. So another way to
state it is that we all want to address these complex, complicated problems that require a certain level of mathematical sophistication, an understanding of where the problems are impactful, and so forth. And yet we are inside this wave of very fast publications and maybe posturing in the field. And somehow, does the
community have the patience to educate the students
to get them to that level, have them work on
these hard problems? Maybe have them publish one paper at the end of the five years and then have them be ready for the job market. How would you address
these kind of questions? CAROLINE UHLER: Can I
maybe add to this question before I give it back? Something that you also said touches on the empirical side, because many of the papers actually have empirical insights. When I work with undergraduates on empirical questions, or also theoretical questions, I feel one thing is really lacking in the computer science education, and we see it also when we read machine learning papers: we're not trained to do careful experiments. So many of these papers just have very heavy claims based on really, really weak experiments. So how do we go about that? And how do we go about
actually pushing forward the field when we don't really
know what is actually true and what is not from
the experiments we have? CONSTANTINE CARAMANIS:
I'm supposed to have an answer to that? I think that we are
seeing a lot of what the students are demanding get
commoditized pretty quickly. If you think about
what-- if I teach a hands-on machine
learning class, if I want to think back
on what was good enough to get a student, an undergrad
student, a job 5 years ago, if you could run
cross-validation, use scikit-learn, basic Python,
you were in great shape. Then after that, you know how to set up TensorFlow, you know some cloud computing, the basics of PyTorch, and that's good enough. But already, things that were exciting final projects for the class 3 years ago are now so easy to do in a couple of lines of code that that's actually not important anymore.
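To make the "couple of lines of code" point concrete, here is a generic illustration (not a project from the class): a full 5-fold cross-validation run with scikit-learn.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Mean accuracy of a logistic regression classifier under 5-fold cross-validation.
X, y = load_digits(return_X_y=True)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```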
So I'm actually-- my hope is that things are moving fast enough that
the core will be revealed. But I think it's something
that we have to-- that we'll have to
deal with somehow. AHMED TEWFIK:
Basically, as technology is automating a lot of the things that we're doing, I think the fundamentals and the core become more and more important. And in some sense, even the students-- they get excited by putting things together quickly and getting some cool results. But in a competition to
get to the next level, they quickly realize
that they really need to understand
the fundamentals and they ask for it. So I'm hopeful that we actually
will get more students. Because we don't
know what job they're going to hold next year, let
alone in 20 years or 30 years. And so the fundamentals become
more and more important. And I think the students
are realizing it. SUVRIT SRA: But
I'll add to that. So, OK-- what training should undergraduates get, that's, I guess, always a moving, hard target. And it's easy to say fundamentals, but sometimes the
incentives are misaligned because as Constantine
hinted, the companies, they just want some completely
different skills from them. But a bigger challenge,
I think, already for the field of
machine learning is at the stage of
graduate students. Because they are under
tremendous pressure to get the next arXiv preprint out. And under that intense pressure, the only way to address it is by, say, being weak in experiments but having strong claims. And it's not just them. It's a broader thing
because of, let's say, hype. But somehow we can do our share
to at least, for our students, give them an environment,
which LIDS always does, to value the fundamentals. That it's OK, you know, to not succumb to this intense pressure, which they do feel from their peer world. And with at least a core of people who do care about this, that culture will eventually live longer, as I said. 20 years down the line, they
will be actually thankful that they spent that effort. CONSTANTINE
CARAMANIS: It's tough though because the
publication numbers are very different than I think what
they were even a few years ago. SUVRIT SRA: I mean,
now the undergrads who apply to MIT typically have
two, three papers in NeurIPS. GUY BRESLER: Yeah. But I think the flipside to that
is that people are overloaded and nobody has time to read
all these papers anyway. And so what you get is a kind
of self-correcting effect where people then
really appreciate that you've spent the
extra time to write this paper in a way that
would be pleasurable to read. PETER BARTLETT: But we have
a real role to play in that. In the last two or three
years, I've found myself on a bunch of
qualifying committees, qualifying exam
committees, where I've been saying to students,
you need to publish less. I think we have a
responsibility to enforce that. It's not just that it's OK. It's actually good. Getting people not doing
weak experimental science, getting them to spend a bit
of time thinking about-- spend more time thinking
about hard problems and-- AHMED TEWFIK: Also
the reality is that companies aren't recruiting
students from top universities like MIT to write a
couple of lines of code. I mean, they're hiring them
because of the deep knowledge and the creativity,
the value that they're going to offer over many years. So hopefully we
don't change that. CAROLINE UHLER: I would like
to just maybe have one more question before
we have to break, since you're waiting
for a long time. AUDIENCE: So this is
more like an observation, maybe, than a question. So over the past few days, we've
heard this phrase, LIDS type or LIDS style research. So let's take machine learning. Probably the most perfect example of LIDS style work is [INAUDIBLE] invented uniform convergence, work that was decades ahead of its time. And we have the support vector machine, which has a beautiful theory worked out. And so it went from
theory to practice. But where we are right
now is in an area where we're going from
practice-- somebody like Yann LeCun persisted
for many years trying to make these things work. And so we're more
like trying to explain experiments and observations
more like physicists than the older
style of research. Just an observation. CAROLINE UHLER: And
this is exactly what I think is actually missing. Because if we're going in the direction of a physicist, we should also be trained like one. They are trained in performing very careful experiments and we're not. And I think this is
really one of the dangers in this particular area. SUVRIT SRA: I think it's a
great thing, actually, in fact. Because if you think--
if you look back at the history of
all of mathematics, a lot of the questions
there were invented to answer physics problems. You look back at
Fourier analysis, it comes from the heat
equation, et cetera. That's just a physics problem. Pretty much all the developments-- you look back at constructions in differential geometry, and then the kind of general relativity theory that builds on top of them-- so a lot of the math has always been there to answer things that people were trying to understand in the physics world. And now, OK, we are not
looking at physical models. We are looking at
computational models. And that can inspire a
brand new statistical and mathematical thinking. So I think it's actually a great
thing that this is happening. GUY BRESLER: I think at the
same time, the understanding of fundamental limits
and the insights that one gets from that,
we can hope that that can lead to better performance. And also in some situations,
like in Ben Van Roy's talk, people genuinely don't
really have good approaches. So we need the theory to
give insight into that. So there's some of both. CAROLINE UHLER: Great. I think on this note, we
can continue our discussions over lunch. So thanks to the
panelists very much. [APPLAUSE]