PRESENTER: Welcome, everybody. I'm very excited to
welcome Fei-Fei Li today. And of course, judging by
how packed this room is, Fei-Fei doesn't really
need an introduction. And of course, if
I actually were to introduce her
by reading her bio, it would take a majority
of today's time. So I'll keep this brief. Fei-Fei is a professor
at Stanford, where she was also my PhD advisor. She's the director of the
Human Centered AI Institute at Stanford. And during the years
of 2017 and 2018, she was also a Vice
President at Google, as well as the Chief
Scientist of AI and Machine Learning at Google Cloud. She, of course, has
published hundreds of papers. And perhaps one of the ones that
a lot of people know her for is ImageNet, which
as you all know, has ushered in the deep
learning AI revolution that we're all in today. She also serves as
a special advisor to the Secretary General
of the United Nations, and is also a member of the
National AI Resource Task Force for the White House Office
of Science and Technology. And recently, she also
has published her book titled The Worlds I See-- Curiosity, Exploration, and
Discovery at the Dawn of AI. And I'm sure she'll be talking
about parts of that book today. So with that, Fei-Fei,
welcome to the University of Washington. FEI-FEI LI: Thank you. [APPLAUSE] Thank you. Thank you. Well, it's quite an
honor to be here. Actually, as a professor, one
of the greatest joys and honors is to work with
the best students, and see how their
careers have grown. And so being invited by
Ranjay and his colleagues is really very special. And I'm just loving
all the energy I've seen throughout the day today. So OK, I want to share
with you a talk that is meant to be a little bit
high level, an overview of what I
have done over the years, through the lens
of computer vision and the development of AI. So the title is "What
We See & What We Value-- AI with a Human Perspective." I'm going to take you back
to history a little bit. And when I say history, I
mean 540 million years ago. So 540 million years
ago, what was that? Well, the Earth is
a primordial soup. And all living
things live in the water. And there aren't
that many of them. They are just simple
animals floating around. But something really
strange happened in a geologically very
short period of time, about 10 million years: from
fossil studies, scientists have found there was an explosion
in the number of animal species around that time. So much so that that period
is called the Cambrian Explosion or some people call
it the Big Bang of evolution. And so what happened? Why, suddenly, when life
was so chill and simple, not too many animals,
did life go from that picture to an explosive
number of animal species? Well, there are many
theories, from climate change to chemical composition of
the water, to many things. But one of the leading theories
of that Cambrian Explosion is by Andrew Parker, a
zoologist from Australia. He conjectured that this
speciation explosion was triggered by the sudden
evolution of vision, which set off an evolutionary arms
race where animals either evolved or died. Basically, he's saying as soon
as you see the first light, you see the world in
fundamentally different ways. You can see food. You can see shelter. You can become someone's food. And they would
actively prey on you. And you have to actively
interact and engage with the world in order
to survive and reproduce, and so on. So from that point on, from 540
million years ago to today, vision, visual intelligence, has become
a cornerstone of the development and the evolution of the nervous
system, of animal intelligence, all the way to, of course, the
most incredible visual machine we know in the universe,
which is human vision. And whether we're talking
about people and many animals, we use vision to navigate
the world, to live life, to communicate, to
entertain ourselves, to socialize, to
do so many things. Well, that was a very brief
history of nature's vision. What about computer vision? The history of computer
vision is a little shorter than evolution. Urban legend goes that around
60 years ago, 1966 I think, there was one ambitious
MIT professor who said, well, AI as a field has been born. And it looks like
it's going well. I think we can just
solve vision in a summer. In fact, we'll solve vision
by using our summer workers, undergrads, and we'll
just spend this one summer to create or
construct a significant part of a visual system. This is not a
frivolous conjecture. I actually sympathize with him. Because for humans when
you open your eyes, it feels so effortless to see. It feels that as soon
as you open your eyes, the whole world's information
is in front of you. So it turned out
to be an underestimation of how hard it is to
construct the visual system. But it was a heroic effort. They didn't solve vision
in a summer, not even a tiny bit of vision. But 60 years later, vision
today has become a very thriving field, both academically as
well as in our technology world. I'm just showing you a couple
of examples of where we are. Right? We have visual
applications everywhere. We're dreaming of self-driving
cars, which hopefully will happen in our lifetime. We are using image
classification or image recognition and so many image
technologies for many things, from health care
to just daily lives. And generative AI has
brought a whole new wave of visual applications
and breakthroughs. So the rest of the
talk is organized to answer this question. Where have we come from and
where are we heading to? And I want to share with
you three major theses of the work that
I have been doing in my career, in
the recent few years, and just to share
with you what I think. Let's begin with building
AI to see what humans see. Why do we do that? Because humans are
really good at seeing. This is a 1970s cognitive
science experiment to show you how good humans are. Every frame is refreshed
at 10 Hertz, 100 milliseconds of presentation. Let me ask you, as the audience-- I
assume, given how young you are, you weren't
even born then, so you've never seen this video. Nod your head when you see one
frame that has a person in it. You will see it. Yeah, OK. You've never seen this video. I didn't tell you what
the person looked like. I didn't tell you which
frame it will appear in. You have no idea-- the
gesture, the clothes, everything about this. Yet, you're so good at
detecting this person. Around the turn of the century,
a group of French researchers have put a time on
this effortlessness. It turned out seeing complex
objects or complex categories for humans is not only
effortless and accurate, it's fast. 150 milliseconds after the
onset of a complex photo, either containing
animals or not containing animals, you can
measure a brain signal in humans that differentiates
scene pictures with animals from scene
pictures without animals. It means that it takes about
150 milliseconds in our wetware, right here, from the photons
landing on your retina to the decision that
you can make accurately. I know this sounds
slow for silicon. But for our brain,
for those of you who come from a little bit
of neuroscience background, this is actually super fast. It takes about 10
stages of spikes passing from neuron
to neuron to get here. So it's a very
interesting measurement. So we've had psychologists
telling us humans are really good
at seeing objects. We've got neuroscientists
telling us not only are we good at it, we're fast. Now, in this last set of studies,
at around the same time, neurophysiologists used MRI studies to tell
us that evolution has optimized recognition so
much that we have dedicated neural correlates in the
brain, areas that specialize in visual recognition. For example, the
fusiform face area, or the parahippocampal
place area-- these are areas with which we
see objects and scenes. So what all this has told us,
this research from the '70s, '80s, and '90s has
told us, is that objects are really important
for visual intelligence. It's a building
block for people. And it's become a North Star
for what vision needs to do. It's not
all the North Stars, but it's one
important North Star. And that has guided the early
phase of my own research as well as the field
of computer vision. As a field, we identified that
object recognition, object categorization, is
an important problem. And it's a mathematically
really challenging problem. It's effortless for us. But to recognize, say,
a cute animal like a wombat, you actually have
mathematically infinite ways of rendering this
wombat from 3D to 2D pixels, whether it's
lighting and texture variations, or background
clutter and occlusion variations, or viewing angle
and camera angle variations, and so on. So it's mathematically
a really hard problem. So what did we do as a field? I summarize the progress
of object recognition in three phases. The first phase, a very early phase
concurrent with these cognitive studies, is what I call
the hand-designed features and models. This is where very
smart researchers used the sheer
power of their own brains to design the kind of
building blocks of objects, as well as the model, the
parameters, and so on. So we see geon theory. We see generalized cylinders. We see parts-and-springs models. And these are in the
'70s, '80s, or early '90s. They're beautiful theories. They're mathematically
beautiful models. But the thing is,
they don't work. They're theoretically beautiful. Then there's a
second phase, which I think is the most
important phase actually, leading up to deep learning,
which is machine learning. It's when we have
introduced machine learning as a statistical
modeling technique, but the inputs to these models
are hand-designed features like patches, and
parts of objects that are meant to carry a
lot of semantic information. And the idea is that in
order to recognize something like a human body, or a
face, or whatever, a chair-- it's important to get
these patches that contain ears and eyes and whatever. And then you use
machine learning models to learn the parameters
that stitch them together. And this is when the whole
field experimented with many different kinds
of statistical models, from Bayes nets, support
vector machines, boosting, conditional random fields, random
forests, to neural networks. But this is the
first phase of that. Something else important
that happened concurrently with this phase is actually
the recognition of data. In the early years
of the 21st century, the field of computer
vision recognized it's important to have
benchmarking data sets, data sets like the
PASCAL VOC data set, the Caltech 101 data set, that are meant to measure
the progress of the field. And it turned out
they can also become some level of training data. But they're very small. They're in the order of hundreds
and thousands of pictures, and a handful of
object categories. Personally for me,
this was around the time I stumbled upon
a very incredible number. I call it, if you read my book,
I call it the Biederman number. Professor Biederman, who sadly
just passed away a year ago, was a cognitive psychologist
studying vision and thinking about the scale and scope of
human visual intelligence. And on the back of an envelope, he
put a guesstimate that humans can recognize 30,000 to
100,000 object categories in their lives. And it's not a verified number. It's very hard to verify. This is a conjecture
in one of his papers. And he also went on
to say that by age 6, you actually learn pretty
much all the visual categories that a grown-up has learned. This is an incredible speed of
learning, a dozen a day or so. So this number bugged
me a lot because it just doesn't compare to all the data
sets we've seen at that point. And that was the reason for
the inception of ImageNet: my students,
Jordan, and collaborators, and I recognized that there's
a new way of thinking about visual intelligence. It's deeply, deeply data driven. And it's not just
the size of the data. It's the diversity of data. And this is really history. You all know what ImageNet is. And it also brought back
the most important family of algorithms that
is high capacity, and needs to be
data driven, which is
neural network algorithms. And in the case of
vision, we started with convolutional
neural networks. For those of you who
are very young students, you probably don't
remember this. But even when I was
a graduate student at the turn of the century,
the convolutional neural network was considered a
classic algorithm, meaning it was pretty old. And it didn't work. But we still studied it when
I was a graduate student. It was incredible to see how
data and the new techniques revitalized this whole
family of algorithms. And for this audience,
I'm going to skip. This is really too trivial. But what happened is that this
brought us the third phase of object recognition. And I would say more or less,
quite a triumphant phase of object recognition, where
using big data as training and convolutional
neural networks, we're able to recognize objects
in the wild in a way that the first two phases couldn't. And these are just examples. And of course, the
most incredible moment, even for myself who
was behind ImageNet, was 2012 when Professor Geoff
Hinton and his students, very famous students,
wrote this defining paper at the beginning of
the deep learning revolution. And ever since then, vision as
a field and ImageNet as a data set have really been driving a
lot of the algorithmic advances in the pre-transformer
era of deep learning. And very proudly as a field,
even work like ResNet was a precursor of
much of the "Attention Is All You Need" paper. So vision as a field
has contributed a lot to the evolution of deep learning. OK, so let me fast forward. As researchers,
after ImageNet, we were thinking about what is
beyond object recognition. And this is really
Ranjay's thesis work, that the world is not just
defined by object identities. If it were, these two
pictures, which both contain a person and a llama,
would mean the same thing. But they don't. I'd rather be the
person on the left than the person on the right. Actually, I'd
rather be the llama on the left than the llama
on the right as well. So objects are important,
but relationships, context, and structure and
compositionality of the scene are all part of the richness
of visual intelligence. And ImageNet was
not enough to push forward this kind of research. So again, heroically
Ranjay was really the key student who was
pushing a new way of thinking about images and
visual representation, mostly focusing on
visual relationships. So the way Ranjay and we put
together the next wave of work was through scene
graph representation. We recognize the entities of the
scene in the unit of objects, but also their own
attributes as well as the inter-entity relationships. And we made a data set--
it was a lot of work-- called Visual Genome, which consisted of hundreds
of thousands of images, but millions of relationships,
attributes, objects, and even natural
language descriptions of the images as a way
to capture the richness of the visual world. There are many works that
came out of Visual Genome, and a lot of them were
written by Ranjay. But one of my favorite works
is this one-shot learning of visual relationships
that Ranjay did where you use the
compositionality of the objects to learn relationships
like people riding horse, people wearing hats. But what comes out of it, with
the compositionality, almost for free, is the
capability of recognizing long-tail relationships
for which you will never have enough training examples. But you're able to do
it during inference, things like horse
wearing hat, or person sitting on fire hydrant. And that really taps into
the relationship as well as the compositionality of images. And yeah, there were some
quantitative measurements that show our work at that time--
now it's ancient times-- did better than
the state of the art. We also went beyond just
a contrived labeling of objects or relationships
and went into natural language. And there was a series of papers,
starting with my former students Andrej Karpathy, whom many of
you know, and Justin Johnson, on image captioning, dense
captioning, paragraph generation. I want to say one
thing that shows you how badly I, at least,
or oftentimes scientists, predict the future. When I was a graduate
student, when I was about to graduate,
2005, I remember it was very clear to me my
life dream as a computer vision scientist was that,
by the time I die, I want to see that computers can tell
a story from a picture. That was my life's dream. I feel that if we can put
a picture into the computer and it will tell us
what's happening, a story, we've achieved the goal
of computer vision. I never dreamed that in less than 10
years, just around 10 years after my graduation, this dream
would be realized collectively, including by my own lab, with
LSTMs at that point, and CNNs. It was just quite a remarkable
moment for me to realize. First of all, it's
kind of the wrong dream to say that that's the
end of the computer vision achievement. Second, I didn't know
how fast it would come. So be careful what you dream of. That was the moral of the story. But static relationships
are easier. The real world is full of
dynamic relationships. Dynamic relationships
are much more nuanced and more difficult. So
this is fairly recent work. It was I think at
NeurIPS two years ago. And we're still doing this work
on multi-object, multi-actor activity recognition
or understanding. And that is ongoing work. I'm not going to get into
the technical details. But the video
understanding, especially with this level of nuance and
details, still excites me. And it's an unsolved problem. I also want to say that vision
as a field has been exciting, not only because I'm
doing some work in it. It's because of some
other people's work. And none of these
are my own work. But I find that
the recent progress in 3D vision, in
pose estimation, in image segmentation,
with Facebook SAM and all the generative AI work has
been just incredible progress. So we're not done with building
AI to see what humans see. But we have come a long way. And part of that is
the result of data, compute, algorithms,
like neural networks that really brought this
deep learning revolution. And as a computer
vision scientist, I'm very proud that our field
has contributed to this. And AI's development
has been and I continue to believe will
be inspired by brain sciences and human cognition. And for this section, I'm
very appreciative of all the collaborators, current
and former students, and Ranjay, you're a part of
them, who have contributed. Let's just fast
forward to building AI to see what humans don't see. Well, I just told you
humans are super good. But I didn't tell you that
we're not good enough. For example, I don't
know about you, but I don't think I can
recognize all these dinosaurs. And in fact, recognizing very
fine-grained objects is not something humans are good at. There are more than 10,000
types of birds in the world. We put together or
we got our hands on a data set of
4,000 types of birds. And humans typically
fail miserably in recognizing all
species of birds. And this is an area
called fine-grained object categorization. And in fact, it's
quite exciting to think that computers at this point
can go beyond human ability, that we can train detectors,
object detectors, that can do much finer-grained
understanding of objects beyond humans. And one of the
application papers we did which I find
very fascinating, is fine-grained
car recognition. We downloaded 3,000 types of
cars, separated by make, model, and year, everything ever built
starting from the 1970s. We stopped before
Tesla was popular. So people always ask
me this question. Where's Tesla? We don't have Tesla. And after we trained the
fine-grained object detector for thousands of
cars, 3,000 types of cars, we downloaded
street view pictures of 100 American cities,
most populated cities, two per state. And we also correlated
this with all the census data that came out of 2010. And it's incredible to see the
world through vision as a lens, the correlation between car
detection and human society is stunning, including income,
including education level, including voting patterns. We have a long paper that
has dozens and dozens of these correlations. So I just want to show you that
even though we don't see it with our individual eyes,
computers can help us see our world, see our society
through these kinds of lenses in ways that humans can't. OK, to drive home this idea that
humans are not that good, even though 10 minutes ago
I told you you're so good, here is this visual illusion
called the Stroop test. Try to read out to yourself
the color of the word, not the word itself. Go left to right and top to
bottom, as fast as possible. It's really hard, right? I have to do red, orange,
green, blah, blah, blah. That's a fun visual illusion. This one some of you
probably have seen. These are two
alternating pictures. They look like the
same but there's one big chunk that's different. Can you tell? Raise your hand if you can. It's an IQ test. [LAUGHTER] m so all the faculty
were thinking, oh no. I didn't raise my hand. OK, so it's the engine. Oh. OK, so it's a huge chunk. This has landed on your retina. And you completely missed it. OK, good job. [LAUGHTER] It's not that funny, if
it's in the real world, when it's a high stake situation. Whether you're passing through
airport security or doing surgeries. So actually not seeing can
have dire consequences. Medical error is the
third-leading cause of American patients'
deaths annually. And in surgery rooms, accounting
for all the instruments and glasses and all that is
actually a critical task. If something is
missing, on average a surgery will stop
for more than one hour, so that the nurses
and doctors have to identify where the thing is,
and think about all the life risk to the patient. And what do we do today? We use hand and count. And imagine if we can
use computer vision to automatically
assist our doctors and nurses to account
for small instruments in a surgical setting. That would be very helpful. And this is an
ongoing collaboration between my lab's health care
team and Stanford Hospital Surgery Department. This is a demo of
accounting for these glasses during a surgical scenario. And if this
becomes mature technology, I really hope it would
have really good applications for these kinds of uses. Sometimes seeing is
not just attention. Every example I just
showed you seemed to be an attentional deficit. But sometimes seeing is
more profound, or not seeing is more profound. This is really my
favorite visual illusion, since I was a graduate student,
made by Ted Adelson at MIT. And I'm just showing
you the answer. In this checkerboard
illusion, if you look at the top graph,
checkerboard squares A and B, no matter what I
tell you, look like different
gray scales, right? I mean, how on
Earth could they have the same gray scale? But if I add this,
you see that they're the same gray scale. So this is a visual illusion. Even if you know
the answer, it's hard to not be
tricked by your eyes. For those of you who are old
enough, who do you see here? AUDIENCE: Bill
Clinton and Al Gore. FEI-FEI LI: Clinton
and Gore, right? Is it? Is it Clinton and Gore? So it turned out they
are Clinton and Clinton. And it's a copy of Clinton's
face in Gore's hair, in a context
that very much primes all of us to see
them as Clinton and Gore. So being primed is a
fundamental thing of human bias. And in computer vision,
we have also inherited, if we're not careful,
computer vision has inherited human bias,
especially through data sets. So Joy Buolamwini, who
used to be at MIT, wrote this beautiful
poem that exposes the bias of computer vision. I'm not nearly
as leading an expert as Joy and many
other people are. But it's important to point
out that not seeing has consequences. And we need to work
really hard to combat these biases that creep into
computer vision and AI systems. And these are just really
examples of hundreds and hundreds of
thousands of papers and work people are doing
in combating biases. Now on the flip side,
sometimes not seeing is a must, as seeing too much
is also really bad. This brings us to
the value of privacy. And my lab has
actually been doing quite a bit of privacy
computing, mostly in the context of health care,
in the past few years, in terms of how we can protect
human dignity, human identity, in the computer vision context. One of my favorite works
that's not led by me is by Juan Carlos Niebles. It combines both
hardware and software to try to protect human
privacy while still recognizing human behaviors
that are important. The idea is the following. If you want to look
at what humans do, you take a camera, you shoot
a video and you analyze it. In this case, a baby
is pushing a box. What if you don't want
to reveal this kid? What if you don't want to
reveal the environment? Can you design a lens that blurs
the raw signal, so that you never take the pure pixel signal? What if the designed lens
gives you a signal like that? So for humans, you
don't even see the baby. Well, that's exactly
what they did. They designed a warped lens. And the lens gives you a
raw signal in the top row. But they also
designed an algorithm that does not do
super resolution-- they have no
intention of recovering the identity of the
people-- but just recovers the activity they need to know. This way their combined
hardware-software approach not only protects privacy,
but also reads out the insight that, whether
you're in transportation cases or health care cases, is
relevant to the application users. So building AI to see
what humans don't see is part of computer
vision's quest. It's also important to
recognize sometimes what humans don't see is bad, like bias. But we also want to
make computers not see the things that we want
to preserve privacy for. So in general, AI can
amplify and exacerbate many profound issues that have
plagued human society for ages, and we must commit
to studying, forecasting, and guiding AI's impact
on humans and society. And many students
and former students have contributed to
this part of the work. Let's talk about building AI
to see what humans want to see. And this is
really about putting humans more at the center of designing
technology to truly help us. When you hear the
word AI, well, you're kind of a biased audience. But when the
general public hears about AI today, what
is the number one thing that comes to their mind? Anxiety, right? A lot of that anxiety is
about the labor landscape, jobs. And this is very important. And if you go to
the news headlines, every other day we see that. But there are actually
a lot of cases where human labor
is in dire shortage. And again, this brings me back
to the health care industry that I also work with. America was missing at least
1 million nurses last year. And the situation is
just getting worse and worse. I talked about the
medical error situation in our health care system. The aging society is
exacerbating the issue of lack of caretakers. And a lot of these burdens fell
on women and people of color in very unfair ways. Care-taking is not
even counted in GDP. So instead of thinking about
AI replacing human capability, it is really valuable to think
about AI augmenting humans, lifting human
jobs, and also giving humans a hand,
especially in health care. From a vision perspective, there are so many times
and so many scenarios that we're in the dark. We don't know how
the patient is doing. We don't know if the care
delivery is high quality. We don't know where that
small instrument went missing in the surgical room. We don't know if we're making
a pharmaceutical error that might have dire consequences. So in the past 10 years, my
lab and I and my collaborators have started this fairly
new area of research called ambient intelligence
for health care, where we use smart sensors, mostly
depth sensors and cameras, and machine learning
algorithms to glean health critical insights. Most of this earlier
work was summarized in this Nature article
called "Illuminating the Dark Spaces of Healthcare
with Ambient Intelligence." I'll just give you a
couple of quick examples. One case study is hand hygiene. We started this work
way before COVID. Everybody thought this is
the most boring project. But when COVID came,
it became so important. It turned out that
hospital-acquired infections kill three times
more people in America than car accidents every year. And a lot of that is because
of doctors and nurses carrying germs and
bacteria from room to room. So WHO has very specific
protocols for hand hygiene. But humans make mistakes. And the way
hospitals monitor that now is very expensive,
sparse, and disruptive. They put humans in front of-- I don't know-- the patient rooms,
and try to remind the doctors and nurses. You can see this is
completely non-scalable. So my students and I
have been collaborating with both Stanford Children's
Hospital and Utah's Intermountain Hospital by
putting depth sensors in front of these hand hygiene
gel dispensers, and then using video analysis
and activity recognition system to watch
if the health care workers are doing the right
thing for hand hygiene. And quantitatively,
the bottom line is the ground truth
of human behavior. You can see that the computer
vision algorithm's precision and recall are very high,
even compared to the human observers that we put in the hospital
in front of the hospital room. Another example is the ICU
Patient Mobility Project. Getting patients to
move in the right way in the ICU is really important. It helps our
patients to recover. And on top of that,
the ICU is so important. 1% of US GDP
is spent in the ICU. Health care is 18%. So this is where patients
fight for death and life. And we want to help
them to recover. We work with Stanford
Hospital to put these sensors, again RGBD sensors in ICU rooms. And we study how the
patients are being moved. Some of the important movements
that doctors want patients to do include getting out
of bed, getting in bed, getting in chair, getting
out of chair, these things. And we can use computer
vision algorithms to help the doctors and nurses
track these movements and so on. So this is, again,
preliminary work. Last but not least,
aging in place. Aging is very important. But how do we keep our seniors
safe, healthy, but also independent in their living? How do we call out early
signs, whether it's infection or mobility change,
sleep disorder, dietary issues? There are so many things. Computer vision
plays a big role in this. We are just starting
to collaborate actually with Thailand and
Singapore right now to get these computer
vision algorithms into the homes of seniors,
but also keeping in mind the privacy concerns. So these are just examples. Last but not the least,
I'm actually still very excited by the long
future where I think no matter what we do, we probably
will enter a world where robots
collaborate with humans to make our lives better. So ambient intelligence
is passive sensors. It can do certain things. But eventually I think embodied
AI will be very, very important in helping people, whether
it's firefighters, or doctors, or caretakers, or
teachers, or so on. And technically,
we need to close the loop between
perception and action to bring robots or
embodied AI to the world. Well, the gap is
still pretty high. This is a robot. I think-- I don't know. It's a Boston Dynamics
robot or some kind of robot. It's a pretty miserable
robot trying to put down a box and miserably failing. And I know
robotic research is also progressing really fast. So it's not fair to just
show that one example. But in general,
a lot of robotic learning and robotic research
right now is still on skill-level tasks,
short-horizon goals, and closed-world instruction. I want to share with
you one work that at least was attempted
towards robotic learning to open world instruction. It's still not fully
closing all the gap, and I don't claim to do so. But at least we're
working on one dimension. And that is-- some of
you know our work VoxPoser, just released
half a year ago-- where we look at a
typical robotic task such as open the
door, or whatever, a robotic task in the wild. And the idea in today's robotic
learning is you give a task, and you try to give
a training set, and then you try to
train an action model. And then you test it. But the problem is,
how do you generalize? How do you hope for
in-the-wild generalization? And how do you hope that
instruction can be open world? And here's the result.
The focus of this work is motion planning in the
wild or using open vocabulary. And the idea is
to actually borrow from large language models: from a large language model
to compose the task, and also from a
visual language model to identify the goal
and also the obstacles, and then use a code-generated
3D value map to guide motion planning. And I'm not going
to get into this. But quickly, so once the
robot takes the instruction, open the top drawer, you use
LLM to compose the instruction. And because the LLM helps you
to identify the objects as well as the actions, you can go use
a VLM, visual language model, to identify the objects
that you need in the world. Every time you do
that, you're starting to update a planning map. In this
case, it helps you identify the drawer. The map sets some values
and it focuses on the drawer. And if you give it an additional
instruction of "watch out for the vase," it goes back
to the LLM and goes back to the VLM, and they identify the vase. And then it identifies
the planning path with the obstacle,
updates the value map, recomputes the
motion map, and does it recursively until it has
optimized this further. So this is the example we see
in simulation and in the real world. And there are several
examples of doing this for articulated objects,
deformable manipulations, as well as just everyday
manipulation tasks. OK, in the last
three minutes, let me just share with you one
more project, then we're done. Is that even with
VoxPoser, which I just showed you, and many
other projects in my lab, I always feel in
the back of my mind that compared to
where I come from, which is the visual
world, these are very small scale data. Very small scale anecdotal
experimental setup, and there is no
standardization, and the tasks were more or less lab specific. And compared to the
real world which is so complex, so dynamic,
so variable, so interactive, and so multitasking
it's just unsatisfying. And how do we make progress
in robotic learning? Vision and NLP have
already shown us that large data drives
learning so much, and the kind of effective
benchmarking drives learning. So how do we combine
the goal of large data and effective benchmarking
for robotic learning has been something on my mind. And this is the new project
that we have been doing. Actually, it's not
so new anymore, for the past three
years, called BEHAVIOR: Benchmark for Everyday
Household Activities in Virtual, Interactive,
Ecological Environments. And let me just
cut to the chase. Instead of small anecdotal tasks
that we want to train robots on, we want to do 1,000
tasks, 1,000 tasks that matter to people. So we started actually by
a human centered approach. We literally go to thousands
of people and ask them, would you like a robot
to help you with-- so let's try this. Would you like a
robot to help you with cleaning kitchen floor? Yeah, sort of, mostly. OK. Shoveling snow? Yeah. Folding laundry? AUDIENCE: Yeah. FEI-FEI LI: Yeah, OK. Cooking breakfast? [INTERPOSING VOICES] FEI-FEI LI: OK, I don't know. I get mixed-- Ranjay wants everything. I get mixed reviews. OK, this one, opening
Christmas gift? AUDIENCE: No. FEI-FEI LI: Right, yeah exactly. OK, I'm glad you're
not a robot, Ranjay. So we actually took this
human centered approach. We went to the government
data on people's daily activities in America
and other countries. We went to crowdsourcing platforms
like Amazon Mechanical Turk. We ask people what
they want robots to do. And we rank thousands of tasks. And then we look at what
people want help with, and what people
don't want help with. It turned out cleaning, all
kinds of cleaning, people hate. But opening Christmas gifts,
buying a ring, or mixing baby cereal is actually really
important for humans. We don't want robots' help. So we took the top
1,000 tasks that people want robots' help with,
and put together the list for BEHAVIOR data set. And then we actually scanned
50 real world environments across eight different things,
like apartments, restaurants, grocery stores,
offices, and so on. And this compared to one of
my favorite works from UW, Objaverse, is very small. But we got thousands and
thousands of object assets. And we created a
simulation environment. OK, all right. I want to actually give
credits to a lot of good work that came out of UW
and many other places. So robotic simulation
is actually a very interesting area of
research, and excellent work like AI2-THOR, Habitat,
and SAPIEN has also been making a lot of contributions. We collaborated with NVIDIA,
especially the Omniverse group, to try to focus on creating
a realistic simulation environment for
robotic learning that has the good physics, like
thermal transitions, lighting, and all that; good perception,
which we did some user studies to show
that we have very good perceptual experience;
and also just interactions. And I'm not going to get
into all the details. We did some comparisons
and show the strength of this BEHAVIOR environment for
training 1,000 robotic tasks. And right now we are working
on a whole bunch of work that is involving
benchmarking, robotic learning, multi-sensory robotics,
and even economic studies on the impact of
household robots. And OK, I actually want to say
one thing I'm not showing here. Is that we are
actually doing brain robotic interfacing,
using BEHAVIOR environment to use EEG to drive robotic
arms to show the brain robot interface. And that was just
published this quarter. So I didn't include this slide. So BEHAVIOR is becoming a
very rich research environment hopefully for our community,
but at least for our lab's robotic work. And of course, the
goal is one day we'll close the gap between
robotics and collaborative robots, home robots
that can help people. And this part of the
research is really trying to identify
problems, whether it's health care or
embodied AI, where we want to build the
AI to see and also to do what humans want it
to, whether it's helping patients or helping the elderly. And I think
the key emphasis is really augmentation. And a lot of collaborators
have participated in this part of the work. This really summarizes the
three phases of our work or three different types of
our work, and all of this has accumulated to what I
would call a human centered AI approach, where we recognize
it's so important to develop AI with a concern for human impact. It's so important to focus AI
to augment and enhance humans. And it's actually
intellectually still important to be inspired by
human intelligence and cognitive sciences
and neurosciences. And that was really the
foundation of Stanford's Human Centered AI Institute that
I co-founded and launched five years ago with faculty from
English, Medicine, Economics, Linguistics, Philosophy,
Political Science, Law School, and all that. And HAI has been around
for five years now almost. We do work from digital
economy to the Center for Research on
Foundation Models, where some of our workers-- like Percy, Chris--
you guys all know them-- are at the forefront
of benchmarking and evaluating today's LLMs. And we also work with faculty
like Michael Bernstein, whom some of you know very well, on creating
an ethics and society review process for AI research. And we also focus on
teaching ethics-focused AI not only
to our undergrads, but also on really bringing that
education to the outside world, especially for policymakers,
as well as business executives. And we directly engage with
the national policy, Congress and Senate and White House to
advocate for public sector AI investment,
especially right now. In fact, UW is
one of the partners, and senators
from Washington state are also extremely important for this,
which is to advocate for the next bill for a national AI research cloud. So this really
concludes my talk. That was a pretty
dense quick overview of a human centered
approach to AI, and I'm happy to take questions. [APPLAUSE] One more slide. PRESENTER: We have time
for maybe two questions. AUDIENCE: What do you think the
most interesting breakthrough in the next 5 or 10 years
is going to be in computing? FEI-FEI LI: The
question is, what do I think the most
interesting breakthrough in the next 5 or 10 years. I just told you in the talk,
I'm so bad at predicting. So I think the two things
that do excite me, one is really just
deepening AI's impact to so many applications
in the world. It's not necessarily yet
another transformer or anything. It's just that we have
gotten to a point, the technology has so
much power and capability. We can use this to do
scientific discovery, to make education
more personalized, to help health care, to map out
the biodiversity of our globe. So I think that deepening and
widening of AI applications or from an academic
point of view, that deepening and widening
of interdisciplinary AI is one thing that really excites
me for the next 5 to 10 years. On the technology side,
I'm totally biased. I think computer vision is
due for another revolution. We're at the cusp of it. There's just so much
that is converging. And I'm really excited to
see the next wave of vision breakthroughs. PRESENTER: Go ahead. AUDIENCE: So large
language models have been impressive
because of what they have been able to do
with semantic understanding. What do you think the frontier
for image, computer vision is in that respect? FEI-FEI LI: Yeah. This is a very good question. The question is
large language models are really encoding
semantics so well. What's the frontier of image? So let me just say something. First of all, the world is
fundamentally very rich. Its language-- Ranjay,
don't yell at me. I still think language is a
lossy compression of the world. It is very rich. It goes beyond just
describing the world. It goes into reasoning,
abstraction, creativity, intention, and all this. But much of language is
symbolic, is a compression. Whereas the world itself in
3D in 4D is very, very rich. And I think there
needs to be a model. The world deserves a model. Not just language
deserves a model. There needs to be a
new wave of technology that really
fundamentally understands the structure of the world. PRESENTER: OK, we have
time for one more. Go ahead. AUDIENCE: I really
agree that language can be lossy, like
compression of the real world. I'm just wondering, what's
your opinion on just how English as a whole is just
so much like dominating the research field
itself, like all these labeled data sets are labeled
in English, while other languages might have different ways
of describing objects, describing the relationship
between objects? That lack of diversity,
how do you feel about it? FEI-FEI LI: Right. So the question is about bias of
English in our dominating data sets of our AI. I think you're calling out a
very important aspect of what I call the inherited
human bias, right? Our data sets inherit
that kind of bias. I do want to say one thing. This is not meant for defense. It's a fun fact that when we
were constructing ImageNet, because the ImageNet was-- George Miller made this lexicon
taxonomy in many languages. It was so nice and easy to map
the synsets of English ImageNet to French, Italian,
Spanish, Portuguese. I think there are also
Asian languages we used. And so even though ImageNet
seemed English to you, the data comes
from all the languages we could get our
hands on the license for. But that doesn't really solve
the problem you're saying. I think you're right. I mean we have to
be really mindful, even in the BEHAVIOR
data set, when we're looking at human
daily activities, we started with the
US government data. We realized we're very biased. First of all, you realize
you're biased because there's so much TV watching in the data. And then we actually
went to Europe. But that does not
include the Global South. So we're definitely
still very biased. PRESENTER: OK, I think
that's all the time we have. Let's thank Fei-Fei. FEI-FEI LI: Thank you. [APPLAUSE]
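
The value-map idea described in the VoxPoser part of the talk (a target attracts, an obstacle such as the vase repels, and motion is planned over the resulting 3D map) can be illustrated with a small sketch. This is not the VoxPoser implementation. It assumes the LLM and VLM steps have already produced a target voxel and obstacle voxels, which are simply hard-coded here, and it swaps in a plain greedy descent for the motion planner. All function names, the penalty value, and the avoidance radius are hypothetical choices for illustration only.

```python
import itertools
import math


def voxel_cost(voxel, target, obstacles=(), avoid_radius=2.0, penalty=50.0):
    """Toy value map: distance to the target plus a penalty near each obstacle.

    In the talk, the target and obstacles come from an LLM and a VLM;
    here they are assumed to be given as voxel coordinates.
    """
    cost = math.dist(voxel, target)
    for obs in obstacles:
        # Penalty falls off linearly and is zero at or beyond avoid_radius.
        cost += penalty * max(0.0, 1.0 - math.dist(voxel, obs) / avoid_radius)
    return cost


def plan_path(shape, start, target, obstacles=(), max_steps=200):
    """Greedy descent over the value map (a stand-in planner, assumed for illustration)."""
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

    def cost(v):
        return voxel_cost(v, target, obstacles)

    pos = tuple(start)
    path = [pos]
    for _ in range(max_steps):
        neighbours = [tuple(p + d for p, d in zip(pos, o)) for o in offsets]
        # Keep only neighbours that lie inside the voxel grid.
        neighbours = [n for n in neighbours if all(0 <= c < s for c, s in zip(n, shape))]
        best = min(neighbours, key=cost)
        if cost(best) >= cost(pos):
            break  # no neighbour improves: at the target or in a local minimum
        pos = best
        path.append(pos)
    return path


if __name__ == "__main__":
    grid = (10, 10, 10)
    # "Open the top drawer": target voxel assumed to come from the VLM.
    print(plan_path(grid, start=(0, 0, 0), target=(9, 9, 9)))
    # "Watch out for the vase": adding an obstacle changes the map, so re-plan.
    print(plan_path(grid, start=(0, 0, 0), target=(9, 9, 9), obstacles=[(5, 5, 5)]))
```

Adding the obstacle changes the map, and the path is recomputed against the updated values, which mirrors the "update the map, then re-plan" loop in the talk. A greedy planner like this can stall in a local minimum on harder maps, so it should be read as an illustration of the map's role and not as a usable planner.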