What we see and what we value: AI with a human perspective—Fei-Fei Li (Stanford University)

PRESENTER: Welcome, everybody. I'm very excited to welcome Fei-Fei Li today. And of course, judging by how packed this room is, Fei-Fei doesn't really need an introduction. And of course, if I actually were to introduce her by reading her bio, it would take a majority of today's time. So I'll keep this brief. Fei-Fei is a professor at Stanford, where she was also my PhD advisor. She's the director of the Human-Centered AI Institute at Stanford. And during the years of 2017 and 2018, she was also a Vice President at Google, as well as the Chief Scientist of AI and Machine Learning at Google Cloud. She, of course, has published hundreds of papers. And perhaps one of the ones that a lot of people know her for is ImageNet, which as you all know, has ushered in the deep learning AI revolution that we're all in today. She also serves as a special advisor to the Secretary General of the United Nations, and is also a member of the National AI Resource Task Force for the White House Office of Science and Technology. And recently, she also has published her book titled The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI. And I'm sure she'll be talking about parts of that book today. So with that, Fei-Fei, welcome to the University of Washington. FEI-FEI LI: Thank you. [APPLAUSE] Thank you. Thank you. Well, it's quite an honor to be here. Actually, as a professor, one of the greatest joys and honors is to work with the best students and see how their careers have grown. And so being invited by Ranjay and his colleagues is really very special. And I'm just loving all the energy I've seen throughout the day today. So OK, I want to share with you a talk that is meant to be a little bit high level, an overview of what I have done over the years, through the lens of computer vision and the development of AI. So the title is "What We See & What We Value-- AI with a Human Perspective." I'm going to take you back to history a little bit.
And when I say history, I meant 540 million years ago. So 540 million years ago, what was that? Well, the Earth is a primordial soup. And it's all living things live in the water. And there aren't that many of them. They are just simple animals floating around. But something really strange happened in geologically a very short period of time, about 10 million years, is from fossil studies scientists have found there is an explosion of the number of animal species around that time. So much that that period is called the Cambrian Explosion or some people call it the Big Bang of evolution. And so what happened? Why suddenly when life was so chill and simple, not too many animals, why life went from that picture to an explosive number of animal species? Well, there are many theories, from climate change to chemical composition of the water, to many things. But one of the leading theories of that Cambrian Explosion is by Andrew Parker, a zoologist from Australia. He conjectured that this speciation explosion is triggered by the sudden evolution of vision, which sets off an evolutionary arms race where animals either evolved or died. Basically, he's saying as soon as you see the first light, you see the world in fundamentally different ways. You can see food. You can see shelter. You can become someone's food. And they would actively prey on you. And you have to actively interact and engage with the world in order to survive and reproduce, and so on. So from that point on, 540 million years to today, vision, visual intelligence has become a cornerstone of the development and the evolution of nervous system of animal intelligence. All the way to, of course, the most incredible visual machine we know in the universe which is the human vision. And whether we're talking about people and many animals, we use vision to navigate the world, to live life, to communicate, to entertain ourselves, to socialize, to do so many things. 
Well, that was a very brief history of nature's vision. What about computer vision? The history of computer vision is a little shorter than evolution. Urban legend goes around 60 years ago, 1966 I think, that there was one ambitious MIT professor who said, well, AI as a field has been born. And it looks like it's going well. I think we can just solve vision in a summer. In fact, we'll solve vision by using our summer workers, undergrads, and we'll just spend this one summer to create or construct a significant part of visual system. This is not a frivolous conjecture. I actually sympathize with him. Because for humans when you open your eyes, it feels so effortless to see. It feels that as soon as you open your eyes, the whole world's information is in front of you. So it might be turned out to be an underestimation of how hard it is to construct the visual system. But it was a heroic effort. They didn't solve vision in a summer, not even a tiny bit of vision. But 60 years later, vision today has become a very thriving field, both academically as well as in our technology world. I'm just showing you a couple of examples of where we are. Right? We have visual applications everywhere. We're dreaming of self-driving cars, which hopefully will happen in our lifetime. We are using image classification or image recognition and so many image technologies for many things from, health care to just daily lives. And generative AI has brought a whole new wave of visual applications and breakthroughs. So the rest of the talk is organized to answer this question. Where have we come from and where are we heading to? And I want to share with you three major theses of the work that I have been doing in my career in recent few years, and just to share with you what I think. Let's begin with building AI to see what humans see. Why do we do that? Because humans are really good at seeing. This is a 1970s cognitive science experiment to show you how good humans are. 
Every frame is refreshed at 10 Hertz, 100 milliseconds of presentation. If I ask you as audience, I assume given how young you are-- you're not even born then-- you've never seen this video. Nod your head when you see one frame that has a person in it. You will see it. Yeah, OK. You've never seen this video. I didn't tell you what the person looked like. I didn't tell you which frame it will appear. You have no idea-- the gesture, the clothes, everything about this. Yet, you're so good at detecting this person. Around the turn of the century, a group of French researchers have put a time on this effortlessness. It turned out seeing complex objects or complex categories for humans is not only effortless and accurate, it's fast. 150 milliseconds after the onset of a complex photo, either containing animals or not containing animals, humans you can measure brain signal that shows that differential signal of pictures, of scene pictures with animals and scene pictures without animals. It means that it takes about 150 milliseconds in our wetware, right here, from the photons landing on your retina to the decision that you can make accurately. I know this sounds slow for silicons. But for our brain, for those of you who come from a little bit of neuroscience background, this is actually super fast. It takes about 10 stages of spikes from passing from neuron to neuron to get here. So it's a very interesting measurement. At around the same time, neurophysiologists, so we've had psychologists telling us humans are really good at seeing objects. We've got neuroscientists telling us not only we're good at it, we're fast. Now, this last set of study, also neurophysiologists use MRI study to tell us, because evolution has optimized recognition so much that we have dedicated neural correlates in the brain, areas that specializes in visual recognition. For example, the fusiform face area, or the parahippocampal place area-- these are areas that we see objects and scenes. 
So what all this has told us, this research from the '70s, '80s, and '90s have told us, is that objects are really important for visual intelligence. It's a building block for people. And it's become a North Star for what vision needs to do. It's not it's not all the North Stars, but it's one important North Star. And that has guided the early phase of my own research as well as the field of computer vision. As a field, we identified that object recognition, object categorization, is an important problem. And it's a mathematically really challenging problem. It's effortless for us. But to recognize, say, a cute animal wombat, you actually have mathematically infinite way of rendering this animal wombat from 3D to the 2D pixels, whether it's lighting and texture variations, or background clutter and occlusion variations, or viewing angle camera angle occlusions, and so on. So it's mathematically a really hard problem. So what did we do as a field? I summarized the progress of object recognition in three phases. The first phase was concurrent. It's a very early phase, concurrent with this cognitive studies is what I call the hand-designed features of models. This is where very smart researchers use their own sheer power of their brain to design the kind of building blocks of objects, as well as the model, the parameters, and so on. So we see Geon theory. We see generalized cylinder. We see parts and springs models. And these are in the '70s, '80s, or early '90s. They're beautiful theory. They're mathematically beautiful models. But the thing is, they don't work. They're theoretically beautiful. Then there's a second phase, which I think is the most important phase actually, leading up to deep learning, which is machine learning. It's when we have introduced machine learning as a statistical modeling technique, but the input of these models are hand-designed features like patches, and parts of objects that are meant to carry a lot of semantic information. 
And the idea is that in order to recognize something like a human body, or a face, or whatever, a chair-- it's important to get these patches that contain ears and eyes and whatever. And then you use machine learning models to learn the parameters that stitch them together. And this is when the whole field experimented with many different kinds of statistical models, from Bayes nets, support vector machines, boosting, conditional random fields, and random forests to neural networks. But this is the first phase of that. Something else important that happened concurrently with this phase is actually the recognition of data. In the early years of the 21st century, the field of computer vision recognized it's important to have benchmarking data sets, data sets like the PASCAL VOC data set and the Caltech 101 data set, that are meant to measure the progress of the field. And it turned out they can also become some level of training data. But they're very small. They're on the order of hundreds or thousands of pictures, and a handful of object categories. Personally for me, this was around the time I stumbled upon a very incredible number. I call it, if you read my book, the Biederman number. Professor Biederman, who sadly just passed away a year ago, was a cognitive psychologist studying vision and thinking about the scale and scope of human visual intelligence. And on the back of an envelope, he put out a guesstimate that humans can recognize 30,000 to 100,000 object categories in their lives. And it's not a verified number. It's very hard to verify. This is a conjecture in one of his papers. And he also went on to say that by age 6, you actually learn pretty much all the visual categories that a grown-up has learned. This is an incredible speed of learning, a dozen a day or so. So this number bugged me a lot because it just doesn't compare to all the data sets we'd seen at that point.
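As a quick sanity check on the "dozen a day or so" figure, here is a minimal back-of-envelope sketch. It assumes the quoted range is read as 30,000 to 100,000 categories, that learning is spread evenly from birth to age 6, and that a year has 365 days; the constant and function names are illustrative only and do not come from the talk.

```python
# Back-of-envelope check of the learning rate implied by the numbers in the talk.
# Assumption: the quoted range is read as 30,000 to 100,000 object categories.
CATEGORIES_LOW = 30_000
CATEGORIES_HIGH = 100_000
YEARS = 6              # "by age 6"
DAYS_PER_YEAR = 365    # assumption: ignore leap days


def categories_per_day(total_categories: int, years: int = YEARS) -> float:
    """Average number of new categories learned per day over `years` years."""
    return total_categories / (years * DAYS_PER_YEAR)


low_rate = categories_per_day(CATEGORIES_LOW)
high_rate = categories_per_day(CATEGORIES_HIGH)
print(f"{low_rate:.1f} to {high_rate:.1f} categories per day")
```

Under these assumptions, the lower end of the range works out to a bit more than a dozen new categories per day, which is in line with the figure quoted in the talk.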
And that was the reason, the inception of ImageNet, that my student Jia Deng, and collaborators, and I recognized that there's a new way of thinking about visual intelligence. It's deeply, deeply data driven. And it's not just the size of the data. It's the diversity of data. And this is really history. You all know what ImageNet is. And it also brought back the most important family of algorithms that is high capacity and needs to be data driven, which is convolutional or, more generally, neural network algorithms. And in the case of vision, we started with convolutional neural networks. For those of you who are very young students, you probably don't remember this. But even when I was a graduate student at the turn of the century, the convolutional neural network was considered a classic algorithm, meaning it was pretty old. And it didn't work. But we still studied it when I was a graduate student. It was incredible to see how data and the new techniques revitalized this whole family of algorithms. And for this audience, I'm going to skip. This is really too trivial. But what happened is that this brought us the third phase of object recognition. And I would say more or less, quite a triumphant phase of object recognition, where using big data as training and convolutional neural networks, we're able to recognize objects in the wild in a way that the first two phases couldn't. And these are just examples. And of course, the most incredible moment, even for myself who was behind ImageNet, was 2012, when Professor Geoff Hinton and his students, very famous students, wrote this defining paper that marked the beginning of the deep learning revolution. And ever since then, vision as a field and ImageNet as a data set have really driven a lot of the algorithm advances in the pre-transformer era of deep learning. And very proudly as a field, even work like ResNet was a precursor to many of the ideas in the "Attention Is All You Need" paper.
So vision as a field has contributed a lot to deep learning evolution. OK, so let me fast forward. As researchers, after ImageNet, we were thinking about what is beyond object recognition. And this is really Ranjay's thesis work, is that the world is not just defined by object identities. If it were, these two pictures, which both contain a person and a llama, would mean the same thing. But they don't. I'd rather be the person on the left than the person on the right. Actually, I'd rather be the llama on the left than the llama on the right as well. So objects are important, but relationships, context, and structure and compositionality of the scene are all part of the richness of visual intelligence. And ImageNet was not enough to push forward this kind of research. So again, heroically, Ranjay was really the key student who was pushing a new way of thinking about images and visual representation, mostly focusing on visual relationships. So the way Ranjay and we put together the next wave of work was through scene graph representation. We recognize the entities of the scene in the unit of objects, but also their own attributes as well as the inter-entity relationships. And we made a data set-- it was a lot of work-- called Visual Genome, which consisted of hundreds of thousands of images, but millions of relationships, attributes, objects, and even natural language descriptions of the images as a way to capture the richness of the visual world. There are many works that came out of Visual Genome, and a lot of them were written by Ranjay. But one of my favorite works is this one-shot learning of visual relationships that Ranjay did, where you use the compositionality of the objects to learn relationships like people riding horse, people wearing hats. But what comes out of the compositionality, almost for free, is the capability of recognizing long-tail relationships that you will never have enough training examples for.
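To make the scene graph idea above concrete, here is a minimal sketch of objects, attributes, and inter-entity relationships as a small data structure. It is an illustration only, not the Visual Genome format or the authors' code: the class names, the attribute values, and the predicates "feeding" and "chased by" are hypothetical, since the talk does not spell out what the two llama pictures show.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One (subject, predicate, object) triple, e.g. person -riding-> horse."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical toy scene graph: objects, their attributes, and relationships."""
    objects: set[str] = field(default_factory=set)
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relationships: set[Relationship] = field(default_factory=set)

    def add_object(self, name: str, *attrs: str) -> None:
        self.objects.add(name)
        self.attributes.setdefault(name, set()).update(attrs)

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Relationships may only connect objects that are already in the scene.
        if subject not in self.objects or obj not in self.objects:
            raise ValueError("both ends of a relationship must be known objects")
        self.relationships.add(Relationship(subject, predicate, obj))


# Two images with identical object labels but different (made-up) relationships.
left = SceneGraph()
left.add_object("person", "smiling")
left.add_object("llama", "calm")
left.relate("person", "feeding", "llama")

right = SceneGraph()
right.add_object("person", "running")
right.add_object("llama", "angry")
right.relate("person", "chased by", "llama")

print("same objects:", left.objects == right.objects)
print("same relationships:", left.relationships == right.relationships)
```

An object-only view cannot tell the two scenes apart, while the relationship triples can, which is the point the person-and-llama example is making.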
But you're able to do it during inference, which is like horse wearing hat, or person sitting on fire hydrant. And that really taps into the relationship as well as the compositionality of images. And yeah, there were some quantitative measurements showing that our work at that time-- now it's ancient time-- did better than the state of the art. We also went beyond just contrived labeling of objects or relationships and went into natural language. And there was a series of papers, starting with my former students Andrej Karpathy, who many of you know, and Justin Johnson, on image captioning, dense captioning, paragraph generation. I want to say one thing that shows you how badly I, at least, and oftentimes scientists, predict the future. When I was a graduate student, when I was about to graduate, 2005, I remember it was very clear to me that my life dream as a computer vision scientist was that, when I die, I want to see computers that can tell a story from a picture. That was my life's dream. I felt that if we can put a picture into the computer and it will tell us what's happening, a story, we've achieved the goal of computer vision. I never dreamed that in less than 10 years, just around 10 years after my graduation, this dream would be realized collectively, including by my own lab, by LSTMs at that point, and CNNs. It was just quite a remarkable moment for me to realize. First of all, it's kind of the wrong dream to say that that's the end of the computer vision achievement. Second, I didn't know how fast it would come. So be careful what you dream of. That was the moral of the story. But static relationships are easier. The real world is full of dynamic relationships. Dynamic relationships are much more nuanced and more difficult. So this is fairly recent work. It was I think at NeurIPS two years ago. And we're still doing this work on multi-object, multi-actor activity recognition or understanding. And that is ongoing work. I'm not going to get into the technical details.
But the video understanding, especially with this level of nuance and details, still excites me. And it's an unsolved problem. I also want to say that vision as a field has been exciting, not only because I'm doing some work in it. It's because some other people's work. And none of these are my own work. But I find that the recent progress in 3D vision, in pose estimation, in image segmentation, with Facebook SAM and all the generative AI work has been just incredible progress. So we're not done with building AI to see what humans see. But we have gone a long way. And part of that is the result of data, compute, algorithms, like neural networks that really brought this deep learning revolution. And as a computer vision scientist, I'm very proud that our field has contributed to this. And AI's development has been and I continue to believe will be inspired by brain sciences and human cognition. And for this section, I'm very appreciative of all the collaborators, current and former students, and Ranjay you're a part of them, who has contributed. Let's just fast forward to building AI to see what humans don't see. Well, I just told you humans are super good. But I didn't tell you that we're not good enough. For example, I don't know about you, but I don't think I can recognize all these dinosaurs. And in fact, recognizing very fine-grained objects is not something humans are good at. There are more than 10,000 types of birds in the world. We put together or we got our hands on a data set of 4,000 types of birds. And humans typically fail miserably in recognizing all species of birds. And this is an area called fine-grained object categorization. And in fact, it's quite exciting to think about computers at this point can go beyond human ability to train detectors, object detectors, that can do much finer grain understanding of objects beyond humans. And one of the application papers we did which I find very fascinating, is a fine-grained car recognition. 
We downloaded 3,000 types of cars, separated by make, model, and year, that were ever built starting in the 1970s. We stopped before Tesla was popular. So people always ask me this question. Where's Tesla? We don't have Tesla. And after we trained the fine-grained object detector for thousands of cars, 3,000 types of cars, we downloaded street view pictures of 100 American cities, the most populated cities, two per state. And we also correlated this with all the census data that came out of 2010. And it's incredible to see the world through vision as a lens. The correlation between car detection and human society is stunning, including income, including education level, including voting patterns. We have a long paper that has dozens and dozens of these correlations. So I just want to show you that even though we don't see it with our individual eyes, computers can help us see our world, see our society, through these kinds of lenses in ways that humans can't. OK, to drive home this idea that humans are not that good, even though 10 minutes ago I told you you're so good, here is this visual illusion called the Stroop test. Try to read out to yourself the color of the word, not the word itself. Go left to right and top to bottom, as fast as possible. It's really hard, right? I have to do red, orange, green, blah, blah, blah. That's a fun visual illusion. This one some of you probably have seen. These are two alternating pictures. They look the same, but there's one big chunk that's different. Can you tell? Raise your hand if you can. It's an IQ test. [LAUGHTER] So all the faculty were thinking, oh no, I didn't raise my hand. OK, so it's the engine. Oh. OK, so it's a huge chunk. This has landed on your retina. And you completely missed it. OK, good job. [LAUGHTER] It's not that funny if it's in the real world, when it's a high-stakes situation, whether you're passing through airport security or doing surgeries. So actually not seeing can have dire consequences.
Medical error is the third-leading cause of American patients' deaths annually. And in surgery rooms, accounting for all the instruments and glasses and all that is actually a critical task. If something is missing, on average a surgery will stop for more than one hour, so that the nurses and doctors can identify where the thing is, and think about all the life risk to the patient. And what do we do today? We count by hand. And imagine if we could use computer vision to automatically assist our doctors and nurses in accounting for small instruments in a surgical setting. That would be very helpful. And this is an ongoing collaboration between my lab's health care team and the Stanford Hospital Surgery Department. This is a demo of accounting for these glasses during a surgical scenario. And if this becomes mature technology, I really hope this would have really good application for these kinds of uses. Sometimes seeing is not just attention. Every example I just showed you seemed to be about attentional deficit. But sometimes seeing is more profound, or not seeing is more profound. This is really my favorite visual illusion, since I was a graduate student, made by Ted Adelson at MIT. And I'm just showing you the answer. In this checkerboard illusion, if you look at the top graph, checkerboard A and B, no matter what I tell you, they look like different gray scales, right? I mean, how on Earth could they have the same gray scale? But if I add this, you see that they're the same gray scale. So this is a visual illusion. Even if you know the answer, it's hard to not be tricked by your eyes. For those of you who are old enough, who do you see here? AUDIENCE: Bill Clinton and Al Gore. FEI-FEI LI: Clinton and Gore, right? Is it? Is it Clinton and Gore? So it turned out they are Clinton and Clinton. It's a copy of Clinton's face in Gore's hair, in a context that very much primes all of us to see them as Clinton and Gore.
So being primed is a fundamental thing of human bias. And in computer vision, we have also inherited, if we're not careful, computer vision has inherited human bias, especially through data sets. So Joy Buolamwini used to be at MIT, had written this beautiful poem that exposes the bias of computer vision. So I'm not nearly as a leading expert as Joy and many other people are. But it's important to point out that not seeing has consequences. And we need to work really hard to combat these biases that creep into computer vision and AI systems. And these are just really examples of hundreds and hundreds of thousands of papers and work people are doing in combating biases. Now on the flip side, sometimes not seeing is a must, as seeing too much is also really bad. This brings us to the value of privacy. And my lab has been actually doing quite a bit in the context of health care, but quite a bit of privacy computing in the past few years in terms of how we can protect human dignity, human identity, in computer vision context. One of my favorite works that's not led by me is by Juan Carlos Niebles. That combines both hardware and software to try to protect human privacy while still recognizing human behaviors that are important. The idea is the following. If you want to look at what humans do, you take a camera you shoot a video and you analyze it. In this case, a baby is pushing a box. What if you don't want to reveal this kid? What if you don't want to reveal the environment? Can you design a lens that blurs the raw signal, like you never take the pure pixel signal? What if the designed lens gives you a signal like that? So for humans, you don't even see the baby. Well, that's exactly what they did. They designed a warped lens. And the lens gives you a raw signal in the top row. But they also designed an algorithm that retrieves not super resolution, they have no intention to recover the identity of the people, but just to recover the activity they need to know. 
This way their combined hardware-software approach not only protects privacy, but also reads out the insight that whether you're in transportation cases or health care cases, that is relevant to the application users. So building AI to see what humans don't see is part of computer vision's quest. It's also important to recognize sometimes what humans don't see is bad, like bias. But we also want to make computer not see the things that we want to preserve privacy for. So in general, AI can amplify and exacerbate many profound issues that has plagued human society for ages, and we must commit to study and forecast and guide AI's impact for human and society. And many students and former students have contributed to this part of the work. Let's talk about building AI to see what humans want to see. And this is where really putting humans more in the center of designing technology to truly help us. When you hear the word AI, well, you're kind of a biased audience. But when the general public hears about AI today, what is the number one thing that comes to their mind? Anxiety, right? A lot of that anxiety is labor landscape, jobs. And this is very important. And if you go to headlines of news, every other day we see that. But there is actually a lot of cases where human labor is in dire shortage. And again, this brings me back to the health care industry that I also work with. America was missing at least 1 million nurses last year. And the situation is just worse and worse. I talked about the medical error situation in our health care system. The aging society is exacerbating the issue of lack of caretakers. And a lot of these burdens fell on women and people of color in very unfair ways. Care-taking is not even counted in GDP. So instead of thinking about AI replacing human capability, it is really valuable to think about AI augmenting humans, and to lift human jobs, and to also give human a hand, especially health care from a vision perspective. 
There are so many times and so many scenarios that we're in the dark. We don't know how the patient is doing. We don't know if the care delivery is high quality. We don't know where that small instrument was missing in the surgical room. We don't know if we're making a pharmaceutical error that might have dire consequences. So in the past 10 years, my lab and I and my collaborators have started this fairly new area of research called ambient intelligence for health care, where we use smart sensors, mostly depth sensors and cameras, and machine learning algorithms to glean health critical insights. Most of this earlier work was summarized in this Nature article called "Illuminating the Dark Spaces of Healthcare with Ambient Intelligence." I'll just give you a couple of quick examples. One case study is hand hygiene. We started this work way before COVID. Everybody thought this is the most boring project. But when COVID came, it became so important. It turned out that hospital acquired infection kills three times more people in America than car accidents every year. And a lot of that is because of doctors and nurses carrying germs and bacteria from room to room. So WHO has very specific protocols for hand hygiene. But humans make mistakes. And now the way to monitor that by hospitals is very expensive, sparse, and disruptive. They put humans in front of-- I don't know the patient rooms, and try to remind the doctors and nurses. You can see this is completely non-scalable. So my students and I have been collaborating with both Stanford Children's Hospital and Utah's Intermountain Hospital by putting depth sensors in front of these hand hygiene gel dispensers, and then using video analysis and activity recognition system to watch if the health care workers are doing the right thing for hand hygiene. And quantitatively, the bottom line is the ground truth of human behavior. 
You can see that the computer vision algorithm's precision and recall are very high, even compared to the human observers that we put in the hospital in front of the hospital room. Another example is the ICU Patient Mobility Project. Getting patients to move in the right way in the ICU is really important. It helps our patients to recover. And on top of that, the ICU is so important. 1% of US GDP is spent in the ICU. Health care is 18%. So this is where patients fight between life and death. And we want to help them to recover. We work with Stanford Hospital to put these sensors, again RGBD sensors, in ICU rooms. And we study how the patients are being moved. Some of the important movements that doctors want patients to do include getting out of bed, getting in bed, getting in chair, getting out of chair, these things. And we can use computer vision algorithms to help the doctors and nurses to track these movements and so on. So this is, again, preliminary work. Last but not least, aging in place. Aging is very important. But how do we keep our seniors safe, healthy, but also independent in their living? How do we call out early signs of whether it's infection or mobility change, sleep disorder, dietary issues? There are so many things. Computer vision plays a big role in this. We are just starting to collaborate actually with Thailand and Singapore right now to get these computer vision algorithms into the homes of seniors, while also keeping in mind the privacy concerns. So these are just examples. Last but not the least, I'm actually still very excited by the long future where I think no matter what we do, we probably will enter a world where robots collaborate with humans to make our lives better. So ambient intelligence is passive sensors. It can do certain things. But eventually I think embodied AI will be very, very important in helping people, whether it's firefighters, or doctors, or caretakers, or teachers, and so on.
And technically, we need to close the loop between perception and action to bring robots or embodied AI to the world. Well, the gap is still pretty high. This is a robot. I think-- I don't know. It's a Boston Dynamics robot or some kind of robot. It's a pretty miserable robot trying to put a box and miserably failed. And I know there are so many-- robotic research is also really progressing really fast. So it's not fair to just show that one example. But in general, we are still a lot of robotic learning and robotic research right now is still on skill level tasks, short horizon goals, and closed world instruction. I want to share with you one work that at least was attempted towards robotic learning to open world instruction. It's still not fully closing all the gap, and I don't claim to do so. But at least we're working on one dimension. And that is some of you know our work VoxPoser, just released half a year ago. Where we look at a typical robotic task such as open the door, or whatever, a robotic task in the wild. And the idea in today's robotic learning is you give a task, and you try to give a training set, and then you try to train an action model. And then you test it. But the problem is, how do you generalize? How do you hope in the wild generalization? And how do you hope that instruction can be open world? And here's the result. The focus of this work is motion planning in the wild or using open vocabulary. And the idea is to actually borrow from large language models. From large language model, to compose the task, and from also a visual language model to identify the goal and also the obstacles, and then use a code generated 3D value map to guide to do motion planning. And I'm not going to get into this. But quickly, so once the robot takes the instruction, open the top drawer, you use LLM to compose the instruction. 
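The value-map idea mentioned above can be illustrated with a heavily simplified sketch. This is not VoxPoser's implementation: it uses a 2D grid instead of a 3D voxel map, hard-codes the target and obstacle positions that a language model and a vision-language model would supply in the real system, and replaces the motion planner with a plain greedy descent. All names here (`build_cost_map`, `greedy_path`, `TARGET`, `VASE`) are hypothetical.

```python
import math
from itertools import product

Cell = tuple[int, int]


def build_cost_map(size: int, target: Cell, obstacles: list[Cell],
                   penalty: float = 100.0) -> dict[Cell, float]:
    """Cost = distance to the target, plus a penalty on and next to obstacles."""
    cost = {}
    for cell in product(range(size), repeat=2):
        value = math.dist(cell, target)
        near_obstacle = any(
            max(abs(cell[0] - o[0]), abs(cell[1] - o[1])) <= 1 for o in obstacles
        )
        cost[cell] = value + (penalty if near_obstacle else 0.0)
    return cost


def greedy_path(cost: dict[Cell, float], start: Cell, target: Cell,
                max_steps: int = 100) -> list[Cell]:
    """Repeatedly step to the cheapest neighbor; stop at the target or a local minimum."""
    path = [start]
    current = start
    for _ in range(max_steps):
        if current == target:
            break
        neighbors = [
            (current[0] + dx, current[1] + dy)
            for dx, dy in product((-1, 0, 1), repeat=2)
            if (dx, dy) != (0, 0) and (current[0] + dx, current[1] + dy) in cost
        ]
        best = min(neighbors, key=cost.__getitem__)
        if cost[best] >= cost[current]:
            break  # greedy descent is stuck; a real planner would handle this case
        current = best
        path.append(current)
    return path


TARGET = (6, 6)  # stand-in for "the drawer handle"
VASE = (3, 3)    # stand-in for the obstacle named in a follow-up instruction

plain = greedy_path(build_cost_map(7, TARGET, []), (0, 0), TARGET)
careful = greedy_path(build_cost_map(7, TARGET, [VASE]), (0, 0), TARGET)
print("without obstacle:", plain)
print("with obstacle:   ", careful)
```

In this toy setup, adding the obstacle only changes the map, and the same planner then routes around it. That mirrors the flow in the talk, where a new instruction updates the value map and the motion is recomputed, but greedy descent can get stuck in local minima and is used here only to keep the sketch short.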
And because the LLM helps you to identify the objects as well as the actions, you can go use a VLM, visual language model, to identify the objects that you need in the world. Every time you do that, you're starting to update a planning map. And it helps to, in this case, identify the drawer. The map sets some values and it focuses on the drawer. And if you give it an additional instruction of watch out for the vase, it goes back to the LLM and goes back to the VLM, and they identify the vase. And then it identifies the planning path with the obstacle, and updates the value map, and recomputes the motion map, and does it recursively until it has optimized this. So this is the example we see in simulation and in the real world. And there are several examples of doing this for articulated objects, deformable manipulations, as well as just everyday manipulation tasks. OK, in the last three minutes, let me just share with you one more project, then we're done. It's that even with VoxPoser, which I just showed you, and many other projects in my lab, I always feel in the back of my mind that compared to where I come from, which is the visual world, these are very small-scale data. Very small-scale, anecdotal experimental setups, and there is no standardization, and the tasks were more or less lab specific. And compared to the real world, which is so complex, so dynamic, so variable, so interactive, and so multitasking, it's just unsatisfying. And how do we make progress in robotic learning? Vision and NLP have already shown us that large data drives learning so much, and that effective benchmarking drives learning. So how do we combine the goals of large data and effective benchmarking for robotic learning has been something on my mind. And this is the new project that we have been doing. Actually, it's not so new anymore, for the past three years, called BEHAVIOR, benchmark for everyday household activities in virtual interactive ecological environments. 
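The value-map loop described for VoxPoser above can be illustrated with a toy sketch. This is not the VoxPoser implementation. It assumes the language and vision models have already been run and have returned a target voxel (the drawer) and, after the extra instruction, an obstacle voxel (the vase). The function names `build_value_map` and `greedy_plan`, the grid size, and the penalty parameters are all hypothetical choices made for illustration, and simple greedy descent stands in for the real motion planner.

```python
import math
from itertools import product


def build_value_map(size, target, obstacles, obstacle_weight=5.0, obstacle_radius=2.0):
    """Assign a cost to every voxel of a size^3 grid (hypothetical helper).

    Cost = distance to the target, plus a penalty that grows as a voxel
    gets closer than obstacle_radius to any obstacle.
    """
    cost = {}
    for voxel in product(range(size), repeat=3):
        value = math.dist(voxel, target)
        for obstacle in obstacles:
            gap = math.dist(voxel, obstacle)
            if gap < obstacle_radius:
                value += obstacle_weight * (obstacle_radius - gap)
        cost[voxel] = value
    return cost


def greedy_plan(cost, start, max_steps=100):
    """Follow the value map downhill, one neighboring voxel at a time.

    Stops when no neighbor has a lower cost than the current voxel.
    Greedy descent is an assumption of this sketch and can stall in a
    local minimum on harder maps.
    """
    path = [start]
    current = start
    steps = [s for s in product((-1, 0, 1), repeat=3) if any(s)]
    for _ in range(max_steps):
        neighbors = [tuple(c + d for c, d in zip(current, s)) for s in steps]
        neighbors = [n for n in neighbors if n in cost]
        best = min(neighbors, key=cost.__getitem__)
        if cost[best] >= cost[current]:
            break
        current = best
        path.append(current)
    return path


# Assumed model outputs: where the drawer and the vase sit in the grid.
start, drawer, vase = (0, 0, 0), (7, 7, 7), (4, 4, 4)

# First instruction: "open the top drawer" (no obstacles known yet).
path_before = greedy_plan(build_value_map(8, drawer, obstacles=[]), start)

# Extra instruction: "watch out for the vase" -> update the map and replan.
path_after = greedy_plan(build_value_map(8, drawer, obstacles=[vase]), start)

print("passes through vase before update:", vase in path_before)
print("passes through vase after update:", vase in path_after)
```

Rebuilding the map and replanning each time an instruction arrives mirrors the update-and-recompute loop in the talk, at a much smaller scale.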
And let me just cut to the chase. Instead of small anecdotal tasks that we want to train robots on, we want to do 1,000 tasks, 1,000 tasks that matter to people. So we started actually with a human centered approach. We literally go to thousands of people and ask them, would you like a robot to help you with-- so let's try this. Would you like a robot to help you with cleaning the kitchen floor? Yeah, sort of, mostly. OK. Shoveling snow? Yeah. Folding laundry? AUDIENCE: Yeah. FEI-FEI LI: Yeah, OK. Cooking breakfast? [INTERPOSING VOICES] FEI-FEI LI: OK, I don't know. I get mixed-- Ranjay wants everything. I get mixed reviews. OK, this one, opening Christmas gifts? AUDIENCE: No. FEI-FEI LI: Right, yeah exactly. OK, I'm glad you're not a robot, Ranjay. So we actually took this human centered approach. We went to the government data on people's daily activities in America and other countries. We go to crowdsourcing platforms like Amazon Mechanical Turk. We ask people what they want robots to do. And we rank thousands of tasks. And then we look at what people want help with, and what people don't want help with. It turned out cleaning, all kinds of cleaning, people hate. But opening Christmas gifts, or buying a ring, or mixing baby cereal, is actually really important for humans. We don't want robots' help. So we took the top 1,000 tasks that people want robots' help with, and put together the list for the BEHAVIOR data set. And then we actually scanned 50 real world environments across eight different things, like apartments, restaurants, grocery stores, offices, and so on. And this, compared to one of my favorite works from UW, Objaverse, is very small. But we got thousands and thousands of object assets. And we created a simulation environment. OK, all right. I want to actually give credit to a lot of good work that came out of UW and many other places. 
So robotic simulation is actually a very interesting area of research, and excellent work like AI2-THOR, Habitat, and SAPIEN has also been making a lot of contributions. We collaborated with NVIDIA, especially the Omniverse group, to try to focus on creating a realistic simulation environment for robotic learning that has good physics, like thermal, transitional, lighting, and all that; good perception, for which we did some user studies to show that we have a very good perceptual experience; and also just interactions. And I'm not going to get into all the details. We did some comparisons and showed the strength of this BEHAVIOR environment for training 1,000 robotic tasks. And right now we are working on a whole bunch of work that involves benchmarking, robotic learning, multi-sensory robotics, and even economic studies on the impact of household robots. And OK, I actually want to say one thing I'm not showing here. It's that we are actually doing brain robotic interfacing, using the BEHAVIOR environment and using EEG to drive robotic arms to show the brain robot interface. And that was just published this quarter. So I didn't include this slide. So BEHAVIOR is becoming a very rich research environment, hopefully for our community, but at least for our lab's robotic work. And of course, the goal is one day we'll close the gap between robotics and collaborative robots, home robots that can help people. And this part of the research is really trying to identify problems, whether it's health care or embodied AI, where we want to build the AI to see and also to do what humans want it to, whether it's helping patients or helping the elderly. And I think the key emphasis is really augmentation. And a lot of collaborators have participated in this part of the work. 
This really summarizes the three phases of our work, or three different types of our work, and all of this has accumulated to what I would call a human centered AI approach, where we recognize it's so important to develop AI with a concern for human impact. It's so important to focus AI to augment and enhance humans. And it's actually intellectually still important to be inspired by human intelligence and cognitive sciences and neurosciences. And that was really the foundation of Stanford's Human Centered AI Institute that I co-founded and launched five years ago with faculty from English, Medicine, Economics, Linguistics, Philosophy, Political Science, Law School, and all that. And HAI has been around for five years now, almost. We do work from digital economy to the Center for Research on Foundation Models, where some of our workers-- like Percy, Chris-- you guys all know them-- are at the forefront of benchmarking and evaluating today's LLMs. And we also work with faculty like Michael Bernstein, some of you know him very well, on creating an ethics and society review process for AI research. And we also focus on not only teaching ethics focused AI to our undergrads, but also really bringing that education to the outside world, especially for policymakers, as well as business executives. And we directly engage with national policy, Congress and Senate and White House, to advocate for public sector AI investment, especially right now. In fact, UW is one of the partners, and senators from Washington state are extremely important for this, which is to advocate for the next bill for a national AI research cloud. So this really concludes my talk. That was a pretty dense, quick overview of a human centered approach to AI, and I'm happy to take questions. [APPLAUSE] One more slide. PRESENTER: We have time for maybe two questions. AUDIENCE: What do you think the most interesting breakthrough in the next 5 or 10 years is going to be in computing? 
FEI-FEI LI: The question is, what do I think the most interesting breakthrough in the next 5 or 10 years will be. I just told you in the talk, I'm so bad at predicting. So I think there are two things that do excite me. One is really just deepening AI's impact on so many applications in the world. It's not necessarily yet another transformer or anything. It's just that we have gotten to a point where the technology has so much power and capability. We can use this to do scientific discovery, to make education more personalized, to help health care, to map out the biodiversity of our globe. So I think that deepening and widening of AI applications, or from an academic point of view, that deepening and widening of interdisciplinary AI, is one thing that really excites me for the next 5 to 10 years. On the technology side, I'm totally biased. I think computer vision is due for another revolution. We're at the cusp of it. There's just so much that is converging. And I'm really excited to see the next wave of vision breakthroughs. PRESENTER: Go ahead. AUDIENCE: So large language models have been impressive because of what they have been able to do with semantic understanding. What do you think the frontier for image, computer vision is in that respect? FEI-FEI LI: Yeah. This is a very good question. The question is, large language models are really encoding semantics so well. What's the frontier of image? So let me just say something. First of all, the world is fundamentally very rich. Language-- Ranjay, don't yell at me. I still think language is a lossy compression of the world. It is very rich. It goes beyond just describing the world. It goes into reasoning, abstraction, creativity, intention, and all this. But much of language is symbolic, is a compression. Whereas the world itself, in 3D, in 4D, is very, very rich. And I think there needs to be a model. The world deserves a model. Not just language deserves a model. 
There needs to be a new wave of technology that really fundamentally understands the structure of the world. PRESENTER: OK, we have time for one more. Go ahead. AUDIENCE: I really agree that language can be lossy, like compression of the real world. I'm just wondering, what's your opinion on just how English as a whole is just so much like dominating the research field itself, like all these labeled data sets are labeled in English, while other languages might have different ways of describing objects, describing the relationship between objects? That lack of diversity, how do you feel about it? FEI-FEI LI: Right. So the question is about the bias of English dominating the data sets of our AI. I think you're calling out a very important aspect of what I call the inherited human bias, right? Our data sets inherit that kind of bias. I do want to say one thing. This is not meant as a defense. It's a fun fact that when we were constructing ImageNet, because ImageNet was-- George Miller made this lexicon taxonomy in many languages. It was so nice and easy to map the synsets of English ImageNet to French, Italian, Spanish, Portuguese. I think there are also Asian languages we used. And so even though ImageNet seems English to you, the data comes from all the languages we could get our hands on the license for. But that doesn't really solve the problem you're describing. I think you're right. I mean, we have to be really mindful. Even in the BEHAVIOR data set, when we're looking at human daily activities, we started with the US government data. We realized we're very biased. First of all, you realize you're biased because there's so much TV watching in the data. And then we actually went to Europe. But that does not include the Global South. So we're definitely still very biased. PRESENTER: OK, I think that's all the time we have. Let's thank Fei-Fei. FEI-FEI LI: Thank you. [APPLAUSE]

Info

Channel: Paul G. Allen School

Views: 48,125

Keywords: Paul G. Allen School of Computer Science & Engineering, University of Washington

Id: gzOwpEupP5w

Length: 60min 25sec (3625 seconds)

Published: Sun Jan 21 2024