Keynote Talk: Model Based Machine Learning

Captions
>> It's my absolute pleasure to introduce Chris Bishop. Chris Bishop is a Microsoft Technical Fellow. He is also the Managing Director of our Cambridge Research Lab, Professor of Computer Science at the University of Edinburgh, and a Fellow of Darwin College, Cambridge. In 2004, he was elected Fellow of the Royal Academy of Engineering; in 2007, he was elected Fellow of the Royal Society of Edinburgh; and in 2017, he was elected a Fellow of the Royal Society. It is a long list of achievements and accolades that I could go on talking about, but rather than me going on, Chris's talk will do the talking. I'm sure he won't disappoint. So, without much further ado, Chris, all yours.

>> Thank you very much. Thanks for the invitation to come here. It's a great privilege to be the final speaker. I thought what I'd do for this talk, rather than talk about any particular application or particular algorithm, is to step right up to 50,000 feet and think about machine learning: what is it all about? What are we trying to achieve? And in particular, to give you a perspective on machine learning that I call Model-Based Machine Learning, which you can think of as a compass to guide you through this very complex world.

So, machine learning can be very intimidating. There are many, many algorithms. Here are a few. Every year, hundreds more are published. You've heard about lots today. And especially if you're a newcomer to the field, it's bewildering, it's intimidating. Which ones do you need to learn about? Which ones should you use for your application? It can be really challenging. So, Model-Based Machine Learning is just a perspective that I hope will help you on your journey through machine learning, whether you're working on new algorithms or, in particular, whether you're working on real-world applications.

So, coming back to all of these algorithms, you might be a little frustrated and say, "Why do I have to learn about hundreds or thousands of different algorithms? Why can't these machine learning people just come up with the one universal algorithm?" In fact, maybe they have; maybe it's deep neural networks. "Deep neural networks will solve all of humanity's problems. I don't need to learn about the rest." Well, there's a mathematical theorem. It's proven, so it's unlikely to be retracted anytime soon. It's called the 'No Free Lunch' theorem, and it's by David Wolpert, back in 1996. It says that, averaged over all possible data-generating distributions, which you can think of as averaged over all the possible problems you could ever want to solve, every classification algorithm has the same error rate when classifying previously unobserved points. That means if an algorithm is particularly good at one problem, it will be particularly bad at some other problem. Put another way, there is no such thing as a universal machine learning algorithm. That is not my personal opinion; it's a mathematical theorem.

So, to put it another way, the goal of machine learning is not to find the universal algorithm, because it doesn't exist, but instead to find an algorithm that is in some sense well matched to the particular problem that you're trying to solve. Okay, so this is very fundamental, right at the heart of machine learning. Machine learning, we all know, depends on data. But we cannot learn just from data. We need to combine data with something else: a model, or we can think of this as constraints, or we can think of it as prior knowledge.
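To make the flavour of that 'no free lunch' statement concrete, here is a tiny Python sketch (an illustrative toy, not Wolpert's proof): if you average a fixed classifier's accuracy over every possible labelling of the unseen test points, every prediction rule comes out exactly the same.

```python
import itertools
import numpy as np

# Toy illustration of the "no free lunch" idea (not a proof): average a fixed
# classifier's test accuracy over ALL possible labellings of the unseen points.

n_test = 4  # four previously unobserved test points

def average_accuracy(predictions):
    """Accuracy of a fixed prediction rule, averaged over every possible
    assignment of true binary labels to the test points."""
    accuracies = []
    for truth in itertools.product([0, 1], repeat=n_test):
        accuracies.append(np.mean(np.array(truth) == predictions))
    return np.mean(accuracies)

always_zero = np.zeros(n_test, dtype=int)   # a deliberately "dumb" rule
clever_rule = np.array([0, 1, 1, 0])        # any other fixed rule you like

print(average_accuracy(always_zero))  # 0.5
print(average_accuracy(clever_rule))  # 0.5, the same for every rule
```

No rule wins on average; a rule only wins on the problems whose structure it happens to match, which is the point about prior knowledge that follows.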
I'll use the term prior knowledge, but people call it lots of different things. You cannot learn from data alone; otherwise, we'd have a sort of universal algorithm. So, we need to combine data with this prior knowledge in order to make any progress. Now, we also know, not least from the recent developments in deep learning, that the more data you have, the better. And in some sense, if you have lots of data, you can get away with a little bit of prior knowledge. Or conversely, if you're in a world where you have very limited data, then you need to complement that with a lot of prior knowledge, very strong assumptions about the problem you're trying to solve.

Now, what's interesting is the meaning of this vertical axis. What do we mean by a lot of data? This is a really important point. I want to talk about big data and what we mean by the size of a data set, because there are two completely different meanings of the size of a data set, and it is very important not to get them confused. There's the computational size, which is just how many bytes it takes up on disk. And there's the statistical size, which relates to its information content.

So, let me illustrate this with a couple of, sort of, corner cases. For the first example, imagine we have a block of metal. We apply a voltage, current flows through the block of metal, and we're going to measure how much current flows when we apply a particular voltage. We've got seven measurements here: we've applied seven different voltages, we've measured the corresponding values of current, and our goal is to generalize. This is a machine learning problem, and so our goal is to predict the current for some new value of voltage at which we haven't yet made a measurement. Now, in this case, some kind and friendly physicist has come along and told us about Ohm's law. And Ohm's law just says that current is proportional to voltage. It's a straight line through the origin. The only thing we have to learn is the slope. The data points I have shown have measurement errors. These are real-world measurements; they're a little bit noisy. If they weren't noisy, one data point would determine the slope exactly. But the data points are noisy and there's only a finite number of them, and so we don't know the slope exactly. But if we've got seven measurements, and the noise is not too high, we can be pretty confident about that slope. There's not very much uncertainty. This is a data set which is computationally small, because it's seven pairs of floating point numbers. So, computationally, it's a tiny data set, but statistically, it's a very large data set. In other words, if I gave you another million measurements of current and voltage, then your uncertainty on the slope would get a little bit smaller, but it's already very small. So, the next billion data points are not going to make a lot of difference. You're already in the large data regime from a statistical point of view.

Think about another corner case. Imagine we're going to have some images, and I'm going to label the images according to the object that they contain. So, it might be airplane, car, and so on. And these images might have millions of pixels, occupying many megabytes each on your disk. And we might have a billion images of each class: a billion examples of airplanes and a billion examples of bicycles. So, this is going to take up a huge amount of disk space. This is a data set which is computationally very large, big data in the usual sense. But what about statistically?
Well, let's imagine I'm naive and I just treat these images as vectors and feed them into my favorite, whatever, neural network as a classifier. If you think about the airplane, the airplane could be anywhere in the image; that's two degrees of freedom. Actually, it can be at any distance as well, so three degrees of freedom of translation, three degrees of freedom of rotation. Airplanes come in different colors, different shapes, different illuminations, and all of these degrees of freedom can be taken together combinatorially. So, if I showed you one image a second, and you'd all agree that every image was an airplane, how long before I run out of images? Well, the answer is far longer than the age of the universe. I mean, the number of images that we'd all agree are airplanes is vast compared to the number of electrons in the universe. So, if you just have a very naive approach to classifying these objects, then even a billion images of each class is a tiny, tiny data set. So, it is a computationally large data set that is statistically small. I'll just go back for a second to the previous picture: this axis refers not to the computational size of the data, but to the statistical size. So, that's the concept of prior knowledge and data, and the concept of the size of the data set.

So, coming back to this problem, then, of which algorithm am I going to use: how am I going to address this problem of thousands and thousands of different algorithms? I want to introduce you to the philosophy, if you like, of Model-Based Machine Learning. But it's a very practical philosophy. The idea of this is not to have to learn every algorithm there is. It is not to try out every algorithm and empirically see which works best. The dream of Model-Based Machine Learning is instead to derive the appropriate machine learning algorithm for your problem, essentially by making this prior knowledge explicit, and I will show you how that works in a minute. So, traditionally we say, how do I map my problem onto one of the standard algorithms? And often that's not clear, and so, typically, people will try out lots of different things: they try decision trees and neural nets and support vector machines, and so on. Instead, in the model-based view, we say: what is the model that represents my problem? What is the model that captures my prior knowledge? And so, by forcing ourselves to make these prior assumptions explicit, we have a compass to guide us to the correct algorithm, or set of algorithms.

So, the idea is that the machine learning algorithm is no longer the first-class citizen. Instead, it's the model. It's the set of assumptions, and the set of assumptions is specific to the problem you're trying to solve. For one problem you'll have one set of assumptions; for a different problem you'll have a different set of assumptions, and you will arrive at different algorithms. And that's why there's no such thing as a universal algorithm. The algorithm is tuned to the particular problem we're trying to solve, and that's reflected in this domain knowledge, these assumptions, this prior knowledge. So, we take the model, the prior knowledge, and we combine it with an inference method. The inference methods tend to be fairly generic. The inference methods are things like gradient descent if it's a neural net, or expectation propagation if we're looking at graphical models.
These are general techniques for optimizing, or for computing the posterior distribution of, the parameters of a model, and together they define the machine learning algorithm. So, the dream is: you write down your assumptions explicitly, you choose an appropriate inference method, and then you derive the machine learning algorithm. And when you apply it to your problem, it will be wildly successful. So, that's the dream. Now, we're not entirely there yet, but I'll show you some great examples.

Let's talk a little bit about the assumptions that go into models. If you look at a deep neural net, you might think, well, they're not making any assumptions; they're just generic, universal machine learning algorithms: you pour data in one end and the magic comes out the other. So, where are the assumptions in a neural net? Let's look at, if you like, the simplest neural net; I suppose it's logistic regression. This is making a very, very strong assumption. This is a lot of prior knowledge. On that prior knowledge axis it sits high up, because it's restricting us to a very, very narrow domain. It's making very strong assumptions. It's saying that the prediction Y is some linear combination of the inputs passed through some simple nonlinearity. That's a very, very specific model. And if we have multiple outputs at the same time, then we arrive at a single-layer neural net. That's like lots of logistic regressions happening all at the same time, and that, of course, was the type of model that people were excited about in the first wave of neural nets, in the days of the Perceptron and so on.

The second wave of excitement about neural nets, in the late 1980s and early 1990s, was when back-propagation came along and we learned to train two-layer nets, in which the features themselves could be learned from data, and it was a very exciting time. I actually made a crazy decision, which was to abandon what was a very successful career in physics, because I'd read Geoff Hinton's paper on backprop and I thought, "Wow, machines that can learn, artificial intelligence. This is the future." And I gave up my career. I persuaded my boss to buy me a computer, I taught myself to program, I'd never done that before, got some C code and started hacking away. So, that was the second phase of excitement around neural nets. Then, of course, they went away again. They didn't really go away; they became rather niche. People moved on to other things. Support vector machines were very popular for quite a while. And then along came deep learning, where we learned how to train many, many layers; that's deep learning. By the way, the story I heard from Geoff, and I don't think he'd mind me telling you this, is that he was rather fed up because he'd discovered back-propagation with colleagues, and they'd been quite successful, but then they were overshadowed by support vector machines, which is kind of a funny sort of approach to machine learning in a way. So, when he finally got neural nets to work properly, he decided to call them deep learning, because that allowed him to call support vector machines shallow. That's the real reason.

So, what prior knowledge is built into this? Well, again, there's a lot of prior knowledge. It says the output is determined by this hierarchy of processing. So, let's take an example. Let's imagine I'm going to take a photograph, and I'm going to classify that image as either happy or sad. Now, what does the computer see? The computer sees pixels. So, how does a deep neural network solve it?
Well, in the deep neural net, the first layer is looking for things like contrast: dark regions next to light regions. The next layer combines those local contrast detectors to detect edges, rows of pixels in the image where a dark region is separated from a light region. Maybe the next layer looks at where edges end or where they change direction, so it looks for things like corners. A little bit further up, the corners get combined together to make shapes, things like faces, perhaps expressions on faces, objects that you see in the image. Maybe the next layer up is looking at the relationships between objects. Maybe there's a birthday cake, maybe there are candles, maybe there are people, maybe the people have smiles on their faces; maybe at this point you've got a lot of evidence that this is a happy image. Our brains are like that too. They have these layers of processing. They have centre-surround responses, oriented edge detectors, and so on. And when we train artificial systems, we find similar structures in the layers of visual processing to those we find in the brain. So, there's one very strong piece of prior knowledge built in, which is this hierarchical processing, and it seems to be very effective. What's really going on, the reason deep learning is working so effectively, one way of saying this, is that there are lots of problems in the world, including the image processing example I just gave you, where this hierarchical structure works well on real applications. Or, to put it another way, the prior knowledge that's built into these deep networks resonates well with the kinds of problems we're trying to solve using these networks.

Let me say a little bit more about data and prior knowledge, and look at some of the other assumptions that are built into neural nets. So, let's imagine now that I've got a set of images, and my goal is to classify the images according to whether they contain a person or they don't. So, here's an image, and this image does contain a person, and what we know is that that classification does not depend on where in the image the person is located. So, these are all examples of images that contain a person. Now, in terms of the vector of pixels, they're all very different, but they all belong to this class. If I want to build a system that can detect a person irrespective of where the person is in the image, then one way to do it is to go and collect huge numbers of images with people in all possible locations, and then the system will learn that they're all examples of people. The challenge there, of course, is that, a bit like the airplane example, we're in this very high-dimensional space. I need many, many examples of images just to capture this notion that the classification doesn't depend on location. So, that's very wasteful of data. Another way of doing it is to generate synthetic data. So, maybe I don't have data of people in lots of locations, but maybe I've got just one image of a person in one location. I can create synthetic data in which the person is moved around into different positions. So, that's another way of building in prior knowledge: not building it into the model, but effectively augmenting the data, and that's quite commonly used. Again, that's quite wasteful, because I have to replicate the data, and we end up with a computationally large dataset.
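As a rough illustration of that augmentation idea, here is a minimal Python sketch (the 28-by-28 image, the label, and the shift amounts are made-up placeholders) that manufactures translated copies of a single labelled example.

```python
import numpy as np

def translate(image, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, padding the gap with zeros."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

image = np.random.rand(28, 28)   # stand-in for one labelled "person" image
label = 1                        # hypothetical label: contains a person

# Manufacture translated copies, all carrying the same label.
augmented = [(translate(image, dx, dy), label)
             for dx in (-2, 0, 2) for dy in (-2, 0, 2)]
print(len(augmented))            # 9 synthetic training examples from 1 original
```

Nine images now carry the same label and the same information, which is exactly the computationally large but statistically redundant situation just described.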
It would be much smarter if we could just give it one example of a person and then bake into the model the prior knowledge that the output doesn't depend upon location. We call that translation invariance. The way we do that in neural nets is through convolutional neural networks. So, this is the input image, and we have a convolutional layer. In the convolutional layer, each node looks at a small patch of the image. The node next to it looks at the next small patch, and the weights going into the blue node and the red node are shared. So, as they adapt during training, they're always in lockstep. Whatever this blue node learns to detect, the red node will detect exactly the same thing, but moved slightly, because it's a convolutional layer. That's followed by a sub-sampling layer. Again, this node looks at a small patch of the convolutional layer, and it might do something like take the max. So, imagine there's something in this image which causes the blue node to respond, and that causes this node to respond. Now, we move it slightly. Now, instead, the red node will respond, but again, because we're doing something like a max, this node will still respond. So, this now exhibits translation invariance: it responds even though the image has moved slightly. Now, what we do in practice is repeat this many times: we alternate another convolutional layer, another sub-sampling layer, and it's sub-sampling because the resolution of each layer is lower than the one below. Eventually, when we get to the output, we have a few outputs, and we have translation invariance. So, if we move things around, the output stays the same. This actually encodes a more general kind of translation invariance, because imagine part of the input is translated and not the other part. Again, the output will be invariant. So, it's exhibiting local translation invariance. Think of a rubber-sheet deformation. Imagine I've got that birthday party, and some of the people move around while the birthday cake stays where it is: it's still a happy scene. So, we've got local as well as global translation invariance. You can see this is not a universal black box. This has got a lot of strong prior knowledge baked into the structure of the network. And if you don't have those sorts of structures, good luck with classifying airplanes and all the rest of it; you're back in that exponential space again.
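Here is a minimal one-dimensional sketch of that convolution-plus-pooling idea (a toy example with a hand-picked two-weight filter, not a full network): shared weights are applied to every patch, and a max over the responses gives an output that does not change when the pattern shifts slightly.

```python
import numpy as np

def conv_layer(x, w):
    """'Convolutional layer': the same shared weights w are applied to every patch."""
    return np.array([np.dot(x[i:i + len(w)], w)
                     for i in range(len(x) - len(w) + 1)])

def max_pool(responses):
    """'Sub-sampling layer': keep only the maximum response."""
    return responses.max()

w = np.array([1.0, -1.0])   # a shared, hand-picked edge-detecting filter

x         = np.array([0, 0, 1, 1, 0, 0, 0], dtype=float)  # an "edge" in a 1-D signal
x_shifted = np.array([0, 0, 0, 1, 1, 0, 0], dtype=float)  # the same edge, moved along by one

print(max_pool(conv_layer(x, w)))          # 1.0
print(max_pool(conv_layer(x_shifted, w)))  # 1.0, unchanged: translation invariance
```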
Okay. So, to summarize where we've got to so far: we've talked about the fact that there isn't a universal machine learning algorithm; that the goal is to find an algorithm that's good on the particular dataset that we have; that this depends upon combining the data with prior knowledge; and the dream is that by being explicit about the prior knowledge, and combining it with an inference algorithm, we'll derive the machine learning algorithm, instead of having to read 50,000 papers and implement and compare them all. I want to introduce another concept now in machine learning. So, machine learning, as you know, and in particular the breakthrough of deep learning, has generated this tremendous hype and excitement around artificial intelligence. Now, artificial intelligence, the aspiration, goes back certainly to Alan Turing, seven decades ago. And the goal is to produce machines that have all of the cognitive capabilities of the human brain. It's a great aspiration, and we're a very long way from achieving it. We've taken a tiny step towards it with the recent developments in machine learning.

So, does that mean that all this hype about artificial intelligence, all the excitement and the billions of dollars of investment, is a waste of time, because it's all decades and decades away? In my view, no. In my view, all of the excitement around machine learning is totally justified, but not because we're on the brink of artificial intelligence; maybe we are, maybe we're not. Maybe it's centuries away, or maybe it's next year; I have no idea. But there is something happening which is revolutionary. It's transformational. And it's the transformation in the way we create software, and I'm not really talking about the development process. I'm talking about the fact that ever since Ada Lovelace programmed the Analytical Engine for Charles Babbage, she had to specify exactly what every brass gear wheel did, step by step. And software developers do the same thing today. It's a cottage industry in which the developer tells the computer exactly what to do, step by step. Now, today, of course, the developer doesn't have to program every transistor. They'll call some API which invokes a million lines of code written by other developers, and there are compilers down to machine code and all the rest. So they're very effective, very productive compared to Ada Lovelace. But, fundamentally, we're still telling the computer how to solve the problem step by step.

Now, with machine learning, we're doing something radically different. Instead, we're programming the computer to learn from experience, and then we're training it with data. The software we write is totally different. The software we write often has a lot of commonality. So we use neural nets to solve speech recognition problems, communication problems, and so on, adapted each time according to the prior knowledge of our domain, of course. But we're doing something radically different. I think this is a transformation in the nature of software which is every bit as profound as the development of photolithography. Photolithography was a singular moment in the history of hardware. Ever since the days of Charles Babbage and gear wheels, vacuum tubes, transistors, logic gates, computer hardware has been getting faster and cheaper. And then we discovered how to print large-scale integrated circuits using photolithography. And with that came a transformation, because it went exponential. That's Moore's law. We print circuits, and the number of transistors on a circuit doubles every 18 months. And as they get smaller, they get faster. Amazing things happen. That's why we have the tech industry we have today. Why are we all carrying supercomputers around in our pockets? Because of photolithography, because of Moore's law. Something similar may be happening in software, because the way we're creating these solutions is by programming the computer to learn from experience and then training it using data. And we see a Moore's law of data: the amount of data in the world is doubling every year or two. And so we are on the brink of something tremendously exciting and all-pervasive through machine learning. That's real; that's happening right now. One of the things it might lead to is artificial intelligence. But even if it doesn't, or even if artificial intelligence is decades away, this is going to transform every aspect of our lives. One of the areas that I'm hoping it'll transform is healthcare, and that's a personal interest of mine, but it will be all-pervasive.
And I do think it's transformational. I've got the yin and yang diagram here because I think there's a kind of flipside to learning from data, which is quantifying uncertainty. So, again, go back to traditional computer science. It's all about logic. It's all about zeros and ones. Everything is binary. The engineers at Intel and ARM work really hard to make sure every transistor is unambiguously on or off. But in the world of learning from data, we're in the world of uncertainty. We have to deal with ambiguity, so uncertainty is everywhere. Which movie does the user want to watch? Which word did they write? What did they say? Which web page are they trying to find? Which link will they click on? Which gesture are they making? What's the prognosis for this patient? And so on. In all cases, we never have a definitive answer. We're never certain which link the user is going to click on. But they may be much more likely to click on one link than another, and we can compute that likelihood using machine learning. So uncertainty is also at the heart of machine learning. There's a transformation from logic to thinking about uncertainty. Of course, you all know there's a calculus of uncertainty, which is probability. Again, there are mathematical theorems which show that if you're a rational person and you quantify uncertainty, you will do so using the rules of probability, or something that's mathematically equivalent to them. So, again, that's a mathematical foundation that was laid a long time ago. That's not going to change.

Let's think just very briefly about two perspectives on probability. What do we mean by probability? Well, when we're in school, we usually learn a little bit about probabilities. We learn the frequentist view: the limit of an infinite number of trials, a frequency interpretation of probability. But I'm sure many of you know there's a much broader interpretation, which is that probability is a quantification of uncertainty, and that's the Bayesian perspective. It's almost unfortunate that both are called probability, but the mathematical discovery is that if you quantify uncertainty using real numbers, those numbers behave exactly the same way as the frequencies with which dice throws behave. And so we call it probability. The fact that we use the same name for both has, I think, caused a lot of confusion over the years.

Let me just give you a little example. Hopefully, this will shed some light on it. Imagine we've got a coin, and the coin is bent. The coin is not equally likely to land with one side up as the other. Let's imagine that if I flip the coin, there is a 60 percent probability it will land concave side up, and a 40 percent probability it will land concave side down. Let's just imagine that's the physics of this particular bent coin. What do we mean by 60 percent probability? We mean that if we flip it many times and compute the fraction of times it lands concave side up, then as we go to the limit of an infinite number of trials, that fraction, which will be a sort of noisy thing, will settle down and asymptote to some number, and that number will be 0.6. That's the frequentist view of probability. Now, let's suppose that one side of this coin is heads and the other side is tails, but imagine you don't know which is which. All you know is that the coin is bent, and there's a 60 percent probability of it landing concave side up. Okay. So, Victor, I'm going to make a big bet with you, a thousand dollars, about whether the next coin flip is going to be heads or tails.
Now, you're a very rational and very intelligent person. How are you going to bet? You're going to bet 50-50. It's sort of obvious, right? It's symmetry. But Victor doesn't believe that if we repeat the experiment many, many times, half the time it will land heads up and half the time heads down. What he believes is that it's either 60 percent heads or 40 percent heads; he just doesn't know which. You see, we are flipping the same coin each time, but we don't know which side is heads. So, the frequency with which it lands concave side up is like a frequentist probability, but the uncertainty about whether the next coin flip is going to be heads or tails is like a Bayesian probability. And so, imagine I've got this bent coin behind the desk here, and I'm flipping the coin. And I'm honest and truthful, and I'm telling you whether it's heads or tails. The more data you collect, the more you can discover about whether heads is on the concave side or the convex side. As you collect data, your uncertainty about whether heads is on the concave side or the convex side gradually reduces. And in the limit of an infinite number of trials, there's no uncertainty left at all: you're completely certain about whether heads is on the concave side or the convex side. You still don't know whether the next coin flip is going to be heads or tails. But once you're certain which side heads is on, you know the probability, 60 percent or 40 percent, that the next flip will be heads. I hope that illustrates the difference between Bayesian and frequentist probabilities. That's the simplest example I can think of.

At this point, you might be thinking, why am I making so much fuss about this? Because I've said that in traditional computing everything is zero or one, and now everything is going to be described by probabilities, which lie between zero and one, and it seems like a tiny change. It seems like just a little tweak. So, here is my illustration of why it's not a little tweak, why it's a profound difference. Imagine, here's a bus. And let's suppose the bus is longer than the car. And we'll suppose that the car is longer than the bicycle. Okay. Now again, I know you're all smart people. So, if I say the bus is longer than the car, and the car is longer than the bicycle, do you all agree that the bus must be longer than the bicycle? Okay. If anybody doesn't agree, go back to the beginning of the class or something. That's a very well-known property. We call it transitivity. And here's the amazing thing. When we go to the world of probabilities and uncertainty, transitivity need no longer apply. And there's a really simple example of it. It's these things. These are called Efron dice, or non-transitive dice. They're standard dice, except they have unusual choices of numbers. And let's say, again, we're determined to get some money out of Victor, so I'm going to make a bet that we're going to have a game of dice. We're going to roll the dice 11 times, an odd number, and whoever gets the greater number of wins is going to get the money. Well, it turns out that the orange die will beat the red die two-thirds of the time. So, two-thirds of the time, the orange number will be bigger than the red number. Big deal. If I play the orange against the blue, two-thirds of the time blue will give a bigger number than orange. Two-thirds of the time, green will give a bigger number than blue. And now, here's the amazing thing.
The bicycle is also longer than the bus, because two-thirds of the time, red will give a bigger number than green. Now, if that isn't counter-intuitive, I don't know what is. It's bizarre, right? It's extraordinary, and it's just a consequence of the fact that these are uncertain numbers, they're stochastic numbers. And the way it works is actually very simple. These are the numbers on the different dice. The orange one actually always rolls a three, as it happens. On the red one, two-thirds of the time you get a two, and one-third of the time you get a six. So, it's obvious that two-thirds of the time, orange gives you a bigger number than red. And I'll leave it as an exercise for you to check the others. So, occasionally, in my copious spare time, I go and give talks in schools to try and inspire the next generation with the excitement of machine learning, artificial intelligence, computer science. We actually hand out packs of these dice to the kids. And if you go to that link, you can read a little bit more about it, and you can see those numbers and check for yourself that this is real. So again, I think this is quite a profound shift, from the world of logic and determinism to, if you like, the real world of uncertainty.

At this point, I was going to show a demo, and sadly I can't, so I'm just going to skip over this. The demo was simply an example of machine learning in operation, where the machine learns about my preferences for movies, and it actually does so in real time. As I rate movies as like or dislike, its uncertainty about which movies I like gradually reduces. What you'd be seeing in the demo is what I like to call the modern view of machine learning: not machine learning as tuning up parameters by some optimization process, but machine learning in the sense that the machine has a model of the world, in this case a very simple world, the world of movies that I like or don't like; it has uncertainty about the world, expressed as probabilities; and as it collects data, that uncertainty reduces, because it has learned something, rather like the coin flip example. And we can think of all of machine learning from that perspective.

What I'm going to do now is give you a tutorial in about one slide on a favorite subject of mine, probabilistic graphical models, because I'm going to show you how we're taking steps towards realizing that dream of Model-Based Machine Learning: not just as a philosophy of machine learning, not just as a compass to guide you through this complex space, but as a practical tool that we can use in real-world applications. And to do this, I'm just going to need to give you a very quick tutorial on graphical models. If you know about graphical models already, this will be very boring, and if you don't know about graphical models already, you're not going to learn very much, but at least you'll get a sense of it. So imagine I've got two jars: one of them is green, one of them is blue. And I'm going to pick one of these jars at random, but not necessarily with a 50-50 probability; it might be 60-40 or something. And we're going to describe that by a graphical notation. In this graphical notation, I have a circle representing this uncertain quantity, the variable jar. So jar is a binary variable that's either green or blue, but it's not a regular variable.
It doesn't have a definite value of green or blue; it has a probability of being green or blue. It's an uncertain variable, and this little box just describes that probability. Now, imagine that the jars contain cookies, biscuits as we say, and these biscuits are either circular or triangular, and the proportion of biscuits is different in each jar. So, I can now say: supposing I go to the green jar and I pull out a cookie without looking, then there's a one-third probability that it'll be triangular and two-thirds that it will be circular. If I go to the blue jar instead, there's a one-third probability it will be circular and two-thirds that it will be triangular. Okay? So again, there's some uncertainty. If I draw a cookie out of the jar, we're uncertain about which it is, but we know something: we know this probability. And so cookie, again, is an uncertain variable that's either triangle or circle. It has some probability, but the value of that probability depends upon the value of this random variable jar. So, we can think of this model in what we call a generative way, in which I do an experiment: first of all, I randomly choose a jar, and then, given that jar, I dip in and I randomly choose a cookie, and that tells me the value of jar and consequently the value of cookie. That's a forward model, and it generates data: it generates jars, it generates cookies. And I could repeat that many times. Now, in real applications, this graph, of course, is describing my prior knowledge about the world. I know the world consists of jars and it consists of cookies, and they relate to each other in certain ways. So this graph is a very visual way of expressing that prior knowledge, which, as we've seen, is critical in machine learning. Typically, though, what we do with these graphs is observe something, in this case cookie, and we want to go the other way: we want to work out which jar that cookie came from. So, maybe there's a 60 percent chance that the jar is green, so it's more likely to be green. But now I observe the cookie, and I observe that the cookie is triangular. Now, your intuition says that if it's triangular, it's more likely that it came from blue than from green. And that's correct. So when you run the math, you just use Bayes' theorem, it's very simple, and you find that the probability over jar shifts a little bit towards blue, just as your intuition would expect. And that, if you like, is the machine learning process. We've observed that I like a particular movie, and the internal state of the machine gets updated, using a sort of Bayes' theorem on steroids, to say I'm a bit more likely to like action adventure than romantic comedy, or whatever it might be. And that's a crash tutorial, but Chapter 8 of this amazing book, which I'm sure you all have, I hope, is a free PDF download, and it's a whole chapter on graphical models.
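Running the numbers for that jar-and-cookie example makes the Bayes' theorem step concrete. A minimal sketch, using the 60/40 prior and the cookie proportions just described:

```python
# Posterior over the jar after observing a triangular cookie, via Bayes' theorem.
prior = {"green": 0.6, "blue": 0.4}                    # P(jar)
likelihood_triangle = {"green": 1 / 3, "blue": 2 / 3}  # P(cookie = triangle | jar)

unnormalised = {jar: prior[jar] * likelihood_triangle[jar] for jar in prior}
evidence = sum(unnormalised.values())                  # P(cookie = triangle)
posterior = {jar: p / evidence for jar, p in unnormalised.items()}

print(posterior)  # roughly {'green': 0.43, 'blue': 0.57}: shifted towards blue
```

The belief shifts from 60/40 in favour of green to roughly 43/57 in favour of blue, just as the intuition above says it should.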
Okay. So, let me illustrate now. I'm going to pick a particular machine learning algorithm. It's called PCA, or Principal Components Analysis, something everybody learns about in Machine Learning 101. First of all, we're going to describe PCA the way you'd normally learn about it from a textbook, and then I'm going to show you how to derive PCA using the model-based perspective, and we'll use those graphical models. So, PCA as an algorithm is like a recipe, a recipe that you apply to data. First of all, it says: take the data. The data will be vectors in some high-dimensional space, and there are N of them. It says: average those vectors to compute the mean; then subtract the mean from all of those vectors and compute this thing, which is the sample covariance matrix; then find the eigenvalues and eigenvectors of the sample covariance matrix; and then keep the eigenvectors corresponding to the M largest eigenvalues. That, in some sense, has compressed the data, or projected it down onto an M-dimensional subspace, in a way that preserves variance. So, that's principal components as a recipe. You could code that up, turn the handle, and out would come the answer, and you'd have no idea why you picked that recipe. Maybe it works brilliantly and you're done, but what if it doesn't work well? What are you going to do now? How are you going to change the recipe so that it works better? If you have no compass, you're just left with random trial and error.

So, here's a much better way of thinking about things. This is PCA viewed as a model. In the same way that we pick a jar and then choose a cookie from the jar, I'm going to describe to you how to generate the data, because one way of capturing your prior knowledge is to write down how the data gets generated. In this case, it says: pick a vector in a lower-dimensional subspace from a Gaussian distribution having zero mean and unit variance, a circular Gaussian distribution. Then project it into the high-dimensional space of your data with some linear transformation. And then, finally, generate a data point by taking that projected point, making it the centre of another Gaussian distribution that represents the noise, and picking a sample from that. Don't worry about the details; it's just a description, like choosing one of the jars and then picking a cookie: choose the low-dimensional vector and then generate the high-dimensional vector by adding noise. There's another little notation here, called a plate, which just says: repeat that process N times. So, it says, put the cookie back in the jar, give it a good shake, close your eyes again, pick a jar, pick a cookie from the jar, and do it N times.

So, that generative process, describing how the data gets generated, is a great way to express our prior knowledge. But when we do machine learning, we're trying to solve an inverse problem. We have to go back the other way, which is much harder. We observe the data, and we have to make inferences about the points in the lower-dimensional space, and also about the values of the parameters of the linear transformation. And so, we have to run inference. And again, there's a mathematical proof that this is identical: if you use what's called maximum likelihood to do the inference, that is to say, if you choose all the parameters to maximize the probability of the data under the model, you exactly recover PCA. Now, at this point, you might think, "Ah, that's a lot of work just to get back to PCA." They're completely equivalent, so why is the model-based view so much better? The reason is that if this doesn't do what you want it to do, you can go back and examine those assumptions. And you can change the assumptions to better reflect the problem you're trying to solve, and then re-derive the model. You haven't just got a recipe; you've got a procedure for arriving at the best model for your problem.
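Here is a minimal sketch of that generative view of PCA, often called probabilistic PCA (the dimensions and noise level below are made up for illustration): sample a low-dimensional latent vector, map it linearly into data space, and add Gaussian noise; applying the classical recipe to data generated this way recovers the same subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, N = 10, 2, 500           # data dimension, latent dimension, number of points
W = rng.normal(size=(D, M))    # linear map from the latent subspace to data space
mu = rng.normal(size=D)        # data mean
sigma = 0.1                    # observation noise standard deviation

# Generative process: latent z ~ N(0, I), then x = W z + mu + Gaussian noise.
Z = rng.normal(size=(N, M))
X = Z @ W.T + mu + sigma * rng.normal(size=(N, D))

# The textbook PCA recipe, applied to data generated this way, recovers
# the subspace spanned by the columns of W (up to rotation and scaling).
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(cov)
principal_subspace = eigvecs[:, -M:]   # eigenvectors of the M largest eigenvalues
```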
So, let's take a simple example. Suppose the data points are not generated independently. For example, let's imagine I'm air traffic control, and I want to know where the aeroplane is. The aeroplane is flying across the sky, and once a second my radar sends out some energy; it bounces off the aeroplane, comes back, I receive it, and I make a measurement of where the aeroplane is. Now, the problem is that that measurement is noisy. So, if I just make a single measurement, I'll know where the aeroplane is roughly, but there'll be some uncertainty. Now, we know that if that's just random noise, then by making multiple measurements I can sort of average out the noise and get a more certain estimate of where the aeroplane is. So, we're going to make several measurements. The problem is, the aeroplane is moving. As I make these measurements, it's moving. If I just average the measurements, that will be great, because I'll average out the noise, but I'll also average out the location, which is what I'm trying to find. So that's bad news. If I don't average, if I just use the latest measurement, I won't be averaging over the motion, but I'll have a lot of noise. So what should I do? Well, you could have some intuition. You could say, "Hmm, I should take the latest measurement, because that's where the aeroplane is, but I'll add in a bit of the previous measurement to get rid of some of the noise, and maybe a little bit of the measurement before that. But the measurement from ten minutes ago is irrelevant." So you'd have some sort of weighted average, in which you give more weight to the more recent measurements. That's sort of your intuition. Actually, that intuition turns out to be good; that's essentially what you should do. But how much weight should you give? What sort of function should you use for this decay? How much should you decay by? How do you know what to do? You're back in the world of recipes, intuition, trial and error.

So, instead of that, let's build a model in which we are very explicit about all the assumptions we're going to make, because that's more likely to work well, and if it doesn't, we'll know how to change things to improve it. So, we're going to say that this is the actual position of the aeroplane in space. It's the thing we want to know. We don't know it; it's unknown. So, the aeroplane is in some position, and then we make a measurement. The measurement is noisy, so this is the noise process, but we know its value: it's the thing we observe. This is the observed position, a noisy measurement of the true position. Given that alone, we could estimate the position, but with a lot of uncertainty. What happens now is that the aeroplane moves across the sky, and we can build a model for that. The simplest model we can have is to assume that the uncertainty in the position of the aeroplane is Gaussian, that the measurement noise is Gaussian, and that the movement of the aeroplane across the sky is described by a linear model. So, given its position and its velocity, we can compute where it will be at the next timestep. Then, again, we make another noisy measurement at that next timestep. The aeroplane moves a little bit further, we make another measurement, and so on. So, that's the generative process.
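A minimal simulation of that generative story (my own toy version, assuming a one-dimensional constant-velocity model with Gaussian process and measurement noise):

```python
import numpy as np

rng = np.random.default_rng(1)

dt, steps = 1.0, 20
process_noise, measurement_noise = 0.05, 1.0

A = np.array([[1.0, dt],        # linear dynamics: position += velocity * dt
              [0.0, 1.0]])
state = np.array([0.0, 2.0])    # true (unknown) initial position and velocity

true_positions, measurements = [], []
for _ in range(steps):
    state = A @ state + process_noise * rng.normal(size=2)  # the plane moves on
    z = state[0] + measurement_noise * rng.normal()         # noisy radar return
    true_positions.append(state[0])
    measurements.append(z)

# Inference goes the other way: from `measurements` back to beliefs
# about the unobserved true positions.
```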
But now, what we need to do is to run inference. Given these observations, we need to revise the probabilities of these aeroplane locations. We can do that using, sort of, Bayes' theorem, a more complicated version of Bayes' theorem. And it turns out that that problem can be solved in a very elegant way, computationally, by passing messages around the graph. We don't have time to go into that; it's a very beautiful mathematical solution called message passing, and it's very generic. But this thing turns out to have a name. It's called the Kalman filter. It's been around since the 1950s or so; it's very standard stuff in electrical engineering. When I was writing my 2006 textbook, I had a chapter on these time series models, and I read several books with titles like 'Introduction to Kalman Filters'. I found them pretty impenetrable: it all looks very complicated, many, many chapters before you finally get to the result. This is, by far, the simplest way of deriving the Kalman filter that I know: just derive message passing in its generality and apply it to this linear-Gaussian model. And you get the Kalman filter equations, in which the posterior probability of the position of the aeroplane at this time depends upon all of the measurements, but is more sensitive to the recent measurements. So you do get that decaying weighting of the evidence, but in a very precise way that you derive from the mathematics. You can even pass messages in the other direction, sending information back in time, and get a better estimate of where the aeroplane was by making use of future measurements, again just as your intuition would indicate. And guess what? If these variables are not Gaussian but discrete, again you just pass messages back and forth, and now it's called the hidden Markov model. That's a completely different literature, with completely different notation and completely different, but equally impenetrable, derivations of how all this goes. Yet it's exactly the same model, just with slightly different assumptions.

And maybe this works quite well, or maybe it doesn't work quite well enough. So, you try this out on your problem, and you find it's still not working quite well enough; you know what to do. It could be that there's a problem with the data that you've collected. Maybe there's a problem with the inference, because most inference algorithms are approximate; for the Kalman filter the inference is exact, but once you get to more complex models, you almost always need approximate inference, and maybe your inference algorithm has some issues. Or maybe your prior assumptions were not correct; maybe you need to refine them for the problem you're trying to solve. And you know how to do that, because you made them explicit. So, maybe this noise isn't Gaussian. Maybe with real radar it isn't, so you go and talk to a radar engineer, find out what the noise is really like, and then model it. And you get better results.

Okay. So, I think this is more or less my final slide. What I've shown you so far is really a philosophy, a viewpoint of machine learning, that I hope provides you with a compass to guide you through this complex morass of algorithms, but also a practical tool to use when you're building real-world applications. But at the back of our minds, we have a dream. And the dream is that we can somehow automate this. We can provide tools so that people who haven't read all the textbooks on neural nets, I mean machine learning and so on, can use it. You do need to buy the textbooks, by the way.
You don't need to read them, just so everybody is clear about that, or one in particular, anyway. But rather than reading all that stuff and learning all about this, can we automate it? Can we provide tools that will help democratize this approach to machine learning? So, this is the dream. If you think about coding up inference for a complicated problem, like the movie recommender example, it's pretty complicated stuff: thousands of lines of code, written by machine learning experts who know about the modelling, know about inference, and know how to code up the inference in the context of those models. This is all complicated stuff, all written in C++ or whatever your favorite language is, compiled down to machine code, combined with the data; lots of compute happens and you get your predictions with uncertainty. What if, instead, we could write a thing which we call a probabilistic program? A probabilistic program is just a very short piece of code, written in some appropriate language, which effectively describes what that probabilistic model, that graphical model, describes. So, it will almost say: pick one of the jars with this probability, and then, for that jar, pick a cookie with a certain probability; or, the aeroplane is in this position in the sky, and one second later it's moved to a new position, and I'm going to make measurements, and the measurements have Gaussian noise, or something like that. It's just a simple description in a few lines of code, maybe, if we're lucky, tens of lines of code, that describes the generative process of the data, or describes in a very clear, intuitive form the prior knowledge that we're baking into our model. And then we have a piece of magic, which is a probabilistic program compiler, which takes this high-level description and generates those thousands of lines of code automatically. So that's the dream. We haven't achieved the dream, but we have made a lot of progress. We've built a compiler, and you can download Infer.NET; there are lots of tutorials and examples and so on. Infer.NET doesn't cover every possible case, but it covers a lot of common cases, and for those cases where it is applicable, you do have this automation. And, of course, the whole time we are looking to extend it and generalize it. So, it is quite an exciting program of research.

And so, I'm going to leave you with this. This is the graphical model, with the random variables, the probabilities, and the plates, for the movie recommender problem, the problem of recommending movies. So here we have users. I'll stand back so I can read the writing. Okay: user bias, feature weights. So, what we've got here are features of the user; it might be age, gender, geographic location, anything which might influence which movies they like. Here we've got features of the items; it might be the duration of the movie, the actors, perhaps the genre: action, adventure, romantic comedy and so on. And then we also have in here the information which we call collaborative filtering: people who've liked the movies you've liked so far also liked these other movies, so perhaps you'll like them too. But not coded up as some sort of hacky piece of intuition; it's just described by a probabilistic model, a very precise probabilistic model. And so, this can be cast in a few dozen lines of Infer.NET code, and then the inference algorithm can be compiled automatically.
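As a rough illustration of the kind of generative description involved (a hypothetical Python sketch, not Infer.NET code and not the actual model on the slide), a recommender of this flavour might be written down like this:

```python
import numpy as np

rng = np.random.default_rng(2)

n_users, n_movies, n_traits = 100, 50, 4   # made-up sizes for illustration

# Latent traits for users and movies (the collaborative-filtering part),
# plus a per-movie bias; all of these are uncertain and inferred from ratings.
user_traits  = rng.normal(size=(n_users, n_traits))
movie_traits = rng.normal(size=(n_movies, n_traits))
movie_bias   = rng.normal(size=n_movies)

# Generative process for the observed like/dislike ratings.
affinity = user_traits @ movie_traits.T + movie_bias  # user-movie match scores
noise = rng.logistic(size=affinity.shape)             # observation noise
ratings = (affinity + noise > 0).astype(int)          # 1 = like, 0 = dislike

# Inference runs in reverse: given the observed ratings, revise the
# probabilities over the latent traits, then predict the unseen entries.
```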
And right down here, we have the thing we observe: the ratings. That's somebody saying I like this movie or I don't, or this movie is five stars and that one is one star. Once we make observations from a user about which movies they like, we pass messages up this graph, revise the probabilities for these hidden variables, send messages back down again, and we get revised probabilities, which are ratings for movies the person hasn't yet seen. And so we update the probability that I'm going to like some unseen movie based on the ratings I've given to movies I have seen, plus all the ratings that thousands of other people have given to that movie and other movies. That's how it works. And again, that's all coded in Infer.NET.

And so I leave you with another book, but the good news is this book is online and it's free, and it will be forever more. It's called Model-Based Machine Learning. It's co-authored with John Winn, and John has done overwhelmingly the bulk of the work on this book, so I'm very much the second author. It's really John's baby. This is a very unusual book. There's a little introductory chapter, but thereafter, every chapter is a real-world case study. We've chosen examples from Microsoft, because that's what we know about; these are things we've worked on. And in each case, we start with the problem. The problem we're trying to solve: we're trying to match different players on Xbox so that they'll have a good game, in other words, so that they'll be similarly matched in strength. That's the problem we're trying to solve. We describe the data that we have, we describe the prior knowledge, the assumptions we're going to make, we derive the machine learning algorithm, we test it out on the data, and we find it doesn't work very well, because that is what happens in practice. Anybody who's ever tried machine learning in the real world knows that the first thing you try generally doesn't work. And then we go back to debugging: was there a problem with the data that we collected? Was there a problem with the inference, the approximate inference algorithms we used? Or was it a problem with the assumptions that we made? And so we revise the assumptions and run it again, and of course every chapter has a happy ending: we get good results, and it ships and is used by millions of people. But it's a little bit more honest about the process by which we arrived at those solutions, and it shows you, I hope, for each of these examples, and they're drawn from very different domains, medical examples and so on, how, by making the assumptions explicit, that critical prior knowledge, you get a compass to guide you through the process of revising and refining the solution and getting it to work properly. Otherwise, you're left with a big space of trial and error and not knowing what to try next. So, with that, thank you very much.

>> Thank you, Chris, for that great talk. We'll probably take a couple of questions and then, yeah, you have the mics? Okay, so this hand went up first; maybe one mic for the gentleman here.

>> Hello, sir. I'm [inaudible]. Thank you for the very nice talk. I have one question: by restricting ourselves to the class of probabilistic models, are we losing something? What is your thought on that? Because there are neural nets which are not probabilistic models.

>> Yeah. Several thoughts on that. First of all, the probabilistic view of machine learning is a general one.
The quantification of uncertainty using probabilities is the only rational way to deal with uncertainty. In practice, we often can't deal with probabilities exactly; we generally have to make approximations. One extreme approximation is a point estimate, where we replace some complicated distribution with a single value. That single value might be chosen in some way such as maximum likelihood, for example. So if you're taking a neural net and training it by minimizing an error which is a log likelihood, a log probability under some noise distribution, then you're approximating that probabilistic inference; maybe it's a very drastic approximation. And the bigger and more complex the model, the more data you have, and the more performant you need to be, typically the more radical the approximations you have to make in order to get something that's tractable and sufficient for your application. So it is quite general. Generally speaking, though, although you may not be able to maintain full probability distributions over all of the internal variables, like you did in the movie example, so over all the internal weights of the neural net, nevertheless the outputs almost invariably should be probabilities. So I would say, as a rule, whenever you're making predictions, they should always be probabilistic predictions. One of the problems with support vector machines is that they're just intrinsically non-probabilistic, and there are only ways of fixing that up afterwards. When you make a probabilistic prediction, instead of saying this person has cancer or they don't, you say there's a 37 percent chance they have cancer. First of all, you can threshold it and get back to a decision, but you can do so much more. For example, maybe the cost of taking somebody who has cancer and misdiagnosing them as not having cancer is much worse than taking somebody who's healthy and diagnosing them with cancer, because in the first case they might die and in the second case they might get upset and need some further tests. So those loss measures are very asymmetric, and if you've got probabilities, you can take that into account correctly. You can use probabilities to combine the outputs from multiple systems; it's like a euro of uncertainty, a universal currency, so you can combine different systems. You can do things with thresholds: you can say, I'm going to make a decision when my confidence is above a certain level, and if my confidence is below that level, I'm going to send it off to a human. So if you've got some very repetitive task, say medical screening where people are staring down microscopes all day long, you might be able to help them by taking the, say, 90 percent of the data where the system is very confident it's not cancerous, and leaving everything else to human judgment. That's a very practical thing to do today. So there are lots and lots of advantages to having probabilistic predictions, and no downside. Always, always output probabilities.

>> Okay, Mausam has a question, then we'll take one more. So, two more questions, then we'll wrap.

>> I'm Mausam, from IIT Delhi. Thank you for the very nice talk. I learnt my AI in the early 2000s, and that was the time probabilistic graphical models were at their peak. I'm an applications researcher; I work in natural language processing, and I remember conferences where pretty much all the papers, except a very few, were based on probabilistic graphical models; at some point it became LDA-based, and so on and so forth.
>> Okay, Mausam has a question, then we'll take one more. So two more questions, then we'll wrap. >> I'm Mausam from IIT Delhi. Thank you for the very nice talk. I learnt my AI in the early 2000s, and that was the time when probabilistic graphical models were at their peak. I'm an applications researcher, I work in Natural Language Processing, and I remember conferences where pretty much all the papers except a very few were based on probabilistic graphical models; at some point it became LDA based and so on and so forth. Of course, there's a new world order now, and I find very few such papers in the application areas, and I'm not talking about the people who work on the theory and the fundamentals of Machine Learning; there's a lot of work still going on there, and some unsupervised learning as well. But in the application domains it is all neural networks, left, right and centre, and Probabilistic Graphical Models are either not being tried or have been overtaken, and life has changed. So I want to understand your perspective on the future: in the time to come, what do you see as the role of Probabilistic Graphical Model based solutions in application areas? Do you believe that they will still have a strong role to play, or do you believe that they will be overtaken by neural networks? If they will have a role to play, would it be in conjunction with neural networks? What is the value they will offer, and when is a PGM solution the right solution to approach a problem with? >> Sure. So you've got to understand that Machine Learning, like everything else, is a social enterprise. All right, so let's take neural networks. There was tremendous excitement in the 1960s around perceptrons, because machines could learn, and you could cut 10 percent of the wires and it carried on working, just not quite as well, just like the brain. So, tremendous excitement. Then it all went away again, and then it all came back in the 1980s and 1990s, when neural nets were the solution to everything. Then it all went away again, and then it all came back. So right now, in the application domain, we've got these particular techniques, certain classes of convolutional nets and LSTMs and a handful of other things, working very well on certain problems for which we can get lots of data that we can label up by hand. Many, many practical applications. So it's unsurprising that this tremendous focus of applications is bearing down on this one setup: we've discovered this new technique and everyone is applying it in all kinds of places. That's unsurprising. If you step back and look at the field of Machine Learning, it's a very broad field, and this discriminative training based on hand-labelled data is one tiny corner, which has all kinds of limitations; I think the last speaker covered some of these. There are so many limitations that we're only scratching the surface of what we want to do with machine learning. There's the whole world of reinforcement learning, unsupervised learning, that [inaudible], and somebody mentioned the work of [inaudible] and others, and there are all the issues about bias in learning. Think of the world of Machine Learning as this enormous opportunity that's out there in front of us, and right now there's a whole bunch of people, for understandable and good reasons, focused on using one particular set of techniques in applications. So first of all, probabilities are the foundation: there's a mathematical theorem that says if you're behaving rationally and you're not certain, you're going to use probabilities or something equivalent. So that's not going to go away; I don't think the Maths is going to change. The graphical models are just a very beautiful notation. Personally, I find a picture is worth a thousand equations, and it's just much easier to look at a picture and see what it's saying than pages and pages of Maths. So I don't think the pictures are going to go away any time soon. But your question is really about practical applications, and there are so many applications we've been working on, and you'll see examples in the book,
where just throwing a neural network at the problem is not the right way to go, where actually graphical models are the appropriate tool and technique to use. So I can't predict what the next wave is going to be; maybe reinforcement learning will dig in and get some real traction, and everyone will lurch across and start applying reinforcement learning to everything. But in terms of the field of Machine Learning, what an amazing time to be going into the field. We're just at the beginning of this. My son is at university doing Computer Science and he's interested in Machine Learning, and I think, well, that's great, there's a whole career to be built in this, because we're just at the beginning of this. >> But just a tiny follow-up. So you said that you tried neural networks in some applications and they didn't work. I'm really happy to hear that, but can you characterize what kinds of settings you expect neural networks to not do well in, where a PGM would be the solution, in an unsupervised scenario? >> Sure. An example would be the skill-matching example on Xbox; again, it's a chapter in the book. What are your assumptions? You've got some players, they have some skill, and you have some uncertainty in their skills, which we describe by the simplest possible [inaudible], the Gaussian distribution. And then they play against each other, and you have some model for how their performance varies, because the stronger player will sometimes lose to the weaker player if they didn't play too well in that particular game. And so we model all of that. And in fact, if you take that model and look at just the maximum likelihood limit, where we throw away the uncertainty, you come up with something called Elo, which is the standard rating method used in chess worldwide. So that's a model which is appropriate to that particular application. So again, it all comes back to the fundamental point, which is that there isn't such a thing as a universal algorithm; there's a mathematical theorem that proves that. It's about building the right kind of solution that's tailored to your problem. So you'll see some examples of that in the book.
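For reference, here is a minimal sketch of the standard Elo update that the maximum-likelihood limit of the skill model reduces to. The ratings and K-factor below are illustrative; the full probabilistic model described in the book also keeps a Gaussian uncertainty for each player's skill, which is not shown here.

```python
# Minimal sketch of the standard Elo rating update (the point-estimate limit
# of the probabilistic skill model discussed above). Numbers are illustrative.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Updated ratings after one game. score_a: 1.0 win, 0.5 draw, 0.0 loss."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# The stronger player winning shifts the ratings only a little;
# an upset (the weaker player winning) shifts them much more.
print(elo_update(1600, 1400, score_a=1.0))
print(elo_update(1600, 1400, score_a=0.0))
```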
>> Okay, we'll take one last question here. Second row, yeah. >> Hi. This is [inaudible] from Ministry of Technology, Delhi, and I'm a PhD student there. It is really heartening to see you talking about Probabilistic Graphical Models. I work in Probabilistic Graphical Models, and at times these days it becomes scary whether I'm working in the right area, when the whole world is talking about Deep Learning, so it gives me a sense of security that a person like you is propagating this. Thanks for that. So, the question is basically this: you shift the onus from the algorithms to the probabilistic models and assumptions. But then, when you walk into Probabilistic Graphical Models, another question arises: how do you choose? I think the same problem gets shifted to which approximate inference algorithm to choose. If I work in structured prediction, there is an ample number of approximate inference techniques, from variational inference to MCMC. There is some understanding of that, but I think the problem has just shifted to which approximate inference algorithm you will use for the Probabilistic Graphical Model. That is the first question, and the second is at a higher level. So you say that there is no single algorithm as such and you have to adapt: you have to see the problem, understand the assumptions, and then see which algorithms work there. On the contrary, the philosophy of, if I understand it correctly, Pedro Domingos and his Master Algorithm is that one algorithm will work for almost everything. So what are your thoughts on that? >> Okay, yeah. First of all, I don't want to give the impression that there are Graphical Models over here and Neural Nets over there, and you choose one or you choose the other. Deep Learning is the ability to train these deep, hierarchical, layered structures, and you might describe your problem by a graphical model where maybe one of those conditional probabilities is a deep neural net. So these are not alternatives; I think of the probabilistic framework and the graphical model, again, more as a compass to guide your way around the world of Machine Learning. Deep Learning is a very powerful technique, it's cropping up in many, many different places, and it will be used a lot. So I don't want to characterize them as alternatives, but I do like graphical models as a general framework for describing models. So, sorry, the second part of the question was the... Yeah, okay. >> [inaudible]. >> Just for lack of time, I've said very little about approximate inference. Again, the Model-Based Machine Learning book guides you through some of the inference methods that we're using in that context. And again, in real-world applications you make approximations, and with those approximations, you know, you might have a complicated multi-modal distribution that you approximate by a Gaussian, which is uni-modal, and you're losing some of the uncertainty, some of the ambiguity there, and that may or may not be important. So part of the challenge, when you don't get the results you need, is diagnosing where the problem has gone wrong. Making bad or inappropriate assumptions is just one of the places. If somebody hands you rubbish data that isn't what it claims to be, then you can get bad results even if your assumptions are correct. And the same is true of the inference algorithm; that's a whole, very complex world in its own right. And in essence the goal of Infer.NET is to hide that from you. As the domain expert, you can focus on the prior knowledge that you have because you're an expert in medical imaging, or because you're an oncologist, or whatever, and you don't have to know anything about inference. And the ultimate aim of Infer.NET is that the inference will be entirely automatic; we're not there yet, but we've made progress. >> Okay, I think we should wrap up for now because we all need that kind of [inaudible] time. I'll request [inaudible] to say a vote of thanks for Chris, yeah. Thank you very much. Thank you Chris.
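As a tiny illustration of the approximation point in that answer, here is a hedged sketch: the bimodal "posterior" below is made up, and the "inference" is just moment matching on samples, but it shows how approximating a multi-modal distribution with a single Gaussian blurs the two modes into one broad bump and loses that ambiguity.

```python
# Minimal sketch: approximating a made-up two-mode "posterior" with a single
# Gaussian by moment matching, to show the ambiguity that gets lost.
import numpy as np

rng = np.random.default_rng(1)

# Samples from a bimodal distribution: half of the mass near -3, half near +3.
samples = np.concatenate([
    rng.normal(-3.0, 0.5, size=5000),
    rng.normal(+3.0, 0.5, size=5000),
])

mu, sigma = samples.mean(), samples.std()
print(f"single-Gaussian approximation: mean = {mu:.2f}, std = {sigma:.2f}")
# The fitted mean sits near 0, where the true distribution has almost no mass,
# and one broad std (about 3) hides the fact that there were two distinct modes.
```

Whether that lost structure matters depends on the application, which is why diagnosing the data, the assumptions, and the inference separately, as described above, is so useful.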
Info
Channel: Microsoft Research
Views: 13,519
Keywords: microsoft research
Id: 8a7wBLg5Q8U
Length: 67min 29sec (4049 seconds)
Published: Mon Feb 26 2018