Professor Anna Choromanska: Welcome to the seminar series on Modern Artificial Intelligence at the NYU Tandon School of Engineering. As some of you may already know, this is a series we launched last year to bring together faculty, students, and researchers to discuss the most important trends in the world of AI. Our invited speakers are world-renowned experts whose research is making an immense impact on the development of new machine learning techniques and technologies. In the process, they are helping to build a better, smarter, more connected world, and we are proud to have them here. The talks are live-streamed and viewed around the globe, helping to spread the word about the amazing work going on in the AI community. I would like to thank Dean Jelena Kovačević, as well as my own Department of Electrical and Computer Engineering and its chair, Professor Ivan Selesnick, for supporting the series and graciously hosting our esteemed guests. I would also like to thank the media team and my right hand, Raquel Thompson, for their hard work in preparing the seminar. I'm sure that many of you here know and enjoy Netflix. Our speaker today, Tony Jebara, is Director of Machine Learning at the company. In that capacity he has helped spur new research leading to nonlinear, probabilistic, and deep learning approaches, resulting in valuable rankings of movies and TV shows that speak to the tastes of individual Netflix users. He is currently working on integrating causality and fairness into many of Netflix's machine learning and personalization systems. Outside of Netflix, he is a professor on leave from Columbia University and has published more than 100 peer-reviewed papers in leading conferences and journals across machine learning, computer vision, social networks, and recommendation. His work has been recognized with Best Paper Awards from the International Conference on Machine Learning and from the Pattern Recognition Society. He is the author of the book Machine Learning: Discriminative and Generative and the recipient of an NSF CAREER Award, as well as faculty awards from Google, Yahoo, and IBM. He has co-founded and advised multiple startup companies in the domain of artificial intelligence and has served as general chair and program chair of the International Conference on Machine Learning. In 2006 he co-founded the New York Academy of Sciences Machine Learning Symposium and has served on its steering committee since then. On a personal note, Tony was my PhD advisor and the one who introduced me to the world of AI and machine learning, and today I pay tribute to this excellent man and celebrate his research and his influence on my life and the lives of the other PhD students, postdocs, researchers, and others whom he inspired. I know we are all eager to hear his talk, "Machine Learning for Personalization." So without further ado, let's welcome him to the stage.

Tony Jebara: Thank you everyone, and thank you Anna for a very warm introduction. I'm really excited to be here and see all the great work happening in your team and across NYU. Great! Let me talk about machine learning and personalization at Netflix. One big theme we like to think about is that Netflix really isn't offering one product; it's offering hundreds of millions of products, because each person gets a unique flavor of the product.
We personalize everything about the experience. If you look at this homepage, this is one person's homepage; it is really tuned for them in many, many different ways, and all those ways are powered by machine learning. So, for example, we rank all the movies and TV shows that we have available for you in a personalized way, ranked from 1 to n uniquely for each member. There's a page-generation algorithm which figures out how to lay out your page and cluster the images, the movies, and the TV shows into sensible rows, again for you, in rows you can understand better because they are coherent. There's also a way we do personalized promotion: we say, here's a new title, it just launched on the service, we don't know very much about it, but we think this is something you would be interested in, as promotional material. We also change the way we show the images of each movie and TV show for you using machine learning, because not everyone wants to see the same photo of the same actor or the same action shot. We also do search in a personalized, machine-learning way; we message users with, let's say, pop-up messages and push notifications in a personalized way. The marketing we show is personalized as well: when we show someone a Facebook ad saying, oh, here's a new show coming up next week, that is also a personalized ad. And we do even other types of predictions that are very machine-learning-focused and personalized as well.

So let's talk about one aspect of machine learning and personalization, which is ranking, and a variant of it, collaborative ranking. The idea here is just to ask: what will you like? We're trying to predict, of all the tens of thousands of TV shows and movies, which ones you would like the most, and rank them. This is based off of a paper we published last year, but before we jump into the details of that paper, let's take a time trip back to 2006. How many people remember this challenge? Okay, so yeah, a few folks remember this; this is now getting to be kind of ancient history. The idea there was that Netflix was sharing a matrix with the broader research community, a matrix of all users and all movies at the time. These are actually newer movies, just for illustration purposes. We had the star rating of each user for each movie, so some users really liked a movie and gave it five stars, some users didn't like a movie and gave it three, and the idea was to predict the hidden ratings that you couldn't see in this matrix. It turns out star ratings were interesting, but they're aspirational, and we decided to move away from star ratings. If you look now, you can't see star ratings on Netflix; it's thumbs up, thumbs down. In fact, you really want to focus not on the star ratings but on what people actually watched: did they watch and play that movie or TV show? Star ratings are more aspirational; people give many stars to things they think should be award-worthy, but not necessarily things they want to watch. For example, Citizen Kane gets five stars, but no one wants to watch Citizen Kane. So instead we want to look at what people are really watching and use that to make a prediction, and that's now a flavor of this matrix where you can have binary values instead of star ratings,
for each user and each movie, saying whether they really watched it. Classically, in 2006, this was viewed as a matrix decomposition problem: you have this ratings matrix R, and a community quickly grew around collaborative ranking with matrix decomposition and matrix factorization techniques. The idea was to take this matrix R and break it up as a product of two skinny matrices, so U and M multiply to create this matrix R, and finding these two low-rank matrices was the goal. That got us pretty far. But of course that's linear factorization, and you can go beyond linear by using techniques from deep learning and neural networks. A natural extension is to say: let's not just go from X, which is one vector inside the rating matrix, to some low-dimensional Z, which is a vector inside one of those skinny matrices, with just one linear mapping. Let's go through multiple linear mappings and do a squashing with sigmoids each time. That's the nonlinear version of matrix factorization, and then you reconstruct again from Z through these other linear mappings with squashing functions to get back an approximate X. So this is another way of doing matrix decomposition down to a low dimension, and the goal is to minimize, let's say, the sum squared error on the reconstruction. But you pinch the dimensionality in the middle and go down to something lower-dimensional, just like the matrix decomposition was pinching the dimension down to something like 50 dimensions in the middle. That's called the code in the middle, summarizing this user's view history, which might be tens of thousands of dimensions, down to about 50 dimensions, just like the matrix decomposition was going from a 10,000-dimensional view history down to 50-dimensional skinny matrices. And one thing you can do is go from the squared-error version over here to a probabilistic output. The reason is, when you're trying to reconstruct the view history, you don't want to just say: here's a point estimate, this is the reconstruction of the view history. You want to put a distribution around that view history, because the goal is to say: well, I'm not going to perfectly reconstruct the view history, I want to predict where the view history will really evolve to, and a distribution will give you uncertainty about which parts are being properly captured and which parts are hard to predict. So you can move away from traditional autoencoders to a technique called variational autoencoders, published in 2014 by Diederik Kingma and Max Welling; that's putting a Gaussian on the output, and there's also a Gaussian on the latent code in the middle, and you use a slightly different optimization approach, but it's fundamentally the same concept: nonlinear compression of your input space. And we extended this by saying: let's not put a Gaussian on the output, let's put a multinomial distribution on the output. The idea there is, we're trying to predict what you're going to watch next; think of these as non-negative values (you can't negatively recommend a movie), and your recommendation across all movies should sum to one. This forces the amount of recommendation not to say: oh, this person will watch anything and everything, and put very high values everywhere; really we're saying there's a limit on the user's resources and attention. This was published last year at WWW, the Web Conference.
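To make the progression from matrix factorization to the multinomial variational autoencoder concrete, here is a minimal sketch in PyTorch. The layer sizes, names, and the exact form of the loss below are illustrative assumptions, not the published model:

```python
# Minimal sketch of a multinomial-likelihood VAE for implicit feedback,
# in the spirit of the idea described above. Dimensions and names are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultVAE(nn.Module):
    def __init__(self, n_items, hidden=600, latent=50):
        super().__init__()
        self.encoder = nn.Linear(n_items, hidden)
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_items))

    def forward(self, x):
        h = torch.tanh(self.encoder(x))           # nonlinear "squashing"
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar        # logits over all items

def elbo_loss(logits, x, mu, logvar):
    # Multinomial log-likelihood: log-softmax puts a distribution over items
    # that sums to one, modeling the user's limited attention budget.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -(log_probs * x).sum(dim=-1)
    # KL between the Gaussian posterior q(z|x) and the standard normal prior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (nll + kl).mean()

# Usage: x is a batch of binary play-history vectors (training loop omitted);
# rank the softmax scores of unwatched items to produce recommendations.
```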
So the idea is we put a multinomial on this output, and we keep a Gaussian on the middle layer, and that changes the probabilistic formulation a little bit. But the idea is that now, given your input with some zeros and ones in it, we're going to fill in a guess with fractional outputs that go between 0 and 1 and sum to 1, which places bets on other movies and TV shows. Those are the ones we're going to rank and say: that's the best one for you, and then the second one, and so forth. We don't want to tell you to watch the stuff you've already watched in the past. So how well does this do? We moved from linear models to deep learning models, from deep learning to probabilistic deep learning, and then from that probabilistic deep learning to proper multinomial outputs, and that showed us gains all along that sequence of improvements. Here are two data sets: the Netflix 2006 data set and the MovieLens 20 Million data set, and you can see traditional techniques here, like WMF, the weighted matrix factorization from my first slide. This is the recall, and this is the normalized discounted cumulative gain; you want these numbers to be bigger, and you can see how the classical techniques are performing. But when you switch to, let's say, a deterministic autoencoder, a nonlinear version, you see improvements on the MovieLens data set and on the Netflix data set, away from these linear models, basically SLIM and WMF and so on. Then, if you also go to a variational technique and do the proper multinomial output, you see a further gain, and that outperforms the deterministic autoencoder. So going from linear to nonlinear buys you something, and then going from nonlinear to probabilistic nonlinear gives you further improvements. I'll skip the details of how to optimize this technique, but it's basically a variational method where you replace the likelihood with the evidence lower bound, the ELBO.
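As an aside on the two metrics in that comparison, here is one common way Recall@k and NDCG@k are computed for a single user's held-out plays (a sketch; the exact evaluation protocol in the paper may differ):

```python
# Illustrative computation of Recall@k and NDCG@k for one user.
import numpy as np

def recall_at_k(ranked_items, held_out, k):
    hits = sum(1 for item in ranked_items[:k] if item in held_out)
    return hits / min(k, len(held_out))

def ndcg_at_k(ranked_items, held_out, k):
    # Discounted gain: a hit at rank i contributes 1/log2(i+2), 0-indexed.
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in held_out)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(held_out))))
    return dcg / ideal

ranked = [3, 7, 1, 9, 4]     # model's top-5 items for a user (toy data)
held_out = {7, 4, 8}         # items the user actually watched
print(recall_at_k(ranked, held_out, 5), ndcg_at_k(ranked, held_out, 5))
```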
Great, so this is ranking the titles for you, right? But of course, once you do the ranking, this is a prediction, and one thing we've learned at Netflix is that a great prediction is not always a great action. This is kind of a subtlety: machine learning is great at predicting. You give it inputs and outputs and it learns a mapping from X to Y. But usually you don't want to just have accurate Y's; you want to take those Y's and do something with them. There's an actual action that you will perform, and that action is really what you're trying to make beneficial and influential. We learned this when we started messaging users. Ranking is fairly neutral, because you're predicting what the user will watch next, but you can imagine there are also causal aspects to ranking. Let's say I really wanted to watch Narcos tomorrow, and that was my next most desired title. If the Netflix page ranks Narcos number one and floods me with images and trailers and says, watch Narcos, I might actually change my mind and say: you know what, this is creepy, I'm not going to watch Narcos now. So prediction is valuable, but intervention is what we really want to understand, and in messaging this became really clear, because when you send someone messages you're actually intervening, kind of bothering them in a way, and influencing their behavior even more directly than when you're just presenting a ranking when they turn on Netflix.

So here's what I mean by causal learning, and what I'm hoping we can see machine learning evolve to in the future. We're not just learning machine learning models to make good predictions; we're learning machine learning models to take actions in the real world. The idea is we don't want to just be accurate predictors; we want to learn causality and understand causal mechanisms in the real world. For example, here's a picture. Imagine we were an airline company pricing flights for our customers, and we look at a time series: these are the number of tickets we sold over time, and these were the prices we were selling those tickets at. So we get this nice time series, and let's say you're a data scientist working for this airline. If you plotted this time series of the number of tickets sold against the prices of those tickets, you would get this type of scatter plot: here are the prices, and here the number of flights sold. And you'd notice this correlation: oh, it looks like as prices increase, the number of sales increases. So should you run to your boss and say: I'm going to make us tons of money, let's just increase our prices and we'll see sales go up? I'm pretty sure your boss would be very disappointed if you did that, if you built a very simple input/output machine, learned this correlation, and said: that's the action I want to take. And it's because there are underlying causality issues and confounders here, and these types of things affect machine learning all the time; even in this toy problem we're already seeing them. So the idea is there is some hidden confounder. We're trying to predict how X moves Y; that's 99% of machine learning. It gets much more complicated than 2D, but that's mostly what we're doing with deep learning or any of these methods. But there are other variables we don't always know about. For instance, there are holidays.
There are confounders like a conference happening in town. These things are pushing demand up and down and making people more willing to pay. I'm much more willing to pay to travel during the holidays to see family and relatives, and I'm willing to absorb that expensive ticket price. It's not because the price went up that I'm increasing my demand; it's because I have to travel for the holidays or for a conference. So there's a whole bunch of stuff happening like this in the real world which is actually the cause of both the price and the demand going up. There's only a little bit of, let's say, causality from price to demand, and there's a lot of causality coming from these confounder arrows over here on the right. You have to actually model this properly, or else you're learning the wrong relationship. In fact, the relationship between X and Y might not be increasing in that direction; it might actually be decreasing, but because we didn't model all those relationships with our machine learning system, we missed out on the real relationship.

We learned about this the hard way when we started doing messaging. When we message our members, we're trying to figure out whom to message and what message to send them. Should we tell them about A Series of Unfortunate Events, or Homeward Bound, or some other title they should watch? Which channel should we use to message you: should we send you a push notification, or an email, or a text message? Let's say all of those decisions are put into this vector X. This is an input vector of all the possible decisions and all the features we know about you, and Y is the resulting amount of streaming, the minutes you watched. We're trying to learn a good relationship between X and Y, right? And you can use machine learning to do that. Unfortunately, there are hidden confounders, and if you just do this, you will learn the wrong relationship between all these variables, about which message, which channel, and whom to send the message to, and the amount of streaming you will cause to increase. There are some simple confounders we already know about in this real-world problem, like the time of week. This affects how active everyone is; this is a confounder. People are just more active at certain times of the day and certain times of the week, and they read more messages and more emails, and they also watch more Netflix, just because that's when they're not sleeping or at work and are able to do those things. So the time of day is a confounder which is making X and Y go up together and giving you a false relationship between X and Y in a causal sense, because C is moving up and down and triggering both X and Y. People just happen to open more emails when they watch more Netflix, because those activities happen together, and that leads you to learn the wrong relationship.
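Here is a tiny simulation of that story, with made-up coefficients: a hidden confounder drives both price and demand, so the naive regression slope comes out positive even though the true direct effect of price is negative:

```python
# Toy illustration (assumed numbers) of confounding: C raises both price X
# and demand Y, so regressing Y on X learns the wrong sign.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
c = rng.normal(size=n)                        # hidden confounder (holidays)
x = 2.0 * c + rng.normal(size=n)              # price is pushed up by C
y = -0.5 * x + 3.0 * c + rng.normal(size=n)   # true price effect is -0.5

naive_slope = np.cov(x, y)[0, 1] / np.var(x)
print(f"naive regression slope: {naive_slope:+.2f} (true effect is -0.50)")
# Prints roughly +0.70: the correlation is positive, so "raise prices to
# raise sales" would be exactly the wrong conclusion.
```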
So how do we fix this? We've tried this on various complicated machine learning models, but even in a simple situation you can fix things with linear modeling. The idea is that in addition to your inputs, your outputs, and your hidden confounders, which you don't get to know, you introduce some other synthetic perturbation called Z. Think of this as some randomized variable that you control, which lets you send messages and not send messages randomly for some users. So flip a coin and say: today I'm just going to stop this message from going out, or today I'm going to double this user's messages, or today I'm going to halve this user's messages. Z is some synthetic variable that wiggles what actually happens with the messages but isn't changing the watching behavior directly; we're not turning off Netflix and saying, oh, you can't watch today. X is all the stuff we know about the user: the messaging history, the time of day, the day of week, the country, the past streaming. Y is the number of minutes they're going to watch, let's say after the message, or tomorrow. And Z is some randomized perturbation. The simple approach is to do this with two-stage least squares. Instead of doing what 99% of machine learning does, which is just learning a function that goes from X to Y, you're going to do two machine learning stages. The first stage learns how to go from Z to X: you ask, how do I predict X from Z, and then you reconstruct that X using just the Z; you get this curly X, the reconstructed X. Then you say: now let me go from curly X and see how curly X predicts Y, and that gives you another function that takes the curly X's and tells you how those predict Y. So you learn two machine learning models, one that goes from Z to X and one that goes from reconstructed X to Y, instead of just X to Y. The nice thing is, this is now going to be a causal model, as opposed to the purely correlational, predictive machine learning model you would have learned with just one stage. It turns out this has a long history in econometrics, and you can see people doing things like this to prove that cigarettes cause cancer. How do we prove that cigarettes cause cancer? Well, if it's just an X and Y relationship, some people argued (like the cigarette companies): oh, the people who are going to get cancer later in life crave cigarettes more than average people, so it's not that the cigarettes caused the cancer, it's the cancer that caused the cigarette smoking, not vice versa. So how do you prove the causal relationship? You introduce Z, which turned out to be the cigarette taxes that states across the country were introducing with different levels of randomization. Those changed the behavior, and then you could say there's a cause here, and the cause goes from smoking to cancer, not vice versa.
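And here is a minimal sketch of the two-stage least squares recipe just described, on synthetic data with a randomized instrument Z (the coefficients are assumptions for illustration):

```python
# Two-stage least squares with a randomized instrument Z: Z perturbs X
# but has no direct path to Y. Data-generating numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
c = rng.normal(size=n)                        # hidden confounder
z = rng.normal(size=n)                        # randomized perturbation
x = 2.0 * c + 1.0 * z + rng.normal(size=n)    # e.g., messages actually sent
y = -0.5 * x + 3.0 * c + rng.normal(size=n)   # true causal effect is -0.5

# Stage 1: predict X from Z only; keep the reconstructed "curly X".
beta1 = np.cov(z, x)[0, 1] / np.var(z)
x_hat = beta1 * z
# Stage 2: regress Y on the reconstructed X.
beta2 = np.cov(x_hat, y)[0, 1] / np.var(x_hat)

naive = np.cov(x, y)[0, 1] / np.var(x)
print(f"one-stage (correlational): {naive:+.2f}")   # wrong sign, ~+0.5
print(f"two-stage (causal):        {beta2:+.2f}")   # close to -0.50
```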
So just as a quick summary (I'm not showing exact numbers here) of how our different channels work: if you just did a single, one-stage machine learning model and learned a simple linear regression of how email, push, or in-app notification (those are three channels to reach the user) influence the minutes they will watch, you would learn that sending an email reduces minutes and goes negative, sending a push notification vastly reduces the amount of watching, and sending an in-app notification significantly increases the amount of watching. Those are all the wrong conclusions, and if you were to present this to your boss as a data scientist, he would say: wait a minute, this is completely silly. How can sending an email make someone watch less? How could sending a push notification make someone watch less? It's because you're just doing one-stage machine learning from X to Y, and there are all sorts of confounders you have to fix. If you do the two-stage version and learn that second function I told you about, it actually learns the effects, and they're all positive, all increasing, and the best one is actually the push notification. So you would have drawn exactly the wrong conclusion and told your boss: let's send in-app notifications and never send an email or a push, instead of: we should send push notifications. So that's one causal aspect of machine learning that I think we should invest a little bit more into.

All right. So we talked about predictive machine learning, and we talked about what happens when you intervene, the causal aspect of machine learning. Now let's talk also about the explanatory, interpretable aspects of machine learning, because a lot of the time the machine learning AI will do things, but users want to know why; they don't want to just get great predictions. This is a blog post we wrote on Medium in 2017, and there's a follow-up paper as well. Here we're using this concept of explaining your predictions in a system called image personalization, and that's how we change the images on our homepage. If you weren't thinking about images, this is what the Netflix homepage would look like: there are no images here, these are the best titles for you, ranked and organized nicely, but it's not very convincing, right? So another important decision, once we've ranked all the titles and figured out where to place them, is what image we show for each title. We're going to use machine learning to also personalize this choice, and here are, let's say, nine possible choices for the TV show Stranger Things. Which one is the best one for you? We're going to approach this with machine learning, of course, but before I jump into how we do this exactly, let's think again about traditional machine learning. Traditional machine learning involves batch learning, where we have two steps: we take a world of data, a massive data set, and we also have a massive number of hypotheses, or models, or parameterized spaces of models, and out of that space of models we're trying to find the best one, the one that agrees with the data the most. We do this with statistical efficiency and computational efficiency, where we can make guarantees that we're learning real, generalizable models, and we do it computationally efficiently because we can't try every single model on the data in a brute-force, enumerative way; we have algorithms like SGD and so on that find the best model. So that's batch machine learning, and there's a lot of that happening in industry and obviously in academia. But here are the downsides of batch machine learning. If you have many, many users, hundreds of millions of users over here, and this is a timeline, and you are a company, this is typically how the machine learning process goes: you collect a lot of data about the users; that takes, you know, maybe several months. Then your scientists work, they learn a model, they try out different modeling ideas, they find a good model. Once they've found that good model, you spend some time productionizing it so it can actually go out and run in the real world;
that takes a lot of engineering work. And then you do this thing called an A/B test. How many people have heard of A/B tests? Okay, so at least a good number of you. An A/B test is a very simple concept: you take, let's say, half your users, call them population A, and the other half, population B. To A you continue giving the usual Netflix experience, and to B you show the outputs of this new machine learning model. Then you run this test for a few weeks or a few months, and then you ask who is actually watching more, population A or population B, or who is retaining better, who is renewing their subscription, population A or population B. If population B wins, then you say: okay, that's great, let's release this model and roll it out to the rest of the Netflix subscriber base. Okay, so what's the problem with this approach?

[Audience answer.]

Tony: Okay, the future may not be like the past; that's definitely a problem that you want to handle, and online learning, which we'll get to, will handle it. But what's another problem as well?

[Audience answer.]

Tony: Well, we learned the model on everybody over here, and we randomized A and B so that there isn't too much dependency on the population, but it is true that if you didn't properly randomize, you would have a problem there. Yeah?

[Another question from the audience.]

Tony: The data may have changed by the time you get to the A/B test as well; all of these are valuable points. And by the time you get to the A/B test, that is many months. It would be okay if the data had changed but it was only one day later; this has taken many, many months, so the data has changed. And also, for all this time, all your subscribers were getting really bad models until you switched them over here, so you've got this huge amount of regret as well, where you say: wait a minute, I could have been showing much better models all along, and instead I was wasting all this time. So online learning handles a lot of what you just pointed to. It's going to save time by not waiting until we get all the data perfectly, then do all the modeling to get the perfect model, then run a perfect A/B test and see who is really doing better, A or B. You want to do this all online, and you also want to handle the fact that the data might change. The idea is to interleave learning with data collection: don't just wait for all the data to be available; explore while also taking better actions, so you don't have too much regret. And the approach many people propose is this concept called a multi-armed bandit. You can think of it as walking into a casino with multiple slot machines. You want to pick the slot machine which is going to pay you a lot of money, but you don't know which one has the best reward distribution, so you've got to try a few. You can play one arm at a time until you figure out which one has the best payout, so you have to handle this kind of exploration and its cost before you find the best arm. Bandits, you could think of as a very simple flavor of reinforcement learning: you've got a learner, the learner takes actions, the environment then gives back feedback, a reward, saying you got some money or you didn't get any money. That's the simplest setting, and the goal is to maximize the cumulative reward as you play this game.
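As a toy illustration of that casino setting, here is a simple bandit loop with made-up payout probabilities; epsilon-greedy is used here just as a stand-in for the exploration strategy:

```python
# Tiny slot-machine simulation: pull arms, observe rewards, maximize
# cumulative reward. Payout probabilities are made up.
import numpy as np

rng = np.random.default_rng(2)
payouts = [0.2, 0.5, 0.7]            # true (unknown) reward probabilities
pulls = np.zeros(3)
wins = np.zeros(3)
total_reward = 0

for t in range(5_000):
    if rng.random() < 0.1:           # explore: try a random arm
        arm = int(rng.integers(3))
    else:                            # exploit: best empirical payout so far
        arm = int(np.argmax(wins / np.maximum(pulls, 1)))
    reward = rng.random() < payouts[arm]
    pulls[arm] += 1
    wins[arm] += reward
    total_reward += reward

print("pulls per arm:", pulls, "cumulative reward:", total_reward)
# Most pulls end up on the best arm; the exploration pulls are the price
# paid to discover it.
```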
Okay. There's another flavor called the contextual bandit, where the environment also provides a context, a feature vector that describes, let's say, this particular user or this particular opportunity, and then the learner needs to choose the best action for that particular context, that particular user. Then the environment gives a reward, and you again have to maximize the cumulative reward; so there's this additional concept called the context. How is this different from traditional supervised learning? In traditional supervised learning you've got input features X, which is like the context we're talking about; the output is a predicted label; you say, here's my predicted label for this input, and then you see whether you agreed with the actual label. You get to see the truth, the actual label, after you make a prediction. In a contextual bandit, you've got an input context, some vector X describing the user or the particular opportunity today. The action is going to be some function of X; it's going to be A, which is, let's say, show this image or show that image. Then you're going to get a reward. But here's the subtle difference: you don't get to know what the best answer was. In supervised learning you get to see the actual best answer, the best image for this person; here we're not going to get to see the best image, we're just going to try out an image and learn that it worked or didn't work. So you get less information when you switch to contextual bandits than when you have supervised learning. Here's an example: this is an input image, the label you predicted is cat, and then you get shown the actual true label, that it was a dog. You know you made a mistake, but you're also shown the actual answer, so you know how to correct very precisely. In the contextual bandit, you're shown this image, you predict cat, and you're told no. So then you try again; obviously here you also get to try again, but in the supervised world you knew the right answer as soon as you'd been shown the label. In the contextual bandit world, you still don't know that this is a dog; you just know that it's not a cat, so you have to try fox, and you keep going. It's much faster to converge in the supervised learning world, but over here you're still making a lot of mistakes, because you're getting binary feedback rather than the actual true label. Okay, so we're going to have this problem with our artwork, because the users aren't going to say: hey, this is the best image for me; for the show Stranger Things, show me this image, I like that image. No one's going to do that. We're just going to try showing images and see whether they work or not on different users. So the context is a user's view history and their country; the action is the choice of image we're going to take; and the reward is: did they click and play or not? And, you know, clicking and playing and watching enough, let's say, or enjoying it. What does that mean? It means they should engage with it, they shouldn't abandon it too soon, and they should enjoy it, maybe give it a thumbs up once they're done. So what is our reward really? We can think of it as how many times people watched, but it's really how many times people watched given the number of times we tried. We don't want to try a hundred times to get one watch; we want to try as few times as possible to get as many watches as possible.
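The difference in feedback is easy to see in code; this toy contrasts the two settings using the cat/dog example from the slide (entirely illustrative):

```python
# Full-information (supervised) feedback vs. bandit feedback, as a toy.
labels = {"img1": "dog"}

# Supervised learning: after predicting, the true label is revealed,
# so one mistake fully corrects you.
prediction = "cat"
true_label = labels["img1"]          # "dog" is revealed
print(f"predicted {prediction}, truth was {true_label}")

# Bandit feedback: you only learn whether your chosen action worked;
# the truth stays hidden, so you may need several tries.
def bandit_feedback(choice):
    return choice == labels["img1"]  # binary reward only

for guess in ["cat", "fox", "dog"]:
    print(guess, "->", "reward" if bandit_feedback(guess) else "no reward")
```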
So here we have three different users, and they're going to have all these different shows, and we're also going to show them Altered Carbon. Altered Carbon is being shown over here, here, and here. It turns out that for this image, we showed it to three different users, and our take rate isn't necessarily going to be how often it was played out of all the impressions shown, but how often this image worked for the number of users that were exposed to it. So the take rate here would be 1 in 3: we tried three times, and one try led to a play. We're really thinking of this bandit as trying on different users, not on each impression; it's a user-based pulling of the arm rather than session-based randomized arm pulling. Okay. So the algorithm we're going to consider here is Thompson sampling for multi-armed bandits, and I'm showing the simplest version of it now. The idea is that we're not going to pick the best single model right away, which is what happens with batch machine learning; we're not going to collect tons of data and then say, here's the best model. We're going to have a distribution over all models, call it Q(θ), and we slowly want to eliminate the models that aren't working until we find the good models. The way this works is: you sample from this distribution (I called it Q, but here it's really the posterior), sample a model given everything you know so far, all the data; then you observe an X, which is the actual context; then you pick the action that would give the most reward given your modeling so far, you take that action, and you collect the real reward from the environment. Then you grow your data set, saying: okay, I saw this context, I took this action, and I got this reward; I add that, and I change my modeling distribution. So at first we start off with a uniform distribution over all possible models, and over time, as we try things out on different users, we start eliminating some models and reducing their probabilities: that didn't work, that didn't work, some worked a little bit. Eventually, as you run this type of experiment, you're changing this distribution so that it peaks and puts a lot of probability on the best model. Rather than doing tons of data collection and tons of exploration and A/B testing and so forth, you interleave it all, and then you slowly watch the best model emerge. That's the idea. So here's an example where we do it without context, so X is just the empty vector. This is how we picked the best image overall for the show Unbreakable Kimmy Schmidt. We ran a simple bandit, not a contextual bandit, and it found that this was the best possible image. This is much better than showing random images uniformly for a long time until you figure out which is best: you do adaptive Thompson sampling, and it quickly converges onto this image with much less regret; logarithmic regret versus linear regret, if you want to get into the theory precisely. The traditional approach will have linear regret, let's say.
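Here is a minimal sketch of that context-free Thompson sampling loop, with Beta posteriors over each image's take rate; the take rates below are made-up numbers, not real Netflix data:

```python
# Thompson sampling over candidate images (no context): keep a Beta
# posterior over each image's take rate, sample, show the winner, update.
import numpy as np

rng = np.random.default_rng(3)
true_take_rates = [0.03, 0.05, 0.11]      # unknown to the algorithm
alpha = np.ones(3)                        # Beta successes + 1
beta = np.ones(3)                         # Beta failures + 1

for impression in range(20_000):
    samples = rng.beta(alpha, beta)       # sample a model of each arm
    image = int(np.argmax(samples))       # act greedily on the sample
    played = rng.random() < true_take_rates[image]
    alpha[image] += played                # update the posterior
    beta[image] += 1 - played

print("posterior mean take rates:", alpha / (alpha + beta))
# Probability mass concentrates on the best image over time, which is
# what gives logarithmic rather than linear regret.
```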
Now, if you want to do something which is more personalized, you say: okay, let me put into X something more specific about that user, like their view history, their country, and so on. Then we can say: okay, for users with a lot of romance in their view history (this user watches romantic TV shows), this is the image they should be shown for Good Will Hunting; that's what the bandit will learn. For a user who watches a lot of comedies, as we do this exploration, the models start to figure out: oh, show them this image for Good Will Hunting. It's figuring this out again by doing this explore/exploit Thompson sampling. It's not studying the pixels here and saying: oh, this is a funny comedian, Robin Williams, so that goes well with comedic view histories, and this is a romantic image that goes well with a romantic view history. It's literally doing this only as a multi-armed bandit, without any pixel knowledge. And here's another example, where it figures out that for users who watch a lot of Uma Thurman movies, we should show this picture for Pulp Fiction, and for users who watch a lot of John Travolta, we should show this picture for Pulp Fiction; the same movie, but it figures out that this is going to be more enticing for those users. These are real examples for real users who have that in their view history.

So how do we know how well this works? Once we have this system, you want to estimate its performance. You could go out and run an A/B test, but that's dangerous, because you might be experimenting on millions of users. So we have to estimate the offline performance, and there's a technique by Lihong Li and others called replay; I'll describe how we use it to evaluate how well the system will do before we try it out for real. So this is how we evaluate how well the system performs. We have a bunch of users, and we have two possible images for the show Disenchantment. We randomly flipped the images across these six users: these three got this image, and these three users got that image. That's what we logged during some exploration in the past. Now we have a new algorithm, the Thompson sampling contextual bandit algorithm I just showed you, and it's going to make some choices. We want to estimate: what's the take rate for this new algorithm? Here are the actual choices this algorithm makes, and it agrees sometimes with what happened in the logged history, but sometimes it says: oh, I really wanted to switch from this image to this one for this user, and for this other user I want to switch to this one, for instance. Then we see where there were actual plays: these users played Disenchantment. So what do we do? We only count the situations where the algorithm agreed with what happened in real life, and we remove the other situations: we say, okay, the algorithm agreed with real life here, here, and here, and among those coincidences we count the plays. So initially the logged policy had a take rate of, sorry, 33%, because only two of the six impressions led to plays, but among the impressions where the new algorithm coincides with the log, the take rate is 66%, so it's a doubling. 66% is the estimated take rate for the new algorithm, as opposed to the previous situation, where we had a one-third take rate. Whenever the algorithm disagrees with reality, you just don't count that in your reward calculation.
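Here is a sketch of that replay computation on toy data mirroring the six-user example (two plays logged out of six impressions, three coincidences with the new policy):

```python
# Replay estimator sketch: keep only the logged impressions where the new
# policy's choice coincides with the logged random choice, and compute the
# take rate on those. Toy data mirrors the Disenchantment example above.
logged = [  # (user, image shown under random exploration, did they play?)
    ("u1", "A", True), ("u2", "A", False), ("u3", "A", False),
    ("u4", "B", True), ("u5", "B", False), ("u6", "B", False),
]
new_policy = {"u1": "A", "u2": "B", "u3": "B",
              "u4": "B", "u5": "A", "u6": "B"}

matches = [(user, img, play) for user, img, play in logged
           if new_policy[user] == img]
take_rate = sum(play for _, _, play in matches) / len(matches)
print(f"replay estimate: {take_rate:.0%} on {len(matches)} coincidences")
# Logged take rate was 2/6 = 33%; the matched subset converts at 2/3 = 66%,
# so the new policy is estimated to be twice as good -- but few coincidences
# mean high variance.
```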
This is actually a very interesting technique. There are proofs by Dudík, Langford, and Li and others on how this is unbiased, so it will give you the right answer asymptotically as you get more and more data; it's easy to compute, and the rewards you observe are real. The con is that it requires a lot of data, because your new algorithm has to coincide with what actually happened under the exploration policy. If your new policy and your exploration policy don't agree very often, you don't have enough data to really compute take rates; if they never agree, you have zero coincidences, and too few coincidences also leads to high variance. There are other techniques that go beyond plain replay, like doubly robust estimation, a self-normalized IPS technique by Thorsten Joachims, and a few others as well. But here's roughly what we predicted before we actually tried this out in real life. This is how well random image selection was doing; these are the tiny bars. This is how well the unpersonalized single best image for everyone works; that's the non-contextual bandit, in yellow. And the contextual bandit, in blue, has the highest take fraction, and you can see the error bars are tiny, so this is significant. This was strong evidence that this works and finds better images, and then, of course, we have to run a full A/B test to really prove it out. But what we realized is, here's the opportunity size before we go out and run the test, and we showed that artwork variety actually matters. It's not just finding one image and then throwing away all the others; for each user there's a subset of images that are good for them, so you want to keep around a portfolio of images, versus the old world of: here's the best image, let's forget all the others. One of the issues is that this has to scale before you do an A/B test. It turns out we have to serve, for every title and every user, the right choice of image; that requires a lot of calls from the home page and the search page and so on, roughly 20 million requests per second at peak. That's how many people and titles are being matched to images, so this has to be really fast. The engineering solution (I'll skip the details) is to precompute all the solutions rather than trying to compute them live. Of course, if you precompute, there are other issues; sometimes you have a cache miss and so on. So what we do is first serve the personalized solution if possible; if there's a cache miss or some other issue with the precompute, we fall back to the unpersonalized winner; and if the unpersonalized winner is not available for whatever reason, again because of some engineering constraint, we fall back to a default image that the actual content creators chose. Great.
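The serving fallback just described might look roughly like this (names and data structures are hypothetical):

```python
# Sketch of the three-level fallback: personalized precompute if available,
# else the unpersonalized winner, else the creator-chosen default.
from typing import Optional

def select_image(user_id: str, title_id: str,
                 personalized_cache: dict,
                 unpersonalized_winner: dict,
                 default_image: dict) -> str:
    img: Optional[str] = personalized_cache.get((user_id, title_id))
    if img is not None:                 # precomputed personalized pick
        return img
    img = unpersonalized_winner.get(title_id)
    if img is not None:                 # single best image for everyone
        return img
    return default_image[title_id]      # creator-chosen default
```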
So we tested it online as a real A/B test. It worked, and it helped increase streaming and retention and so forth, all our metrics, and so now it's rolled out to everybody in our 140-million-plus member base. It's most beneficial for lesser-known titles, which makes sense: everyone knows what Pulp Fiction is, so we don't really need to pick its images that well, but for something that just got added to the service that no one knows about, like Bright, for instance (that was a year ago, and no one knew what it was about), you have to pick the right image and explain why it's being recommended to you. Oh, we're going to recommend this to you because you like action, so we'll show you an action version of that image; versus: oh, this person clearly likes watching fantasy and orcs and so on, let's show the orc in that image. That gives you an explanation for why you should care about that title, and it also reduces competition between images across the page. Here's the video. Okay. So this again helps increase viewing, because you're not just personalizing the ranking or the layout; you're also personalizing the imagery. We're also moving this beyond just the actual image, to all the different choices you have on the page; we're looking at ways of personalizing the entire page, with all the possible evidence and images and descriptions. We personalize the ranking, which has been going on for a long time; we personalize the choice of rows and the layout, which has also been going on for a long time; and now we're also personalizing the images that you see. But we can also imagine using contextual bandits to pick the text that you see for each row and make that more meaningful for you; just like the images, we can try different texts there. We can also personalize the evidence you see here, which describes the title, and ask: how do I communicate a synopsis that's more relevant for you, using the same machinery again; and the metadata too, like: is this getting an award, is it in HD or not. And the trailers: we're also doing that personalization now with really the same engine; we have many trailers for the same show. In fact, we're even going further and asking: how do we predict from the raw pixels what the take rate will be? That's another research project now, because then we could actually predict it before we start randomizing and doing exploration with the contextual bandits.

Here's another nuance I wanted to point out, which is what happens when there's non-compliance; I'll just spend a minute on this and run through the slides very quickly. In other words, what happens if we're censored? So, we've talked about good prediction; we talked about being more causal; we talked about explaining your predictions with images and evidence, machine learning that explains the why. But now, what happens when we actually run in the real world and we're censored? It turns out that even at Netflix, even when you're running the algorithms in production, there's a lot of other stuff happening, other engineering systems, business logic, that actually prevents you from taking the actions you want to take. You want to show this image? Well, it turns out that
for legal reasons we're not allowed to show that image in that country, or there's some other engineering constraint, or that image gets a lot of cache misses for whatever reason because of the way it's stored in our data structures, and it didn't make it on time. So there are all these things actually happening under the hood, and those are related to this idea of non-compliance, which people are now starting to look at in contextual bandits. Sometimes the actions our bandits want to take are getting blocked. For example, this is an issue with FDA drug trials: you have patients coming in, and they're randomly assigned to a treatment, but then they don't actually comply with that treatment; the doctor says take three pills a day, and they take two pills instead, or they take four pills a day. Similarly, in a product setting, business and engineering constraints might prevent us from showing this image or taking this action, because there's other stuff happening in these massive distributed recommendation engines and streaming platforms. Okay, so I'll quickly summarize how we would modify Thompson sampling to incorporate the fact that we're going to get censored sometimes. Instead of just thinking context, action, and reward, which is the classical Thompson sampling algorithm we were talking about so far, let's imagine a situation where we don't just get to take this action A; there is also, in addition, a proposed action Z. So instead of (X, A, R), context-action-reward, it's X, then Z, which is your proposed action, then the real action A that gets taken, and then you observe the reward. This is Thompson sampling with compliance issues. Penna, in 2016, proposed an algorithm called Thompson sampling check, which ignores updates when the real action didn't agree with the proposed action. Of course, that's not so helpful when that happens often: you're basically ignoring, let's say, 90 percent of your data whenever the real actions don't agree with the proposed actions, and it's slower to converge. So we proposed a version called non-compliant Thompson sampling, where we posit that in addition to a reward model, which is roughly what we were learning all along with traditional Thompson sampling, there's also a non-compliance model, which says: for this particular proposed action, there's a probability of it switching to some true action; think of it as a transition matrix from proposed actions to implemented actions. That's really the main difference: we're going to estimate this in addition to estimating the rewards. We can also handle hidden proposed actions and hidden implemented actions, with latent variables on those; I'll skip that. But this is how it does in simulation, and we see much better scores when we compare TS-observed versus TS-check versus TS. Classical Thompson sampling is not getting as good a reward history as TS-check sometimes, but the typically better approach is TS-observed, which is what we're proposing: inferring the non-compliance model instead of just ignoring the situations where there was non-compliance and Z did not equal A.
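Here is a sketch of our reading of that idea: Thompson sampling that also maintains Dirichlet counts for the proposed-to-implemented transition matrix and credits rewards to the action actually taken. The dynamics and numbers are assumptions for illustration, not the paper's exact algorithm:

```python
# Thompson sampling with a non-compliance model (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
K = 3
true_reward = [0.2, 0.5, 0.7]
# True compliance (unknown to the learner): the proposed action goes
# through 70% of the time, otherwise it is overridden at random.
def implement(z):
    return z if rng.random() < 0.7 else int(rng.integers(K))

r_alpha, r_beta = np.ones(K), np.ones(K)   # Beta reward posterior per action
comp = np.ones((K, K))                     # Dirichlet counts for P(A|Z)

for t in range(10_000):
    theta = rng.beta(r_alpha, r_beta)      # sample a reward model
    p_a_given_z = comp / comp.sum(axis=1, keepdims=True)
    z = int(np.argmax(p_a_given_z @ theta))  # propose Z with best expected reward
    a = implement(z)                       # environment may not comply
    reward = rng.random() < true_reward[a]
    comp[z, a] += 1                        # update the non-compliance model
    r_alpha[a] += reward                   # credit the *observed* action A,
    r_beta[a] += 1 - reward                # not the proposed action Z

print("estimated P(A|Z):\n", np.round(comp / comp.sum(axis=1, keepdims=True), 2))
print("estimated rewards:", np.round(r_alpha / (r_alpha + r_beta), 2))
```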
And here's a real data set, a stroke trial, where we're looking at the regret in hindsight. Our algorithm here is in red, and it's beating classical Thompson sampling as well as another two-stage least squares technique that was also proposed to handle non-compliance. So if you have situations like we do in the real world, where you try to take actions with your machine-learned contextual bandits and for some reason they don't actually get implemented, you know how to correct for it: just learn that model of the non-compliance, basically modeling how you get overridden by downstream engineering systems and downstream humans who say, I'm going to ignore what you've shown me, or, I'm not going to take my medication the way you've proposed. So I'll stop here and switch to questions. Also, we are looking for strong PhDs in this space, so we're hiring for multiple positions; take a look at research.netflix.com. Thanks. There's a mic back there, I think.

[Audience question:] Do you have situations where multiple users share one account, and if so, do you then recommend, sort of, the intersection of their behaviors?

Tony: That's a good question. We do have multiple users sharing accounts, and how do we handle that? It turns out that sometimes you can identify who you are through profiles: you can create profiles saying, okay, this is Tony, or this is someone else in the household, or my daughter, and then obviously the view histories are going to be different, because they watch different types of things. But sometimes, when people don't even do that, we still just act like it's all one user, and it still works out in the end. That's because even one individual human is, in a way, a mixture of different contexts and moods. Sometimes you're in the mood for a really thought-provoking, deep drama, and sometimes you're in the mood for some comedy, something light-hearted, so in a way you are also consuming different personalities of content. So it actually still works even when multiple people share an account, but we obviously prefer people to switch profiles, because then it becomes an easier inference problem. Yeah?

[Audience question:] Is it individualized for different users, and if so, would it be beneficial to also transport the models across different groups of users?

Tony: That's a good question. The model I showed, I only summarized its non-personalized aspect, just the channel, but it is personalized as well, based on other aspects of the user, like how much they watched in the past and other types of profile information; there are other features about the users. I only showed three outputs, to summarize just the effect of the channel, but the causal model is a personalized model under the hood, yeah.

Anna: I also wanted to ask about data privacy. Are there any data privacy issues, in any context, that are being explored and addressed at Netflix?

Tony: Sure. Data privacy is a very important question, and Netflix takes a really strong stance on it. These days we don't release data; in 2006 it was a very different world, but now there's a lot of additional regulation around data. And the other fundamental difference is that Netflix is a subscriber business,
so we really are only doing this to benefit the actual subscribers. All the data is really for the benefit of the users, rather than for advertising or other purposes. We don't share the data, and we don't do anything with the data that would be, let's say, maximizing profit without really maximizing the engagement and enjoyment of the users. So in a way, we're very aligned in how we use our data at Netflix.

[Audience question:] One last question. Netflix is creating a lot of content, so do you use this material to help the content people, and do you sell it to Hollywood, who are also creating content?

Tony: We work with content creators, and a lot of them really care about how their titles are being handled on the service. We'll provide them with, you know, anecdotal information, but we don't necessarily share the specific data. We'll give them insights saying, this title did really well and this one was not performing as well, but we're not revealing the specific data you're seeing in these types of systems, again because this is still sensitive data. But yes, content creators really care about how a lot of this affects the consumption of their content.

Anna: Okay, so let's thank the speaker again. Tony is going to be with us all day, so there will be occasions to ask questions.

Tony: Thank you!