Intuition & Use-Cases of Embeddings in NLP & beyond

Captions
Let me begin with a question, and indulge me on this a little bit: do you remember where you were last year when you first heard that humanity had finally beaten the Turing test? This is what I'm talking about. This was in May of last year; I think most of you have probably seen it, but let's take a quick look. It's a demonstration of the Google Assistant talking with a shop owner who does not know they're talking to a machine:

"I'm looking for something on May 3rd." "Sure, give me one second." "For what time are you looking?" "At 12 p.m." "We do not have a 12 p.m. available; the closest we have to that is a 1:15." "Do you have anything between 10 a.m. and 12 p.m.?" "Depending on what service she would like — what service is she looking for?" "Just a woman's haircut for now." "Okay, we have a 10 o'clock." "10 a.m. is fine." "Okay, what's the first name?" "The first name is Lisa." "Okay, perfect — so I will see Lisa at 10 o'clock on May 3rd." "Okay, perfect, thanks." "Great, have a great day."

So this was May of last year, and I was actually there in the audience — this was Google I/O. I had been aware of developments in NLP, natural language processing; I knew what a lot of these systems were capable of, but I was really shocked upon seeing this. I had no idea we were this close, and for a second I felt: wait, does this constitute beating the Turing test? In reality it does not — this is a constrained Turing test; the actual test is a lot more difficult — but it's a really good preview of how that will ultimately sound.

This is called Google Duplex. I don't believe we have a paper on how it works, but there is a blog post about it. At the beginning and at the end you have speech-to-text and text-to-speech, and that's the kind of technology you may have used with Siri, with Alexa, with the Google Assistant. But I think a lot of the magic happens in the middle. We can call these, in general, sequence-to-sequence models. They take words, and they take them in the form of embeddings — which are vectors, which is the topic of this talk, which is basically just a list of numbers — and then they do the calculation and do the rest.

Duplex is only one manifestation of a number of natural language processing systems that keep developing super fast. This is a picture of how Google Translate works, from a paper back in 2016. To break it down into the major components: you take the input words, you turn them into embeddings, and that's how you feed them to the model — models deal with, or understand, words as vectors. In this case the embeddings are actually of parts of words, so "playing" would have "play" with its own embedding and "ing" with its own embedding. Then Google Translate does an encoding step and a decoding step, and it outputs words in the other language.

These models have been developing at a tremendous pace. We use them every day on our phones and our computers when we type — it's like we and the machines are depending on each other so much that we're starting to complete each other's sentences. It's not perfect, but it's developing pretty quickly. Think about the OpenAI GPT-2, published about a month ago, which is capable of writing tremendous essays. This is one example of it going on a rant about how recycling is bad, and I can easily compare this to comments I've seen on Reddit or on Facebook.
There's a lot of conviction behind it, and we wouldn't think that it was generated by a machine. So that's another example. We have a number of NLP systems and models that continue to do amazing things, and a lot of it is from just the last 12 months — these are some of the examples. Right now we're looking at these technologies that enable us to understand the complexity of language, and we're saying: maybe there's a way to use this to solve other complex problems, to find patterns in other sequences of data that we might have.

The main concept we're going to extract out of all of these models is the concept of embeddings. We'll have three sections in the talk: a bit of an introduction to how embeddings are generated; then using them for recommendation engines outside of NLP; and then we have — lucky number 13 — a section ominously called "consequences," and I hope we have enough time to get there. As you may have seen from the first slide, I'm using the Dune sequence of six novels as the theme, so there are going to be quotes here and there.

My name is Jay Alammar. I blog here and I tweet there. I've written a couple of introductions to machine learning, and a recap of the developments in natural language processing. The most popular post on my blog is "The Illustrated Transformer," which illustrates the neural network architecture called the Transformer. That architecture powers the OpenAI GPT-2; it powers the BERT model shown here; it powers DeepMind's AlphaStar, which plays StarCraft II — a complex strategy game — and was able to beat professional players. I also have some introductory posts there as well. I've created videos working with Udacity for the machine learning nanodegree program and the deep learning nanodegree program. My day job is in VC — we're the biggest venture capital fund in the Middle East — and from that perspective I try to think about these algorithms and how they apply to products. You'll see some examples of that here: we're going to talk about the algorithms, but also about products and how this reflects on products.

Let's begin with a simple analogy, just to get in the mood of talking about how things can be represented by vectors. Do you know these online personality tests that ask you a few questions and then tell you something about yourself? Not silly ones like this, but more likely MBTI, which would score you on four different axes. More commonly used is something like the Big Five personality traits, which is more accepted in psychology circles. You can take a test like that and it would rate you on each of these five axes, give you a score on each, and it really tells you a lot about yourself. One way you can take it is this FiveThirtyEight page: you go on there, they ask you thirty multiple-choice questions, and then they give you five scores along these different axes and tell you some things about your personality — things psychologists have been studying for tens of years, these five scores. Then they show you this graph, they show you how you compare to the national average and to their staffers, and you can send it around to your friends and compare. So this is a form, let's say, of embedding. This is my actual score along one of these axes: I scored 38
on extraversion, which means I'm closer to the introversion side — I thought I would be closer, but I'm near the middle. So that's one number that tells you one thing about my personality. Let's switch the range from zero-to-a-hundred down to minus one to one, just so we can think of these more as vectors. Now, this doesn't tell you a whole lot about me — it's one number, one axis of my personality — and you need a lot more numbers to actually represent a person. So let's take trait number two. I'm not saying which trait that is, because we need to get used to not knowing what the dimensions of a vector represent. That vector would look something like this.

Now assume that before coming here today I wasn't paying attention and I got run over by a big red bus, and QCon needs to replace me very quickly. There are two people; these are their personalities. Assuming they know just as much about the topic as I do, which one has the closer personality? This is an easy problem — linear algebra gives us the tools. We have similarity measures that let us compare vectors; the commonly used one is cosine similarity. We give it the two pairs of vectors, it gives us two scores, and we pick the one with the higher score.

Then again, two numbers is also not enough. You need more: psychology calls them the Big Five because those five tell you a lot, but some tests give you maybe twenty or even thirty scores, or axes. The thing is, when we go beyond two or three dimensions, we lose the ability to draw or plot things as vectors. This is a common challenge in machine learning: we always have to jump very quickly into higher-dimensional space, and we lose the ability to visualize things. The good thing is that our tools still work — we can still do cosine similarity with however many dimensions we have.

So there are two ideas I want to emerge from this section. First, you can represent people — you can represent things — by a vector of numbers, an array of floats if you like, which is great for machines: machines need numbers to do calculations, and they're very good at that. Second, once you have these vectors, you can easily compare the similarity between one or two or three or a hundred of them. You can easily say "customers who liked this also liked..." and rank by similarity and just sort — you can see where I'm going with this.
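As a side note, here is a minimal sketch of that similarity computation in plain NumPy. The personality scores are made up for illustration; the same function works whether the vectors have two dimensions or fifty.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical personality vectors on the -1..1 scale (illustrative numbers only).
me         = np.array([-0.24,  0.66])
candidate1 = np.array([-0.40,  0.80])
candidate2 = np.array([ 0.85, -0.30])

print(cosine_similarity(me, candidate1))  # higher score -> closer personality
print(cosine_similarity(me, candidate2))  # lower score
```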
But before we get into recommendations, let's talk about word embeddings. We said that with people you can give them a questionnaire and learn about their personality; you can't do that with words. The guiding principle here is that words which occur next to each other tell us a lot — we can infer a lot of information from that. We'll look at how the training process works, but first let's look at an actual trained word vector. This is the word vector for the word "king." There are a number of different algorithms; this one is a GloVe representation. It's 50 floats, it's trained on Wikipedia and another dataset, you can download it, and it contains four hundred thousand words — "king" is one of them. The thing is, just by glancing at it you can't tell a lot; it's a lot of numbers and a lot of precision. So I wanted a more visual representation: let's put the numbers in a single row, but let's also color them. These numbers are all between 2 and minus 2, so the closer they are to 2 the more red they are, the closer they are to minus 2 the more blue, and if they're near the center they're, let's say, white. So this is one way you can look at a vector — this is the word vector for "king."

All right, let's look at some more examples. Can you find any patterns here? King, man and woman: comparing them, you can see that between "man" and "woman" there are a few things that are a lot more similar than, say, between "man" and "king." These embeddings have captured something about the meanings of these words, and they tell you about the similarities between them. We can go one step further. I have this gradient for you — queen, then woman and girl — and you can see that between "woman" and "girl" there's a lot more similarity than with the rest. Between "girl" and "boy" there are these two blue cells that aren't present in the rest — could these be coding for youth? We don't know, but there are similarities captured in the word vectors wherever there are similarities in the meanings as we perceive them. I put "water" at the end so you can see that all of the ones above are people and this is an object — does anything break? You can see that the red column goes all the way through, but the blue one breaks when you get to the object.

One of the more interesting ways to explore these relationships is analogies. This is the famous example from word2vec: if you take the word vector for "king," subtract "man" and add "woman," what would you get? "Queen," exactly. You get a vector that's very close to "queen." This is the gensim library for Python — g-e-n-s-i-m. You can use it to download a pre-trained set of vectors and say: king plus woman minus man — what are the most similar vectors to this resulting vector? It is "queen," and this is the similarity score between them, and by a large margin it's more similar than any of the other 400,000 words the model knows. When I first read this I was a little suspicious — does it equal it exactly? It doesn't: these are the three words, this is the resulting vector, and this is the closest vector to it. It isn't exactly "queen," but "queen" is the closest vector to it in the space. Another way to represent analogies: France is to Paris as Italy is to — and you have the answer there, it's Rome.

That's really powerful, but we've known all of this since about 2013 or 2014. These examples are from the word2vec paper, and they have this visual: their embeddings are three hundred dimensions, they shrink them down to two dimensions using PCA, and you find the countries on the left, the capital cities on the right, and very similar distances between the countries and their capital cities.
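Here is roughly what that looks like with gensim, assuming the pre-trained 50-dimensional GloVe vectors that ship with gensim's downloader (the exact neighbours and scores depend on which vectors you load):

```python
import gensim.downloader as api

# Download 50-dimensional GloVe vectors trained on Wikipedia + Gigaword (~400k words).
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen: ask which vectors are most similar to the combination.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# France is to Paris as Italy is to ...
print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```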
To talk a little bit about history and how word vectors came about, we need to talk about language modeling. When I try to think of an example of an NLP system to give somebody, the first thing I think of is Google Translate, but there are better examples — examples we use tens or hundreds of times every day: our smartphone keyboards that predict the next word for us. That is a language model. How do they work? I used to have a hand-wavy idea — okay, it scanned a lot of text and it has probabilities and statistics — but let's take a look at how they really work. Let's assume we shape the problem so that the model takes two words as input and outputs a third word as its prediction, with the task of predicting the next word. Think of it as a black box for now: two words in, one predicted word out.

That's a very high-level view, so let's slice it into layers. The next layer is to say that the early neural network language models would not output a single word; they output a vector whose length is the size of the model's vocabulary. So if your model knows 10,000 words, it gives you a vector of 10,000 values, each value a score for how probable that word is to be the output. If the model is going to output the word "not," it assigns the highest probability to the index in that vector associated with "not."

Now, how does the model actually generate its prediction? It does it in three steps, and the first step is what we care about most when we talk about embeddings. It has the words "thou" and "shalt." The first thing it does is look up the embeddings of "thou" and "shalt," and it does that from a matrix of embeddings that was generated during the training process. Then these are handed over to calculate a prediction — basically multiplying by a matrix, or passing them through a neural network layer — and then projecting to the output vocabulary. The details of this model are in the Bengio paper from 2003.

That's a look at how a trained model makes a prediction, but we also need to know how it was trained in the first place. The amazing thing about language models is that we can train them on running text. We have a lot of text out there in the world; that's not the case with many other machine learning tasks, where you have to have handcrafted features. We have a lot of text in Wikipedia, we have books, we have news articles — a tremendous amount of it. So if a task can be trained on just running text, that's incredible, and that's what we saw with something like GPT-2, which was trained on 40 gigabytes of text crawled from the internet starting from links posted on Reddit. There's no shortage of text — that's an attractive feature of language models.

So let's say we have an untrained model that takes two words and outputs a word, and we throw Wikipedia at it. How is the training data prepared? We take our articles, we extract the text out of them, and we basically have a window that we slide over the text; that window extracts a training dataset. We can use this quote from Dune again to look at how that window works. The window starts on the first three words: the first two words, on the left, are the inputs — we can call them the features of our dataset — and the third word is the label, or output. We slide the window and we have another example; we slide it again and we have a third.
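A minimal sketch of that window-sliding step, using the Dune quote as the running text — two words on the left as features, the next word as the label:

```python
# Running text (the Dune quote used on the slides).
text = "thou shalt not make a machine in the likeness of a human mind"
tokens = text.split()

dataset = []  # rows of (feature_1, feature_2) -> label
for i in range(len(tokens) - 2):
    features = (tokens[i], tokens[i + 1])  # the two words on the left
    label = tokens[i + 2]                  # the word the model should predict
    dataset.append((features, label))

for features, label in dataset[:3]:
    print(features, "->", label)
# ('thou', 'shalt') -> not
# ('shalt', 'not') -> make
# ('not', 'make') -> a
```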
With 40 gigabytes of text we'd have an incredibly long table. Now, if I ask you this question, you have a little more context than the model does. A model might only be able to see the previous two or three words; you can see the previous five words, and you have context from earlier in the talk. So what would you put in the blank? "Bus," right? "Car" is also a good guess. What if I give you two more words on the right side of that blank? It would be "red," right? But that information on the right was not given to you at first, and there is value in it: there's information on both the left side and the right side, and if you use both when training or creating embeddings, there's value in that.

One of the most important ideas in these models is the window that looks at both sides: we look at the two words before and the two words after the word we're guessing. Two is an arbitrary number — five is more often used; it's a hyperparameter you can change based on your dataset — but let's look at two. How would we generate this kind of dataset that looks at both sides? We'd say "red" is our label, and the two words before it and the two words after it are our features. So our dataset would look like this: four features and an output. This is what's called CBOW, continuous bag of words, and it's widely used. But one that is even more widely used is called skip-gram, and it flips things around and does things a little differently: it says, I will use the current word to predict the neighboring words. The thing is, every time you slide that window you don't generate just one example — you generate four, or however wide your window is. The goal of the model is, given the word "red," to predict neighbors like "a" or "bus." Let's look at an example of sliding it over "thou shalt not make," with "not" as the word we're focusing on: we'd have four examples; we slide our window and we have four more; you go along the text and you create a lot of examples. Then we have our dataset and we're ready to train our model against it.
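A sketch of how sliding a skip-gram window (two words on each side) over the same text produces (current word, neighbor) training pairs:

```python
text = "thou shalt not make a machine in the likeness of a human mind"
tokens = text.split()
window_size = 2  # hyperparameter: how many neighbours on each side count as context

pairs = []  # (current word, neighbouring word) examples
for i, word in enumerate(tokens):
    for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
        if j != i:
            pairs.append((word, tokens[j]))

# With "not" as the focus word we get four examples, as on the slide:
print([p for p in pairs if p[0] == "not"])
# [('not', 'thou'), ('not', 'shalt'), ('not', 'make'), ('not', 'a')]
```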
You can think of this as a virtual two-step — you don't actually need to extract the dataset first and then train the model in that sequence, but it's a cleaner way to think about it: extract the dataset, then train the model against it. So we take our first example and give the feature to the model, and the model is not trained — it's randomly initialized. It does the three steps: it looks up embeddings, and they're garbage embeddings, randomly initialized; it hasn't been trained to do anything, so the prediction and the projection are not going to work well — we know that. It outputs essentially a random word. But the thing is, we know which word we were expecting. We say: no, you output this, but this is the actual target we wanted, and this is the difference — the error, how much your prediction was off. We feed that error back to the model: we update the embedding matrix, we update the two other matrices, and the model learns from the error. That nudges the model at least one step toward becoming a better model, and then we do the same with the rest of the examples — that's just the general machine learning template.

Now, one problem with this approach is that the third step, projecting to an output vocabulary, is very computationally intensive, especially if you're going to process a lot of text. We need a better, higher-performance way of doing this. To do that, we can split the problem into two problems: step one, create high-quality embeddings; step two, worry about a language model that outputs the next word. Step two we can very conveniently ignore in this talk and focus only on number one, because our goal is to generate high-quality embeddings. How can we do that? We can change the task: instead of "take one word and predict the neighboring word," we'll give the model two words, and it should give us a score from zero to one saying whether they are neighbors or not. If they're neighbors the score would be one; if they're not neighbors, zero; if it's in between, it's in between. This model is much faster — it's no longer a neural network, it becomes a logistic regression problem, and you can train it on millions of words in a few hours on a laptop. So there's a tremendous performance boost there. A lot of these ideas come from a concept called NCE, noise contrastive estimation — that's one of the roots from which a lot of these ideas bubbled up.

If we're changing the task, we have to change our dataset. We no longer have one feature and one label; we have two features and then a label, which is one — because all of these word pairs are neighbors, that's how we got them. But this opens us up to a lazy model that just always returns one. That's basically the entire model: return one. It would have perfect accuracy, it would fit this dataset incredibly well, but it would generate terrible embeddings. So we can't have a dataset of only positive examples; we have to challenge the model a little bit. We space out our examples — we didn't delete anything, we're just spacing them out — and we say: here's a challenge, we'll add some negative examples, words that are not neighbors. For each positive example we'll add, let's say, two; you can use five or ten. But what do we put there — which words do we know are not neighbors? We just randomly select them from the vocabulary. They are negative examples that were randomly sampled: this is negative sampling. There's a little more detail that goes into it — for example how you handle very frequent words like "a" or "the" that don't give you much information — but that's a detail you don't need to worry about now.
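To make the idea concrete, here is a toy sketch of skip-gram with negative sampling: two small embedding matrices, a sigmoid over the dot product as the "are these neighbors?" score, and a couple of randomly sampled negatives per positive pair. This is purely illustrative — not the optimized word2vec implementation — and the tiny vocabulary and pair list are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["thou", "shalt", "not", "make", "a", "machine", "in", "the",
         "likeness", "of", "human", "mind"]
word2id = {w: i for i, w in enumerate(vocab)}
dim, negatives_per_positive, lr = 8, 2, 0.05

# Two randomly initialised embedding matrices: one for focus words, one for context words.
emb_in = rng.normal(scale=0.1, size=(len(vocab), dim))
emb_out = rng.normal(scale=0.1, size=(len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Positive (focus, neighbour) pairs from the window-sliding step; just a handful here.
positives = [("not", "thou"), ("not", "shalt"), ("not", "make"), ("not", "a")]

for epoch in range(200):
    for focus, context in positives:
        f = word2id[focus]
        # One positive example (label 1) plus a few randomly sampled negatives (label 0).
        samples = [(word2id[context], 1.0)]
        samples += [(int(rng.integers(len(vocab))), 0.0) for _ in range(negatives_per_positive)]
        for c, label in samples:
            score = sigmoid(emb_in[f] @ emb_out[c])  # "are these two words neighbours?"
            grad = score - label                     # error of the logistic prediction
            v_in = emb_in[f].copy()
            emb_in[f] -= lr * grad * emb_out[c]      # nudge both embeddings to shrink the error
            emb_out[c] -= lr * grad * v_in

# After training, emb_in holds the word embeddings we keep; emb_out is discarded.
```

Real implementations add the details skipped here, such as sampling negatives from a smoothed word-frequency distribution rather than uniformly.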
With this, I'd like to welcome everybody to word2vec. These — skip-gram and negative sampling — are the two central ideas about word2vec that are being used right now in recommendation systems, and they're the building blocks we needed to establish before going on. To recap: we have running text, we slide a skip-gram window over it, we train a model, and we end up with an embedding matrix containing embeddings for all the words we know. By the same token, if we have a click session — a user going around clicking on products on our website — we can treat that as a sentence, skip-gram over it, and we'd have an embedding for each item, each product that we have, which we can use to do very interesting things. We'll get to that in a second.

But an important thing to discuss when addressing embeddings is that they encode the biases present in the text you train them on. If you look at analogies — man is to doctor as woman is to... — what does the model put here? Nurse, exactly. And this is a dataset that was not trained on social media; this was trained on Wikipedia — a dataset you wouldn't think would encode bias to this level. It's the same with embeddings trained on news articles. So we can't blindly apply these algorithms — this is something Martin also touched on this morning; we have to figure out where the problems are. There's a really good paper that addresses this, examines these biases in word vectors, and gives examples of how we can debias them. It has a very interesting visualization that projects words onto a "he" versus "she" axis, which tells you which occupations are most associated with "she" versus "he." Highly recommended reading, to learn a little about the bias that is encoded, without our thinking about it, into these models.

With that, we have completed our introduction to NLP, and we can start talking about using these embeddings in other domains. Airbnb have this incredible paper — I have a link at the end. Airbnb, as I'm sure you know, is a website where you can book a place to stay. Say a user visits the Airbnb homepage, and you record that in your log; they visit a listing; then they do a site search — they search London or something; they click on another listing, and then another one. We can delete everything that's not a listing from this clickstream — let's call it a click session — and we can do that with a number of our users; this paper did it with, I think, a hundred and eighty million click sessions. Then we can treat those as sentences, because the assumption here is that these users encoded a specific pattern they were looking for as they browsed these listings in succession. How do we extract that pattern out of these listings? Skip-gram. We treat them as sentences, we skip-gram over them, we create our positive examples, we sample negatives randomly from the other listings, and voilà — we have an embedding for each listing on our site. Now, the next time a user visits listing number three, we can take the embedding of listing number three, multiply it with this entire matrix, and that gives us the similarity scores of every listing to listing number three. So we can easily generate a list of the most similar listings and just show them to the user.
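Here is a sketch of that recipe using gensim's off-the-shelf Word2Vec on click sessions instead of sentences. The session data and listing IDs are made up, and the real Airbnb system adds the extra signals described next; this only shows the "treat sessions as sentences" step.

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's click session: an ordered list of listing IDs (made-up data).
sessions = [
    ["listing_700", "listing_203", "listing_3", "listing_415"],
    ["listing_3", "listing_415", "listing_88", "listing_203"],
    ["listing_88", "listing_700", "listing_3"],
]

# sg=1 -> skip-gram, negative=5 -> negative sampling, window=2 -> two neighbours each side.
model = Word2Vec(sessions, vector_size=32, window=2, sg=1, negative=5, min_count=1)

# "Users who viewed listing_3 also viewed ...": rank all listings by cosine similarity.
print(model.wv.most_similar("listing_3", topn=5))
```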
They actually go quite a few steps further, but we're going to talk about two. Let's say we've shown these three recommendations to the user, and they clicked on the first two but didn't click on the third. Is there a signal here we can extract from this interaction to improve our model? What they do is say: this one that was not clicked, we'll add as a negative example. So when we're training, or doing our skip-gram word2vec model, we know to push the embedding of listing number three a little farther away from that unclicked listing. That feeds back into the model — you can continue training it using this signal. One of the things that really stands out to me in this paper is that they take the word2vec terminology and tools and actually improve on them.

Here's another great one. You have click sessions; say the first two users haven't booked anything — they just visited a number of listings one after the other — but the last user did, and they booked that last listing, number 1200. Is that a signal? Can we encode it in how we embed our listings? What they propose is that when we do skip-gram, we include that finally booked listing as a positive example in every window we slide, even when it's outside the context window. For this one session that ended in a booking, we associate every listing the user saw with that last one. So when we skip-gram the first position, listing 1200 is there as a positive example; when we slide the window, it's still there — it acts like a global context.

This is the paper — it's tremendous. The first author has been thinking about this since his time at Yahoo; he's been writing about using word2vec for recommendations for a long time, so it's highly recommended reading. They showcase some of their results: they have a tool where you give it the ID of a listing — they chose this treehouse — and when they searched for it, the tool, based on this method, actually returned a number of other treehouses. They rolled it into production because it improved the click-through rate on similar listings by about twenty-one percent, and Airbnb is pretty sophisticated when it comes to this stuff — what they were using before was not something simple — so I think that really counts for something. A couple more ideas we don't have time to get into: they find a way to project both users and listings into the same embedding space, so you can choose a user and find the closest listings, or the closest other users, to them. You can really start bending space with these concepts.

Another example, which is similar but starts from a different place, is Alibaba. Alibaba has maybe the largest marketplace on the planet where consumers sell to other consumers — it's called Taobao, I believe. If you have hundreds of millions of products, you can't expect people to just browse through them; you really need to rely on recommendations, and the majority of their sales are accounted for by recommendations and recommended views. So how do they do it? They start with the click sessions, but they don't skip-gram them directly. They do something else: they build a graph. Take the first two items a user clicked — each one is a node, and there's a directed edge between them. Take the second pair, and we add another node and another edge. Then go to the second user and do the same — their path goes back, and you can see the weights build up. So this is a weighted graph that tells you how the items are connected and how often one leads to another. By the end, doing this with all of your users, you end up with a giant graph of how all of your items are connected, and which ones lead traffic to which others.

When you have this graph, you can do what's called a graph embedding. There are a number of ways to do it, but the one they use is based on random walks. Randomly select a node in the network — say item 100. Look at its outgoing edges and, using their weights, choose one to go visit. We visit that one — say it's item 400 — then do the same, and we stop at some point. That's one sequence. Then we pick another node at random and do the whole thing again. We generate sequences like this just by doing random walks, and that's a way to encode the structure of the graph in a set of sequences. Now what you do is skip-gram these sequences — that was their approach — and the rest is the same: you end up with item embeddings that you can use for recommendations.
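A sketch of the random-walk step under simple assumptions: the weighted, directed item graph is stored as adjacency lists (made-up items and weights), each step picks an outgoing edge with probability proportional to its weight, and the resulting walks are the "sentences" you would feed to the same skip-gram training as before.

```python
import random

# Weighted, directed item graph built from click sessions (illustrative data).
graph = {
    "item_100": [("item_400", 3), ("item_77", 1)],
    "item_400": [("item_77", 2), ("item_100", 1)],
    "item_77":  [("item_100", 1)],
}

def random_walk(start, length):
    """Walk the graph, choosing each outgoing edge with probability proportional to its weight."""
    walk = [start]
    node = start
    for _ in range(length - 1):
        edges = graph.get(node)
        if not edges:
            break  # dead end: stop the walk early
        neighbours, weights = zip(*edges)
        node = random.choices(neighbours, weights=weights, k=1)[0]
        walk.append(node)
    return walk

# Many short walks from random starting nodes; these play the role of sentences
# for the same skip-gram / Word2Vec training used on text or click sessions above.
walks = [random_walk(random.choice(list(graph)), length=6) for _ in range(1000)]
```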
They also go a couple of steps further: they describe how to use side information to inform these embeddings — how you can use, say, the description of an item to influence its embedding. There are a couple of really cool ideas in there.

The third and, I think, final example comes from ASOS, the fashion retailer, together with, I believe, some students at Imperial College here in London. They use embeddings to calculate customer lifetime value. They already had a system to calculate customer lifetime value, but it works on a lot of features hand-crafted by data scientists. They had a hunch — a hypothesis that customers with high lifetime value visit similar items at similar times, while customers with low lifetime value visit mostly during sales, or when a product is cheaper on the site than it is elsewhere — and it's very hard to come up with a hand-crafted feature that captures that sort of information. So they lay the data out a little differently: for each item, what is the sequence of users who have visited that item's page, or its screen in the app? This is no longer a click session; it's the users who have visited this item. They do this with all their items, then they skip-gram over these user sequences, and they end up with an embedding for each user. That's just one more feature they give to their model.

There are a couple more examples we unfortunately don't have time to get into. In music recommendations, Anghami, the music streaming service, has a great blog post about how they do this for music recommendation. Spotify have a presentation from, I think, 2015 where they mention it — these shops use ensembles of a number of different methodologies, but they use this one to inform their related artists: you have playlists that were created by users, you can skip-gram over them, and you get related artists. They also use it for radio — when you click an artist radio or a genre radio in Spotify, they use this kind of method along with a bunch of others as well.
If you want to go into the nitty-gritty and understand the probability and some of the statistics behind this, these were some of the best resources I was able to find. The Jurafsky book — I hope I'm pronouncing that right — is available for free online as a PDF; it goes into n-grams and language models, and it goes into word2vec. Goldberg's book is relatively new, and I also found it very accessible. Chris McCormick has incredible blog posts that talk about word2vec in general, but also about word2vec for product recommendations.

I wouldn't be doing the Dune theme justice if I ended without talking about consequences. Dune was published in 1965, and at that time — this Wikipedia quote says — it really got people to start thinking about the environment, because they started to think about the planet as one system where everything is connected; it was called the first planetary ecology novel on a grand scale. The first color images of Earth from space — we had things in black and white, but the first color one arrived around, I think, 1967 — also led people to start thinking about the planet and the environment in a different way.

When we think about recommendation systems, they're pretty cool, right? They recommend films and movies, and we can joke about Amazon recommendations, but you have to stop and think. People watch one billion hours of YouTube every day, and 70% of what they watch on YouTube is recommended by its algorithms. What does that mean? Humanity watches 700 million hours of video every day that were recommended by a recommendation algorithm. AlgoTransparency.org discusses a lot of this — it's an organization run by a former YouTube engineer who worked on these recommendations and now talks about their effects and how to monitor them. 700 million is a ridiculous number; we have no context for what that is, so we need to pull in an Al Gore-type comparison to see it. Television was invented 92 years ago, the telephone 140 years ago, the printing press 500 years ago, the earliest human writings are 5,000 years old, the agricultural revolution was 12,000 years ago, and behavioral modernity — when humans started burying their dead and wearing animal hides — was fifty-two thousand years ago. Seven hundred million hours of video is about 80,000 years. That's how much YouTube we watch every day.

To put it into context as well: two-thirds of American adults get their news and information from social media, and that fits into recommendation engines, because a lot of these algorithmic feeds are recommendation engines — they recommend content to you that is relevant to you. There are a number of ways you can think about this as harmful. One example I was able to find: the World Health Organization warned that cases of measles increased by 50% last year, and one hundred and thirty-six thousand people died from measles last year — so the trend is going upwards. This is happening all over the world, even in Europe, and they attribute it to a number of things; one of the reasons is misinformation on social media. So it wouldn't be far-fetched to say that, at this point in time, recommendation engines are a life-and-death matter. Facebook wrote a blog post about some of their thinking and what they're doing to combat a lot of this, around elections and along a number of different axes.
One of the interesting highlights in that blog post is this figure. They say: think about the different axes of content that can be generated — racist content, terrorism content, misinformation of any kind. If there's a policy line where that content gets banned, then the closer content approaches that line, the more people engage with it. So let's look at this. Say this is a racism axis, with very harmless talk about race on one side and calls for genocide on the other. You can draw the line here, or here, or here — it depends on each case — but the weird thing is that wherever you draw the line, engagement just shoots up as content approaches it. Wherever you put it. It's incredible — I would never have thought about this before learning about it. And when you think about it: if we're training our models on engagement data, we're telling them to push people toward borderline content. It's a little insane that we just blindly throw data — engagement data specifically — at these recommendation models that are dealing with content and information. And this is a recent realization: that post was from November or December. We're really still trying to figure these systems out; we're still feeling out the space.

What they're thinking is that as content approaches the line, they need to start demoting it and recommending it less. How does that work? Say this is the racism axis again. We know that engagement shoots up near the policy line. When content has been identified as being on the wrong side of that line — either through machine learning models that flag content or through human moderators — we remove it, so there's no engagement. But that's not enough, Facebook is saying: there needs to be another line that tells you where borderline content is. So we need to train machine learning models — and, I guess, people — that are able to find borderline content. And what do we do with it? We just don't recommend it; we demote it. We don't have to take it off the platform, because it's not illegal or against the policy, but we're not going to recommend it. Just about a month ago, YouTube said they are doing the same thing. So that's one of the ideas being adopted for these recommendation models: things like false claims or phony miracles, or claiming the earth is flat — they're going to demote videos like this. They say this kind of content accounts for about 1% of what's on YouTube, which is a big percentage: that's 7 million hours of video a day.

So that's one of the ways we're figuring this out. It's a little weird, because this is not what we signed up for when we went into software — we don't think about genocide on one side and freedom of speech on the other. But there's a saying that software is eating the world, and with that, software problems become planet-wide problems. That's why one of my favorite examples — and I'll close with this — is Full Fact, a UK-based fact-checking charity. They have people who fact-check news, but they also develop technology to do it. They partnered with Facebook at the beginning of the year to fact-check a lot of the content on there, and they have a great talk about one of the ways they're using to automate fact-checking.
I can summarize how they do it in one word for you: embeddings. Thank you very much. [Applause]
Info
Channel: InfoQ
Views: 8,831
Rating: 4.9633026 out of 5
Keywords: Artificial Intelligence, Machine Learning, NLP, Natural Language Processing, QCon, QCon London, Transcripts, InfoQ, Embedding, Word Embedding, Language Modeling, Skip-Gram, Language Model Training, Airbnb, Alibaba, ASOS
Id: 4-QoMdSqG_I
Length: 51min 0sec (3060 seconds)
Published: Wed Jul 03 2019