Recipe2Vec: How word2vec Helped us Discover Related Tasty Recipes | SciPy 2018 | Meghan Heintz

Captions
Hi, I'm Meghan Heintz, I'm a senior data scientist at BuzzFeed. My talk today is Recipe2Vec, or: how does my robot know which recipes are related? This is basically going to be a case study in how we used Recipe2Vec to create a consumer-facing data product for Tasty.

If you're not aware of Tasty, it's basically BuzzFeed's cooking brand: those top-down, sped-up cooking videos that you've probably seen on Facebook. It came about when Facebook made an algorithm change that started biasing towards video. BuzzFeed data scientists did a bunch of experimentation and figured out that not only was Facebook biasing towards video, but very specifically towards videos of a certain length, and we figured cooking was a great format for that. We basically exploited the hell out of that and grew a huge fan base, so we expanded from our Facebook page to YouTube, Snapchat, and Instagram. This was really, really successful for us; the general manager of Tasty always says we're the biggest cooking brand in the world right now. I'm not really sure which metric she's using to measure that, but it certainly sounds nice.

But this had two problems for us. First, just as we were able to exploit that initial Facebook algorithm change, we were incredibly vulnerable to the subsequent algorithm changes; we didn't really own our users, they were all out on these distributed platforms. The other side is that we knew, both from looking at our own data and from user research, that people were actually trying to cook these recipes, and it was a pretty miserable process for them. We would see people scrubbing and watching little sections of a video over and over again as they desperately tried to jot down the ingredients and the preparation steps. That was just not great, so we decided to remedy it and build two new destinations, our iOS app and our website, so that we could control that experience and give people a really easy way to come in, find these recipes, and actually cook them.

We knew from the beginning that we wanted to create a related-recipes feature, and we wanted it for two reasons. One, somewhat selfishly, we wanted to recirculate users from our most recently published recipes to our older recipes. And then, for the user-centric reason, we saw from our own data on BuzzFeed.com's video pages that people tend to look at the same type of recipes in a session: a couple of dinner recipes, or a couple of meal-prep recipes, or a couple of dessert recipes. So we understood that when someone is looking for a recipe, they generally know what they want to cook, and that we could help them out by narrowing a potentially exhaustive search down to a handful of related recipes.

That kind of begs the question: why do you even need a data scientist to work on this? Don't we have video producers and chefs who can tell us which of our recipes are related? Turns out, not really. If you have ever worked with user-generated data, you will know that most of the time it is utter garbage. We got our producers to tag what they thought was a dinner recipe or a breakfast recipe or comfort food or dessert, and it ended up being a giant mess. My favorite example of really awful human tagging is a list of mocktails: yes, technically they do not have alcohol in them, but I certainly do not want you recommending that my child drink a Moscow Mule.
Another one that drove me completely insane: we had these little cocktail teriyaki meatballs, the kind you would have at a fancy cocktail party, and they were all tagged as breakfast. Really, who's eating these for breakfast? So we know we can't rely on humans to do this, and we're going to have to figure out some way to do it programmatically.

The first thing we thought about for categorizing this data was using the list of ingredients, treating the ingredients as categorical variables. First we thought maybe we could use dummy coding; you might have used this before with pandas get_dummies, which takes all of your categories and turns each one into a binary dimension, zeros and ones for whether or not the ingredient is present in the recipe. The other way you can do this is label encoding, where you take your list of categories, or words, and convert them into a list of numbers.

Both of these methods have issues for us. With get_dummies, you're adding a dimension for every ingredient, so the dimensionality grows, we hit the curse of dimensionality, our data becomes sparser and sparser, and it's harder to converge on a solution. With label encoding we have a different issue: we're implying that there's some type of order, or ordinality, and there's nothing in the physical world that tells me that apple should be 2 and yogurt should be 3,000. Both methods also share another problem: we had a lot of different ingredients in our recipe space that were very, very similar but would be encoded completely differently. Things like "2% Greek yogurt" and "Greek yogurt" would be two totally different columns with the dummy coding method, and could be totally different numbers with label encoding, not even close together; one might be 10,000 and the other 12. There are some other, more advanced techniques like polynomial and Helmert coding, but none of them really solve these issues for us, so we're not going to move forward with this approach.
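As a rough illustration of the two encodings just described, here's a minimal sketch using pandas get_dummies and scikit-learn's LabelEncoder on a toy ingredient table; the DataFrame and its column names are made up for illustration, not the actual Tasty data:

```python
# Toy comparison of dummy coding vs. label encoding for ingredients
# (illustrative data only).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

recipes = pd.DataFrame({
    "recipe": ["parfait", "parfait", "smoothie"],
    "ingredient": ["greek yogurt", "2% greek yogurt", "mango"],
})

# Dummy coding: one binary column per distinct ingredient string, so the
# number of dimensions grows with the vocabulary and near-duplicates like
# "greek yogurt" vs. "2% greek yogurt" become unrelated columns.
dummies = pd.get_dummies(recipes["ingredient"])
print(dummies)

# Label encoding: one integer per distinct ingredient, which implies an
# ordering that has no physical meaning.
labels = LabelEncoder().fit_transform(recipes["ingredient"])
print(labels)
```

Even on this toy table, the two yogurts become two unrelated columns under dummy coding, and the integer labels carry an ordering that is just an accident of sorting.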
Instead, we're going to take a word embedding approach. For word embeddings, you have a raw text corpus, in our case recipes, and you create vector representations of the words in that corpus. In this example, eggplant ends up being a list of small floating-point numbers. The intuition is that when you plot these vectors in space, similar words end up spatially closer together than dissimilar words, so dolphin and porpoise are close together while unrelated words end up relatively far apart.

There are actually a lot of different ways to make word embeddings, but they all coalesce around the same idea: a word is characterized by the company it keeps. One very simple way to make them is TF-IDF, term frequency-inverse document frequency, which looks at the frequency of a term in a document versus in the rest of the corpus; it gives you a sense that "the" is not that important but maybe the word "mango" is. There's another family of word embedding methods based on word-word co-occurrence matrices, the most famous of which is GloVe, developed at Stanford; it's a log-bilinear model with a weighted least-squares objective. And then there's a family based on neural networks, the most famous of which is word2vec, developed at Google; it's essentially a two-layer neural network, and that is the one I ended up using.

word2vec has two different implementations: one is skip-gram and one is continuous bag of words (CBOW). We're going to talk about skip-gram today, and I'll throw in how that differs from CBOW as we go. To actually implement this, we take a sentence like "add sugar, vanilla, and salt and then beat until very smooth" and decompose it into context words and target words. The context words have a window; in this example the window is one word, but you would probably use something like five or ten words. So "add" and "vanilla" end up being the context words for "sugar", "sugar" and "and" are the context words for "vanilla", and so on and so forth. Each word is initialized with a random vector of very, very small values. You take those vectors and try to predict the context words from the target words using a softmax regression classifier, which is basically a generalization of logistic regression. In continuous bag of words it's the opposite: you're trying to predict the target words from the context words. These are essentially tricks to generate additional training samples for your model.

The first time around, the prediction is going to be complete garbage because you have totally random vectors, but that's okay, because you go back and update your word vectors, taking a small step to maximize your objective function using stochastic gradient descent, which is basically a hill-climbing method, and backpropagation, which updates the weights in your neural network's layers. You do this over and over again until you reach some stopping criterion, be it a maximum number of iterations or your word vectors no longer changing very much. From a neural network standpoint it looks like this: you have your input vector, a hidden layer of linear neurons that you update with backpropagation, and an output softmax classifier that gives you a probability for each word.

The way word2vec differs from earlier neural network implementations comes down to three things: word pairs and phrases, subsampling, and negative sampling. For word pairs and phrases: previous implementations treated every single word as a completely separate entity. The canonical example from Google's paper is that "Boston Globe" now gets its own word vector rather than two separate vectors for "Boston" and "Globe", which matters because it obviously means something very different together than apart. They also implemented subsampling, which takes the really frequent words and decreases their occurrences in the training samples, because it really doesn't help that much to constantly recalculate the vector for "the". The other change was negative sampling. Before, once you made a prediction, you would go back and update the word vectors for all the words that were not the predicted word, and that takes a lot of time and computation. Instead, you randomly select a few words that are not the predicted word and update just those, rather than updating the vectors for the entire corpus; and then you also update the vector for the positive word, the word you should have predicted.
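Here is a minimal sketch of training a skip-gram model on tokenized preparation steps with gensim, the library mentioned later in the talk; the two toy sentences and the exact hyperparameter values are illustrative, and parameter names follow gensim 4.x:

```python
# Train a small skip-gram word2vec model on toy preparation-step tokens.
from gensim.models import Word2Vec

sentences = [
    ["add", "sugar", "vanilla", "and", "salt", "and", "beat", "until", "smooth"],
    ["whisk", "eggs", "and", "sugar", "until", "pale", "and", "fluffy"],
]

model = Word2Vec(
    sentences,
    vector_size=70,   # dimensionality of the word vectors (the talk uses 70)
    window=5,         # context window on each side of the target word
    sg=1,             # 1 = skip-gram, 0 = continuous bag of words (CBOW)
    negative=5,       # negative sampling: update only a few "wrong" words
    sample=1e-3,      # subsampling threshold for very frequent words
    min_count=1,      # keep every token in this tiny toy corpus
)

print(model.wv["sugar"][:5])            # first few components of a word vector
print(model.wv.most_similar("sugar"))   # nearest words by cosine similarity
```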
So once we have these word embeddings, how do we actually go about evaluating them? These vectors are something like 70, 100, or 200 dimensions long, and we can look at a cube and see that, as humans, we're not even really able to evaluate three dimensions, so we're probably going to start with some type of dimensionality reduction. There are a couple of ways to do this; I think earlier in the conference we learned about a new one called UMAP, but generally we have the matrix factorization methods, like principal component analysis (PCA), and the neighbor-graph methods, like t-SNE. I find PCA a little more intuitive, but we're actually going to use t-SNE here, basically because it's been found to produce somewhat better visualizations: it won a kind of prestigious Kaggle award, and it solves something called the crowding problem, where all your observations end up in the center of the plot. It's not a silver bullet, though; these plots can be difficult to interpret. First of all, hyperparameters really, really matter: perplexity, which is basically the knob that tunes whether or not something counts as a neighbor, can drastically change the output of your visualization, so you'll want to look at a few different versions. Even with the exact same hyperparameters, every time you make one of these visualizations it will come back a little bit different. You also don't want to over-interpret these plots: the sizes of the clusters and the distances between them don't necessarily mean anything.

So after training my word2vec model on the preparation steps in my recipes, I looked at the word embeddings for, I think, the hundred most common ingredients in my dataset. I apologize that the font is a little small here, but in blue we see all of our different pastas, like spaghetti and linguine, together; in purple, all of our alcohols, like bourbon and gin and whiskey, are together; coffee is in red; and up at the top in green are sage, rosemary, and thyme. So we can see that our model is learning something interesting and useful about this food space. But the fact that I'm a human who knows pastas should be close together is not really enough; we want to look at this from another perspective as well, so we're going to look at the cosine similarity of these vector embeddings. In the original paper, the relationship used to tune the model was between countries and their capital cities. In the food space we don't really have such strongly defined relationships; I don't know exactly how similar an apple should be to an orange, but I do have an idea that salad should be much less similar to cake than cake is to tart, and that guacamole should be much, much less similar to chocolate than chocolate is to cocoa and ganache.

Once I've played around with this and I feel confident that my model is learning something interesting about the food space, what do I do with it? I don't want related ingredients, I want related recipes. The nice thing about word embeddings is that they're modular: you can sum them up and they retain a lot of the same meaning. The canonical example of this is, of course, that the vector for king minus the vector for man plus the vector for woman is very similar to the vector for queen. We're going to use that same concept and sum up the word embeddings for all the words in a recipe's preparation steps to create our recipe vector. From there, we go back to t-SNE and evaluate how our recipe vectors are looking.
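Here's a minimal sketch of those last two steps, summing word vectors into recipe vectors and projecting them to two dimensions with scikit-learn's t-SNE; it assumes the `model` from the gensim sketch above, and the recipe names and tokenized preparation steps are made up for illustration:

```python
# Build recipe vectors by summing word vectors, then project with t-SNE.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tokenized_steps = {
    "chocolate cake": ["beat", "sugar", "eggs", "and", "vanilla", "until", "smooth"],
    "berry smoothie": ["blend", "berries", "and", "yogurt", "until", "smooth"],
    "garlic chicken": ["season", "chicken", "with", "salt", "and", "garlic"],
    "weeknight stir-fry": ["stir", "fry", "chicken", "with", "garlic", "and", "rice"],
}

def recipe_vector(tokens):
    """Sum the word vectors of the tokens that exist in the model's vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:                                   # no known tokens: fall back to zeros
        return np.zeros(model.wv.vector_size)
    return np.sum(vecs, axis=0)

names = list(tokenized_steps)
vectors = np.array([recipe_vector(toks) for toks in tokenized_steps.values()])

# Perplexity must be smaller than the number of points; on real data you would
# try several values, since it strongly affects the layout.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, names):
    plt.annotate(name, (x, y))
plt.show()
```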
Now, I know I said that humans were really terrible at tagging, but here we're showing some tags from our producers, with healthy in blue, comfort food in yellow, desserts in green, and alcoholic beverages in purple. Desserts in green are smeared across the bottom, while comfort food and healthy food near the top are kind of mixed in together, because, to be honest, none of Tasty's recipes are that healthy; I'm saying healthy-ish. But we can also see that all the eggnog recipes ended up together, our meal-prep recipes are close together, and near-duplicate recipes, like the turtle brownies, also ended up together. So we felt pretty good about how these recipe vectors were looking, and we moved on to actually productionize this and use it as a module on our website.

The way we productionized it: we have our MySQL database with our structured recipes; a queue reader that uses the gensim implementation of word2vec pulls in our preparation steps and trains our word2vec model; we create our recipe vectors; and then we use cosine similarity to find the 20 recipes most similar to each recipe. We store that in Redis, which is our key-value store; the Tasty API retrieves it and serves it to tasty.co and the Tasty app. We're publishing about 15 to 20 recipes a week, since video production takes a while; we apply the stale model every time a new recipe is created, and then we completely retrain the model every 12 hours.

This is what it actually looks like on our website today. I was really excited when I saw these sample results. We have a lot of these cheese-stuffed-in-meat recipes, and we found all of the cheese-stuffed-in-meat recipes; it's not really what I would like to have for dinner, but I think a lot of our fans love them. Previously we probably would have had to rely on tags to populate this module, and I can tell you that about half of these were tagged as healthy, so you can see why that would be problematic. Another example is this spring berry pie, where we found all these meringue, cheesecake-y fruit desserts; previously all we could have done was show desserts, which at least was a little more accurate than the healthy tag. Another example is this broccoli rice stir-fry, where we get all these weeknight rice-and-meat dinners together; and all of our alcoholic beverages are grouped together.

There are some weird quirks, though: some things that are very similar with respect to their preparation steps are actually quite different in their use. This piña colada recipe is obviously very similar to these smoothie recipes, but with respect to user intent it's much closer to the peach Rosarita or the berry vodka sunrise. Another one that came up was a gazpacho recipe that was also very similar to the smoothie recipes, which makes sense, since it's basically vegetables or fruit that you blend like a smoothie. You can handle those cases with some simple heuristics, but in general this was way better than we ever could have done using human tags.
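As a rough sketch of the nearest-neighbour batch step in the pipeline described above (not BuzzFeed's actual code), this reuses `names` and `vectors` from the previous sketch, finds each recipe's 20 most similar recipes by cosine similarity, and caches them in Redis under a hypothetical key format:

```python
# For each recipe, cache its 20 most similar recipes in Redis.
import json
import numpy as np
import redis
from sklearn.metrics.pairwise import cosine_similarity

TOP_N = 20
sims = cosine_similarity(vectors)          # pairwise similarity matrix

r = redis.Redis()                          # assumes a local Redis instance
for i, name in enumerate(names):
    order = np.argsort(-sims[i])           # most similar first
    related = [names[j] for j in order if j != i][:TOP_N]
    r.set(f"related:{name}", json.dumps(related))

# An API layer would then read e.g. r.get("related:chocolate cake")
# and serve those recipes to the site and the app.
```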
These are some other ways we either already are, or are investigating, using this at Tasty: predicting the performance of new recipes based on the performance of older, similar recipes; creating more context-aware recommendations by combining collaborative filtering with these recipe similarity metrics; making recommendations to producers about what types of recipes they should be making; and in general, these vectors are very useful as features for our various machine learning applications. That is the end of my talk, and I'm happy to take questions. BuzzFeed is also hiring, so if you're interested, get in contact with me.

How big is your training set, what's the dimension of your vector space, and what's the dimension after the dimensionality reduction? Sure, so we have about three thousand recipes, and we augmented that with another ten thousand. We have 70-dimension vectors; we played around with different dimensions for that. I had these giant spreadsheets where I was polling my coworkers on what they thought was a good result, and I didn't see huge differences moving from 70 to 100, so we went with 70, which I think was the default from the word2vec paper, and that actually worked pretty well. I feel like there was another part of your question? Yes, after the dimensionality reduction? Just two dimensions.

Another question on the word2vec model: the weights of the neural net are updated after each gradient descent step, and the vectors are also updated, so I don't understand why this actually converges, or does it converge? It's more that the vectors stop changing very much, or you say "I'm going to do a thousand iterations and after that I'm done."

Have you thought about adding nutritional information? Yes, that's a big project this summer; we have a lot of people working on it right now. We initially shied away from it, because we'd look at these recipes with a pound of mozzarella cheese and think, yeah, that's going to be a lot of calories, so we were scared to add it in at first. We're working on it now: we have a product design intern doing mock-ups, and we're looking at different sources of nutritional databases, like a couple that the New York Times uses, and trying to figure out which are accurate enough.

How much of the recipe article is brought into the word2vec algorithm? A lot of recipe sites have a couple of chatty paragraphs before the recipe that don't cover technical details but might impart some wisdom in terms of user intent, like you were describing with the alcoholic beverages. We're primarily video, so we actually don't have those long paragraphs about the recipe, and we consider that one of our strengths; I know when I go to some sites to look at a recipe, I feel like I'm reading about someone's whole life when I just really wanted to make these muffins. So right now we just use the preparation steps.

Thanks a lot, this is a related question: when you split the ingredient list into these context and target words, is that a manual process? So it's actually not the ingredient list, it's the preparation steps. We initially had all these random plain-text files of recipes from the last four years, and we structured that into a database for the purpose of making the app and the website, so it wasn't a manual process; we just used MySQL and then munged them in Python.
But you need to specify your context and target words? No, that's all handled by gensim's implementation. One of the important parts is setting your window size: you could say my context is five words around, or ten words around, and that's an interesting hyperparameter to play around with. Are you trying to predict the next word? There are two implementations, continuous bag of words and skip-gram; in skip-gram you're trying to predict, I believe, the context words from the target word, and in the other it's the opposite. They're two different versions where you're either predicting the target word or predicting the context words.

Have you seen an increase in clicks on recipes since implementing the algorithm? This was a launch feature, so yes, we started from zero clicks. We took this approach in the first place because we were building a brand-new website, so we didn't have any historical data. We have other recommendation engines at BuzzFeed that are more typical item-to-item collaborative filtering, but we chose a purely context-aware approach because we had no historical data.

I see that you took an NLP route to solving this problem, and since there are tons of ways to vectorize words, what made you settle on word2vec? Was it just the ease of use, or did you try multiple methods and this was the best-performing one? I looked at a couple of different methods, and to be completely honest, gensim's implementation was just so easy to use; it was very much the path of least resistance. I've thought about other methods; I have an intern right now looking at doc2vec and another intern looking at GloVe as different implementations, but at the time it just worked really well and we were happy with it, and since we're a business, there were tons of other fires to put out.

Very interesting idea. It seems you're training word2vec on the Tasty recipe corpus, right? Do you think that's reasonable? In word2vec you learn from the context, so a word with a similar context gets a similar vector, but for example, sugar can appear in contexts like coffee and meringue, and it can also appear in contexts like fried pork, as in some Chinese dishes that use sugar. Do you think that will still create a sensible vector for sugar? So there is context, in that we're using the preparation steps and not the ingredient lists, but we do train on a very specific type of recipe. I think you're right: if we added recipes from a totally different cuisine, our vectors would probably be pretty far off, but luckily we're only making vectors within this space, where it's roughly the same 20 people writing the recipes.

Okay, we have time for one more question before we switch over. Great talk; I was wondering if you're tracking clicks through the website to validate whether these recommendations are actually related or not? Yes, we track everything. One thing to note is that we have this related-recipes module and a trending module, and we only show the trending module if we don't have similar recipes.
What we usually use to understand whether this is working is the recirculation rate and the exit rate: how often is someone actually clicking through to another recipe, versus leaving the website entirely? Right now it's contributing a pretty substantial portion of our traffic, so we're pretty happy with it. I think there's still a lot of experimentation we could do to make sure this is really the best way to do it, but right now it works well. All right, with that, let's thank Meghan for an excellent talk.
Info
Channel: Enthought
Views: 1,263
Keywords: python
Id: RTyHP_PiX9M
Length: 26min 23sec (1583 seconds)
Published: Sun Jul 15 2018