Robert Meyer - Analysing user comments with Doc2Vec and Machine Learning classification

Captions
Okay, well, thank you very much, and especially thank you to the PyData team for having me here. Can you hear me all right? Okay. So hi, I'm Robert, and I work as a data scientist for FlixBus, where I usually do optimization and machine learning. In my spare time I sometimes do similar stuff, but with slightly different data, and that is what I am going to talk about today. What I analyzed are user comments: comments that people placed under articles, which they may or may not have read, on different online news outlets. The question I am trying to tackle is: what can we learn from user comments on news sites? I guess most people would say not much, if anything at all; I mean, have you read them? But I am approaching this not to draw facts or wisdom out of the comments, but from a data science perspective: can we uncover some structure or some patterns in the data?

First I will briefly talk about how I actually scraped the data, where I got it from and how much there is. The main part of the talk will be about Doc2Vec, a neural network method for word and document embeddings, and of course I will start with Word2Vec, which is the basis for Doc2Vec. At the very end we will also briefly touch on supervised machine learning on top of the document embeddings that come out of the Doc2Vec network.

These are the three news sites I gathered the data from: Zeit Online, Spiegel Online, and Focus Online. This is an international audience, but I guess there are some people from Germany here as well; who is familiar with these news sites? Oh, that is actually plenty, and that is good, because when I looked at the data I had a working hypothesis in mind which goes roughly like this: there is arguably a gradient in quality in the articles themselves, so I thought there might also be a gradient in the comments, with the slightly smarter comments being made on Zeit Online and all the hatred and racism piling up at Focus. Later on we will take a look at whether we find evidence for this hypothesis in the data, but you will have to bear with me until the very end of the talk.

So first, let's talk about how I gathered the data. This is a screenshot of the comments section; each site basically looks like this, and people can put their comments there. What is nice is that if you look into the HTML source code, the comments actually show up directly in the source, which is super convenient: you can use something like the requests library together with an HTML parser such as lxml, parse the source code, and pull the comments straight out of it. That is what I did, and I gathered quite a lot of user comments: about 280,000 from Spiegel Online, 170,000 from Zeit Online, and roughly 50,000 from Focus, written for articles published between January 2014 and June 2016, so up until roughly a year ago.
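The scraping code itself is not shown in the talk; as a rough sketch of the approach described (requests plus an HTML parser), something like the snippet below would work. The URL and the CSS class are made up for illustration, and the real selectors differ for each of the three sites.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical article URL and comment markup; the real selectors differ per news site.
url = "https://www.example-news-site.de/politik/some-article.html"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
# The comments sit directly in the page source, so they can be picked out of the parsed HTML.
comments = [div.get_text(strip=True)
            for div in soup.find_all("div", class_="comment-text")]
print(f"{len(comments)} comments scraped from {url}")
```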
Of course, I did some very brief pre-processing of the data, for which I used the Natural Language Toolkit, and I really did not do much. These are three actual comments as they look in the raw data. As I said, these are German news sites, so the comments are in German, but I tried to translate them as well as possible. For instance, the top one, from Focus, basically says: teach unity and justice and freedom for our German fatherland at our schools. I took these comments, and the only pre-processing I did was lower-casing them and removing punctuation; that's it, no stemming or anything, just turning them into tokenized lists. For now, let's forget where these comments actually come from: in the first processing steps I do not care that the first comment was placed under a Focus news article. Let's just label them with individual labels, so this is the first document, that is the second document, the other one is the third document, and so on, so that every document has a label, but for now we do not care which news site the user comment stems from.
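A minimal sketch of that pre-processing and labelling step, assuming NLTK for tokenization and gensim's TaggedDocument for the per-document labels; the helper name and the sample comments are mine:

```python
import string
from nltk.tokenize import word_tokenize          # requires nltk.download("punkt")
from gensim.models.doc2vec import TaggedDocument

def preprocess(comment):
    # Lower-case and strip punctuation, then tokenize; no stemming, no stop-word removal.
    cleaned = comment.lower().translate(str.maketrans("", "", string.punctuation))
    return word_tokenize(cleaned, language="german")

raw_comments = [
    "Einigkeit und Recht und Freiheit ...",      # placeholder comments
    "Noch ein ganz normaler Nutzerkommentar.",
]
# Give every comment its own label; the source site is deliberately ignored at this stage.
tagged_docs = [TaggedDocument(words=preprocess(c), tags=[f"doc_{i}"])
               for i, c in enumerate(raw_comments)]
```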
So then I had this data and I thought: okay, I want to do some cool stuff with this, I am just going to throw it into a Doc2Vec network, this super cool deep learning library for word embeddings and text processing, and then some amazing stuff pops out at the other end. It turns out I got two things wrong. The first thing is that it is actually not deep learning. It says so on the gensim page, the library I used, but there is nothing deep about it: it is just three layers, an input layer, a hidden layer, and an output layer, so it is more like a normal artificial neural network. The other thing I got wrong is that the amazing stuff does not pop out at the end; it actually squeezes out at the side. Never mind, there is still amazing stuff happening. How that works I will explain in a minute, but let's assume for a moment that we already have this amazing stuff. What is it?

For now, let's focus on Word2Vec, because Doc2Vec gives us both word embeddings and document embeddings, and Word2Vec is the basis of the whole thing. What Word2Vec gives us are so-called word embeddings, that is, vector representations of words. Say we have trained a network on a decent corpus like Wikipedia; then we would get an n-dimensional vector for each word. Here I chose four dimensions, and since four dimensions are hard to draw, I draw it in two. So we have a vector representation for a word, say "hamburger", which is some vector in this space. The nice thing is that we can do operations on it. First of all, we can compare it to other vectors: we would, for instance, find that the representation of the word "cheeseburger" is very similar to that of "hamburger", so these vectors encode some sort of meaning, and these two are quite similar, but they might both be very different from, say, the word "FlixBus". That is what is nice about these vector representations: we can compute similarities among words. For those interested in the math, what you usually use is the cosine similarity, which is basically the dot product divided by the product of the norms of the two vectors.

But that is not the only thing you can do; you can also do mathematical operations, and you have probably seen this example a lot, because everyone uses it. You can add and subtract these vectors. For instance, we can take the vector representation of the word "king", subtract the word "man", add the word "woman", and what we get is the vector representation of the word "queen". This is super nice; this is the amazing stuff that pops out of these networks.
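For reference, the cosine similarity he mentions is straightforward to compute by hand; here is a small numpy sketch with made-up four-dimensional vectors standing in for the embeddings on the slide:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms; the result ranges from -1 to 1.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings just to illustrate the comparison.
hamburger    = np.array([0.9, 0.2, 0.1, 0.4])
cheeseburger = np.array([0.8, 0.3, 0.1, 0.5])
flixbus      = np.array([-0.3, 0.9, 0.7, -0.2])

print(cosine_similarity(hamburger, cheeseburger))  # high: related words
print(cosine_similarity(hamburger, flixbus))       # low: unrelated words
```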
So let's quickly look at how this actually works. What a Word2Vec network does is learn to predict words from a given context: give me a context of words and I predict the most likely word to occur in that context. For now, let's keep the context very simple and use only the preceding word: given the first word, what is going to be the second word? That is what we train the network on. This is how the simple version of a Word2Vec network looks. The input layer is usually fairly big, because it has one neuron for each unique word in your data set, in your corpus. For instance, the little fellow at the very top, x1, might encode the word "unity", its neighbouring neuron might encode the word "justice", the next one "freedom", and so on; so we have an input neuron for every unique word in the data set. The output layer on the right-hand side uses the very same encoding: the same words that appear on the input layer also appear on the output layer, so the first output neuron might encode the word "unity", the second one the word "justice", and so on and so forth.

What we train the network on is making predictions: what is the most likely word given a particular context, and as context, as I said, let's only use the preceding word. Training works like this. Take the word pair "unity" and "and" from our example comment. We use "unity" as the input, and because each of these neurons encodes one particular word in our corpus, the input vector is just a single one for x1, the little fellow that encodes "unity", and zeros for the rest. We put that into the network, propagate it through the hidden layer up to the output layer, some mathematical gibberish happens in between, and we get some output vector. If this is our very first pair of training data, what we get out is just random gibberish. But now we can compare this random gibberish to the actual target value we are trying to predict: because we train on pairs, the target here is the word "and", so the target vector, as you can see on the right-hand side, is just a bunch of zeros and a single one for the output neuron that encodes "and". We compare these two vectors, compute some error measure, propagate the error back through the network, and adjust the weight matrices, which are part of the intricate mathematical operations in the middle, such that the output vector resembles the target vector a bit more. We do this back and forth with our entire training data; the next pair we train on would be the word "and" with the target "justice", and we do a couple of sweeps through the whole training set, all the documents, all the user comments, always adjusting the weights to make the output vector look more like the target vector.

So after training the network, what do we get? Say we have done several sweeps through the training data and we now put in the word "and" again. The actual mathematical operation is a linear operation where we multiply the input vector with a weight matrix, then another linear operation, and on top of that a nonlinear operation called softmax; never mind the details. What it gives us is the probability of another word given the first word. Here it would tell us, for instance, that the probability of the word "justice" given the word "and" is something like 10 percent after training, "freedom" is also 10 percent, and "and" itself is less likely, say 5 percent. So now we have a trained network, but where are the word embeddings? If you remember, I told you that the amazing stuff does not come out at the back, where we get these probability distributions; it squeezes out at the side. If we look at this weight matrix, that is where the word embeddings can be found. The matrix has as many rows as we have unique words, which means for each word, x1 up to xV, we have a corresponding word vector in this matrix. We can simply look up the k-th row, which would be the word vector for the word "and", and we get our vector representation. In the examples before I used four dimensions, but usually you choose something between 100 and 300, which is also the number of hidden nodes in your network.
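To make the architecture concrete, here is a toy numpy sketch of what he describes: a one-hot input, a linear hidden layer, a softmax output, and word embeddings that live as rows of the input weight matrix. The vocabulary and dimensions are made up, and real implementations (gensim included) add optimizations such as negative sampling:

```python
import numpy as np

vocab = ["unity", "and", "justice", "freedom"]
V, H = len(vocab), 3                       # vocabulary size, hidden/embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, H))             # input -> hidden: one row per word = the word embeddings
W_out = rng.normal(size=(H, V))            # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.zeros(V)
x[vocab.index("and")] = 1.0                # one-hot vector for the input word "and"
hidden = x @ W_in                          # simply selects the row of W_in belonging to "and"
probs = softmax(hidden @ W_out)            # probability of each vocabulary word following "and"

embedding_of_and = W_in[vocab.index("and")]   # the embedding that "squeezes out at the side"
```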
Okay, so this was Word2Vec in a nutshell, in a very simple version. The actual thing is slightly different and slightly more complex, but it does pretty much the same. So far our context was just the preceding word and we asked for the probability of the following word; what you actually do is train on a broader context. For instance, what we really want is something like: predict the word "and" given the context "unity", "justice", "freedom", and "for". But it is basically the same thing: we put the context at one end of the network and try to predict the middle word, so to say, at the other end, and the same property still holds, so we can look up the word vectors in these weight matrices.

So that was Word2Vec; now let's make another step and look at Doc2Vec, and the step from Word2Vec to Doc2Vec is actually rather small. We now have word embeddings, vector representations of words, but what we are after are vector representations of entire documents. These documents can be arbitrarily complex; in our case they are user comments, but it is not limited to that. They could be entire books: we could, for example, train a network on literature and find that Alice in Wonderland is pretty similar to Through the Looking-Glass, but very different from the vector representation of Das Kapital by Karl Marx. So how do we make the step from Word2Vec to Doc2Vec? As I said, the step is tiny: it is almost the same network as before. We train the network on "given a context, please predict the word in the middle", and the only thing we add is a document tag on top of the context. So we have the context plus the document tag on the left-hand side as input, and just the prediction of a word as output. Because our example was the very first user comment, which carried the document label doc 1, the input here is: predict the word "and" given the words "justice" and "freedom" and given document tag number one. The beauty is that there is again a weight matrix, and we can look up the k-th element in this matrix to get the vector embedding for a particular document, the k-th document. And that is the nice thing: we now have these document embeddings, and we also get the word embeddings for free, so this gives us both things at the same time, document and word embeddings.
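A hedged sketch of the training step with gensim, reusing the tagged_docs list from the pre-processing sketch above; the hyperparameters are illustrative, not the ones used in the talk:

```python
from gensim.models.doc2vec import Doc2Vec

# Distributed-memory Doc2Vec: context words plus a document tag predict the target word.
model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=10)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

doc_vector = model.dv["doc_0"]   # document embedding (model.docvecs in older gensim versions)
word_vector = model.wv["und"]    # the word embeddings come for free (any word in the vocabulary)
```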
Okay, so this was a very brief and quick overview of how Doc2Vec works. Now let's look at what comes out if I take the user comments I scraped from these online news sources and train a Doc2Vec network on them. For that I used the amazing gensim library, which has a really nice Python implementation. I took all the user comments, ran them through the network for a couple of epochs, it ran for a couple of hours, and then I had my document and word embeddings.

Let's look at the word embeddings first. I checked whether my network actually picked up something meaningful or whether the whole computation time just produced garbage. We can ask the network: what do you think is the most similar word to, for instance, the word "car"? So we look for the word embedding closest to the vector of "car", and the network responds that the most similar word is "Automobil". Very nice, that makes sense; car and Automobil are closely related, and in brackets you see the actual cosine similarity, which ranges from minus one to one, where one is perfect alignment, perfect similarity. Okay, car and Automobil, that makes sense. But we are looking at user comments here, so let's ask some nastier questions. Dear beloved Doc2Vec, what is most similar to "fake news"? In German, fake news is actually used as a single word, a recently imported term, which is why this works here. And this is what it spits out: it thinks the most similar word to fake news is "Gutmensch", which I translated as "do-gooder", this demeaning term for people who think it would be very nice if we were kind to each other. Super nice. You can look at more results: a close second is "Putinversteher", which I translated as Putin's disciple, and in third place is "conspiracy theory". We can look at even more. Since the only pre-processing I did was lower-casing and removing punctuation, abbreviations are basically words too. The NPD, that is a horrible German party, basically the new Nazi party, so we can ask the network: what do you think is the closest matching word to NPD? And it is the CDU, the German Christian Democratic party; a close second is the CSU, its Bavarian sister party. That is a bit odd, but this is numerics, and maybe it is actually true. In third place is the FIFA.

Okay, let's move on. So far this was about similarities; as I said, the most amazing stuff you can do is mathematical operations, addition and subtraction of vectors. For instance: what do you think is Brexit minus England plus Greece? If you now thought "Grexit", I unfortunately have to disappoint you; the closest match is actually "haircut", however, Grexit is the second closest one. Still very nice, Brexit minus England plus Greece is almost correct. Okay, let's get nasty again: what do you think is Hitler plus Turkey? It is Erdogan. Let's do one final one, because as I said this is the example everyone uses if you have a decent data set: what do you think is king minus man plus woman? In a decent data set that is "queen", no questions asked. But again, we are talking about user comments here, and the answer is "Angela", so apparently she is the queen of Germany. I am not making this up; this actually came out of the network.
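In gensim, the similarity and arithmetic queries above correspond roughly to the most_similar API; the exact tokens below are guesses at what the lower-cased German corpus would contain, not his actual queries:

```python
# Nearest neighbours by cosine similarity.
print(model.wv.most_similar("auto", topn=3))

# Vector arithmetic: king - man + woman (which on this corpus came out as "angela").
print(model.wv.most_similar(positive=["könig", "frau"], negative=["mann"], topn=3))

# Brexit - England + Greece.
print(model.wv.most_similar(positive=["brexit", "griechenland"], negative=["england"], topn=3))
```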
Okay, so let's move on and spend the rest of the time doing machine learning, because so far we have only looked at the word embeddings that came out of this network. Let's go back to our initial hypothesis and say: give me a random comment, and I want to tell you from which news site it originated. This is the pipeline: I start with the raw, slightly pre-processed comment, I chuck it into Doc2Vec, I get the document vector out of it, then I take the document vector, put it into a machine learning classifier, and the classifier is supposed to tell me where the comment was actually made. Is that a sensible task, does it even make sense? It could be super hard, and the truth is it is a very hard task, because most of the comments are just one-liners, something like this one, which could have originated anywhere. In this particular case it actually came from Spiegel Online, but there are lots of these one-liners on all of the news sites.

So how well are we doing? I used a linear classifier, the stochastic gradient descent classifier from scikit-learn, with, as I said, the Doc2Vec embeddings as inputs, and the output should be the class label: Zeit, Spiegel, or Focus. I did not use my entire data set; I used stratified training and test sets so that I had equal amounts of training and test data for all three classes, about 35,000 comments for training and 15,000 comments for testing. The training accuracy is around 60 percent, and the test accuracy, so on novel data, is roughly 50 percent, which is okay; random guessing would be one third, 33 percent. We can also look at more detail, like the confusion matrix. On the training data this looks very nice; on the test data it is a bit more messy, because the classifier uses Zeit as a sort of default class, but there is still some pattern in there. We are still doing about 50 percent, and as I told you, it is a hard task.
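A minimal sketch of that classification step; the SGD classifier is the one he names, while the placeholder arrays stand in for the Doc2Vec document vectors and their source-site labels:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder data: in the real pipeline these are Doc2Vec vectors and the sites they came from.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 100))
y_train = rng.choice(["zeit", "spiegel", "focus"], size=300)
X_test = rng.normal(size=(100, 100))
y_test = rng.choice(["zeit", "spiegel", "focus"], size=100)

clf = SGDClassifier(max_iter=1000, random_state=42)   # linear model trained with SGD
clf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))
print(confusion_matrix(y_test, clf.predict(X_test)))
```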
Now, coming finally back to the hypothesis we formulated at the beginning: in order to look for evidence for it, we can ask the machine learning classifier what it thinks is the most prototypical comment, the one that best represents a particular class. What is the typical Focus comment, what is the typical Spiegel Online comment, what has it learned? We will do a little guessing game: I show you the best-representing comment for one particular class and you have to guess which class it is. I will read the English part, which I tried to translate, though it was really difficult: "Nor is it just too tight for living things beyond man, but ultimately even for ourselves, because as little anthropocentric as it seems, it implies a very narrow idea of life." So this sort of pseudo-intellectual poetry, where did it originate? Any guesses? Yes, this is the best-representing one for Zeit; this is what the classifier thinks is the typical user comment under a Zeit news article, pure Zeit, so to speak.

Moving on to the next one: "The manufacturer does not get the money saved by an old toaster, therefore the breaking point is placed where even the expert needs a special saw." This is the more sensible comment, actually arguing that things break on purpose so that you can sell more stuff, like phones, or here, a toaster. Any guesses which class this represents best? It is Spiegel Online. So this next one is going to be easy. The best-representing comment reads: "These misogynist Muslims understand only a hard hand and have to be deported immediately if they reside here as refugees; this is what Mrs. Merkel got us with her open-borders policy." Yes, and that one, of course, is Focus. You can go through the top 20 or top 30, and they all look like that, trust me, but this is enough of that.

So this is basically my talk; let me do a short recap of what we did. I started by scraping comments from the HTML source code of these three different news outlets. We used this data to train a Doc2Vec network on the user comments I scraped, and we uncovered some interesting semantic relations, such as this one. Then we used the Doc2Vec embeddings as input to a machine learning classifier with some reasonable performance, so it could reasonably well classify where a comment originated. And if we look at the prototypical examples, well, no doubt that the hypothesis holds, at least at this point. So that's it, thank you very much. Any questions?

Hi, thanks for the talk, it's really great. When you showed the training for Doc2Vec, it seemed like the documents and the word vectors live in the same embedding space. Is that true, does it have any meaning? Actually, they don't; you need some concatenation or addition operation, but they have two different weight matrices.

That was great. How come you didn't do more pre-processing, so removing stop words or stemming? What was the motivation? Yeah, that is probably a good idea, because it would reduce the basic vocabulary by a lot. On the other hand, these are user comments, so typos are an intrinsic part of them. If you stemmed that stuff away, you would probably lose some signal in terms of people making errors a lot; there is a shitload of errors in the user comments. But yes, it would probably be nice to try that too.

Nice presentation, thank you. I had a question about this "who is who" part: what kind of function did you use, how did you find which is the best comment for a particular class? I just took all the training documents, put them into the classifier, and picked the one that was rated highest for that class, the one where the machine learning classifier is most certain that it belongs to that class.
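One way to implement the "most prototypical comment" lookup described in that answer, reusing clf and X_train from the classifier sketch above, is to take, for each class, the training document with the highest decision-function score, i.e. the one the classifier is most certain about. This is my interpretation, not code from the talk:

```python
import numpy as np

scores = clf.decision_function(X_train)     # shape (n_documents, n_classes) for a multiclass model
for col, label in enumerate(clf.classes_):
    best = int(np.argmax(scores[:, col]))   # index of the document scored highest for this class
    print(f"{label}: most prototypical training document is index {best}")
```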
Regarding the pre-processing and the errors: do you have any idea how you could encode punctuation? Because that might be really revealing for, say, Zeit versus Focus. Yes, I thought so too, because they do use a lot of exclamation marks, that's true. But I have no idea; the thing is, if you leave the punctuation in, the vocabulary explodes, so it is difficult. I can't give you an answer here; if you have an idea, please let me know. Well, you could also just hand-engineer features, like counting the number of commas per word or so. Actually, no, I don't have any tests with baselines, so this classification task can probably be improved quite a bit with standard methods; if you do tf-idf plus a linear or nonlinear classifier, you would probably get better results. I was just curious how Doc2Vec works and whether we can also use its output as an input to a machine learning classifier, but there are probably tons of methods that would outperform it by a large margin.

Thank you very much for the talk. I wanted to ask about Doc2Vec: in this process of converting documents to vectors, you have all your documents that you do both training and testing on. I was wondering: if you now have a new comment, can you somehow convert it to a vector as well? Yes, you can, and that is actually a very good question, because the test set that I used in the end for the performance measurement was not part of the training set for Doc2Vec; I tried to keep these separate until the very end. What you actually do is you chuck in the different contexts of the new document, and since you know the outputs, you basically do gradient descent on the weight vector for this new document and find the closest representation, with something like eight or ten gradient-descent steps. So yes, you can use completely novel data and compute document embeddings for it too; that is what I did for the test set.
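In gensim, the procedure just described for brand-new comments corresponds to infer_vector, which runs a few gradient-descent steps on a fresh document vector while the trained weights stay frozen. A sketch reusing the model, the preprocess helper, and the classifier from the earlier snippets; the sample comment is made up:

```python
new_comment = "Das WLAN im Bus hat schon wieder nicht funktioniert!"
tokens = preprocess(new_comment)

# Gradient descent on a new document vector only; word vectors and weight matrices stay fixed.
# (Older gensim versions call this parameter `steps` instead of `epochs`.)
new_doc_vector = model.infer_vector(tokens, epochs=10)

# The inferred vector can then be fed to the classifier trained earlier.
predicted_site = clf.predict(new_doc_vector.reshape(1, -1))
print(predicted_site)
```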
Comments are usually in the context of articles; do you also have information about the articles themselves, like the headline or the text, and have you looked at that connection? I haven't looked at that. I actually have the data; I also scraped the title and a short description of each article, but I haven't looked at it yet. And some assumptions are of course violated, because there is a correlation between comments made on the same article, since they at least sometimes try to respond to each other, but I haven't looked at that. It would be fun to generate, say, the most probable Focus response or Zeit comment for some article, or something like that. Yes; I tried to randomly select articles so as not to get a bias by topic, but the topics covered by these news outlets are all slightly different, so some of the results might stem from that. During training and testing I tried to make sure that all the comments in the test set were from articles that had not been used in the training set; I didn't mix these. But still, there might be some signal in there.

Thanks for the talk. My question is: are you planning on offering a Web API for this awesome tool, which you could use for recommending to users that they might have landed on the wrong news site? Unfortunately, no, because I contacted Zeit and Spiegel about it but they never responded, so I wouldn't publish that without their permission. Maybe I can ask them again, but the emails I sent them just entered the void, so maybe they didn't like the idea at all. Good, because they don't look so good in the comment sections.

Have you actually compared your results here with a baseline? No, I don't have one; as I said, tf-idf plus a classifier would probably outperform this by a lot. This was more experimental and not done with scientific rigor, and the hypothesis I formulated is not a very scientific one in the first place. But what we actually did at a hack day at FlixBus was to try this exact setup for classifying emails: when people complain that the Wi-Fi doesn't work, or that they lost something on the bus, the email needs to be classified and sent to the right people to handle it. We tried this and it performed much worse than tf-idf plus a support vector machine. Okay, are there any other questions? If not, let's thank Robert again for a wonderful talk.
Info
Channel: PyData
Views: 35,641
Rating: 4.9152799 out of 5
Keywords:
Id: zFScws0mb7M
Length: 34min 55sec (2095 seconds)
Published: Wed Jul 26 2017