Coding Challenge #40.3: TF-IDF

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to a video coding challenge where I am going to implement in JavaScript an algorithm known as tf-idf what does tf-idf stand for it stands for term frequency inverse document frequency well what okay let's think about what's term frequency well I just did a previous coding challenge where I look red in a text and I counted how many times each word appeared let's take a look at that even though it says tf-idf the top ignore that this is my previous example running the happen up occurred in this text 16 times a 8 times of 7 this by the way excuse me is term frequency how frequent was the term in that document so what's this IDF part and why do we need it well let's think about it this is all of the word frequencies of all the words in the Wikipedia article about rainbows it's an edited a shorter version of that article you can see the word counts aren't very high what if what I wanted to do was extract or automatically through some algorithm determined keywords associated with this article well the first way I could think to do that would be hmm the words that appear most frequent frequently those must be important words for this article so I know what the keywords are there the a of in and an o and PI you can see why this is a problem those are sort of meaningless words now they're quite meaningful in text analysis if you look at James petit Baker's research check this description for a link to book secret life of pronouns that's another topic come back to that another time but in for what I'm doing trying to get keyword extraction those are not meaningful words rainbow however is a meaningful word now if you think about it if I were to look at you know unicorn is probably a bad example but if I were looked to look at the Wikipedia page for mango or for blueberry I probably wouldn't find the word rainbow in it that often so while I would find the words the a of it in so words that might appear frequently in one document but not frequently amongst a corpus of other documents might be unique or important to that particular document and that's what IDF is its inverse document frequency meaning term frequency inverse document frequency meaning a words score equals the term frequency let's say it appears 100 times x inverse document frequency meaning the more frequent it is in other documents the lower the score should be so how do we calculate that well I could say let's say it appears in 10 documents I could just say 1 divided by 10 instead of 10 so 1 divided by the number of documents it appears it and you can see if it's only in one document right I'm going to get a score of 100 or I'm going to get a score of 10 so actually in the technical definition of tf-idf the number 1 doesn't go here but actually the total number of documents so let's say we're actually examining the corpus is 50 documents and then you could see here that you know we're kind of going to get similar if this this is mostly equivalent to score relative but this is the technical definition so 100 times the number of documents divided by the number of documents that appeared in so this now I would get a score of 500 and here I would get a score of 5,000 now this is also not exactly if you look up if you keep reading that Wikipedia page the technical definition of the score of tf-idf there's one step board here which is that this is perhaps weighting it way too strongly in order to reduce the effect that the document count has on the score logarithmic scale is applied so the log function is applied so you could play around with this formula be creative and come up with your own twist on what you're actually looking for but under the technical definition under wicked Wikipedia you'll see that the the log function is applied now why is that applied so how'd and how does logarithmic scale work I'm going to in this video in the description link you to a really great conic a video that describes logarithmic scale in detail or maybe I'll come back and make my own video about that but I'm going to move on and just have you understand it as a way of changing the of reducing the effect that the document frequency has and which can overwhelm the score if you don't apply this logarithmic scale okay so this is the idea of term frequency inverse document frequency instead of now what I want to do is have a program that doesn't just show me the counts of each word but shows me this TF idea score so the question for you is like well what do you why are you using this and what's your corpus so you might think about well let me look at you know every single newspaper up pick a newspaper and every soval these paper articles or a Wikipedia article and all Wikipedia articles you kind of you want a big data set to be able to have this produce meaningful values I'm going to demonstrate this to you with a small data set look at something that works but obviously it'll be a you know a little bit flawed based on the small data set let me just show you the data set really quickly so I'm this is the previous word counting example running verbatim but I'm going to alter the code and instead what I'm going to use is here under files I'm going to use these particular these these text files and these are just text files that are a short paragraphs from Wikipedia like the fish a fish article faith key I don't know where I got this rainbow txt sports txt test and treat txt so I'm just going to pick I'm just going to use tree sports rainbow and fish okay so what I want to add in my sorry what I want to add in this particular example I'm going to go back to the code is I want to create an array of files Eclipse txt fish fish txt rainbow txt and and actually I'm going to do this in sort of a silly way I know I'm going to know rainbow dot exe I'm gonna do that and what was one of the other ones let's try sports dot txt I don't know why I picked these some one point I picked these okay so so these are the files that I'm going to use so I can't just load only these files I need to load all of them so how do I do that oh I want this to be an array I want that raw text to be an array and I'm going to preload all of them so I'm going to say for VAR i equals 0 I is less than files length I plus plus I'm going to say text index I equals load strings files and then plus the files index whoops plus no quotes their files index I so this is me now loading all of those files into an array and then what I want to do here is again all words I need all words not to be a single document I'm loading a bunch of documents and I'm by the way I'm doing this in quick and dirty fashion to demonstrate tf-idf if I were really building this out and I'll show you a different example after to be a bit more thoughtful about how I'm loading all these files and maybe thinking of a more sophisticated data structure to keep track of all this data but this will work for now so I want to this is a good first pass I think so I want to loop through all of the things I loaded and I want to create on all words array and say all words index I join okay so that's good enough for right now because what I want to do is I want to pick what I'm going to do arbitrarily is I'm going to say and let's let's use rainbow and put rainbow first I'm just going to do the tf-idf score for the first article as compared to all the articles again you would want a much larger data set you probably want to have many articles with longer text in them and you might want to build something which allows you to click on different articles and compare that one to all of them but I'm going to make a simple introduction this idea by comparing the first article to all the other articles so now let's just make sure my term frequency is working so instead of splitting all words I want to split all words index zero and I want to still count and count all of those and and visualize that count so everything should work now where I whoa something went a little goofy here because why did I get such higher yeah I mean I'm getting words from others so let's see what did I miss here ah there we go I see the problem this should say txt index high right I don't want I joined all the words from all the articles which I don't want to do I need to still have some way of distinguishing between articles if I was just doing a concordance across all of them that would be fine but I need to keep that just that distinction so let me make sure let me just make this a global variable for make debugging a little bit easier even though I don't necessarily need to and that's so that I can make this an array and get rid of this here so let me run this again we can see here's the raw data from the files and here is that array now we can see yep we can see a hold on all all words I still think I might have a mistake in here dot length no four we're fine all words index zero is rainbow all words index one is Eclipse okay so because either short so this is good I now have all those articles in an array of strings and you can see here they are the counts for the rainbow article again these numbers are really small so my results aren't going to be so great but let's see how it goes so here's the big difference this is an incredibly important point here in my previous videos about word counting we created an associative array each here and was using a JavaScript object the JavaScript object would have a word like rainbow associated with a count like 10 this is no longer going to be good enough because what I want to have now associated with rainbow is to values the terms frequency and the document frequency I need the term frequency and the document frequency so how do I do that if a javascript object has a word associated with a value how can I have a word so a string associated with two values well this is actually quite native to how JavaScript works there's a bunch of ways I could do this for example I could have the key be rainbow and the value be an array with two numbers in it this might be actually a nice way of doing it I'm going to do it in a little bit of a goofier way which I think we'll build later I want to do Bayesian text analysis bayesian text classification and we're going to need to be a bit more thoughtful about this because we're need a lot of numbers with each word so what I'm going to do actually is associate an entire other object with each word so rainbow is the key and the value is term frequency 100 and document frequency you know 1 so I'm going to have the key be a reference to an object itself with two values in it so this is going to be really useful for later text analysis operations where you might want to have a lot of metadata I mean what if I want to have rainbows definition or a you know a link to the word Nick page that so that you have a lot of metadata associated with each word if I were to build out this object okay so here we come back so I want to modify the way that this works right whenever I encounter a new word I want to actually not just set the value to one but I want to create an object with a term frequency a term frequency of 1 and a document frequency of well let's not do the document frequency yet I think so I want to put it in it I want to document frequencies to go in there but I want to make an object with term frequency of one okay so this program now that's what now this will fail because it that the count is no longer in there you can see there's this object here so what I need to do if I want to visualize this again is say counts index key dot TF because those numbers are now in the object associated with that word in the variable TF Oh undefined interesting oh why is that well here we go because I need to increase the term frequency here right not just increase the value by one okay so now now we're back to where we started ah sorting is also broken because I'm comparing the term frequencies and you can see now we're back to where we started so now I have an object and what I can actually do is figure out document frequency now okay so let's add that in so what I want to do and let's actually let's count it you know what um there's a good bunch of different ways we could do this so I'll try to think what makes the most sense I'm gonna leave it as is and what I'm going to do now is say okay once I'm done with the term frequency I need to look at every single word and count how many times appears in other documents okay let's figure this out oh this is going to be we could do this wow this is hard this is hard so I know that I need to loop through all the keys so let's first loop through all the keys right all right let's I don't need to worry about sorting I'm going to loop through all the keys so my word is keys index I so now what I need to do is loop through all of the files who all I'm going to say all txt equals all words length whoops ah I J a J equals 0 J is less than all words dot length right all words is the array I need to for every word I need to look at all the other documents and see if it's in there so I just need to I didn't do the counts I just need to know does it even a in that document I could do the counts for like how often does it appear in those other documents but I'm going to do it in a simpler way okay so for each one of those what do I need to do I want to check actually you know what mm-hmm let's do this a different way no this will work okay this will work so I need to do like a mini little concordance right here for all of those so I'm going to just say a temp temp counts equals an empty object okay then I want to split up and how did I split before what regular expression do I use this all words index I I want to split it up into tokens oh my god do I really have another loop here VAR k equals 0 I'm sure there's like a way to simplify this that is unnecessary I don't think yeah there's a I'm going to be able to make this more efficient but I'm going to keep going with this I don't think I need the outer loop I'm thinking about this but let's keep going k is less than tokens dot length okay I want to go through all of the tokens and if temp counts tokens index I is undefined yeah I just wanted to there's a better way to do this I can already think of it ah I'm running out of space here why is it word wrapping me I don't like that if ten counts tokens I is undefined then I want to just give it a value of one like it exists it's in that document so I want to get a unique list of all the words in that document and once I've done that I can just say if temp counts word exists equals one then word dot document frequency plus plus and by the way here I should give it a document frequency of zero so I'm not going to count this particular instance as a dot document frequency because I happen to be checking the same document again let's let's be let's be better about this let's start this as one so I give it a it's in this document and then I'm going to only go through I'm going to skip that first one in the array so there's less computation here so then I can and let's let's make a word a variable tokens index I because this will make the code more readable and that way okay so and I this is wrong this would be counts in index word D F plus plus whoa okay I kind of wrote all this out I'm probably made some mistakes I think I could I should just doing the concordance every single time for every word which is ridiculous I should do the concordance for the other documents once and then check all the words but I'm going to fix that in a moment but let's even just see if this works this is what happens when you program you're kind of a mess and it's just sort of like you're trying to figure it out so let's just read this through again for every single word in the article that I'm trying to find the keywords for I want to look at all the other articles and quickly do a quick concordance by looking at all those words and then once I've done that quick concordance I want to see if it's in there if it was in there I would increase its document frequency by one and once I'm done with that let's run this program okay I have it error sketched out J s line 35 oh no wonder I equals zero I is less than keys dot length okay look this is taking a very long time I probably have an infinite loop somewhere in this code I probably have locked up the browser right can i scroll can I select anything no so I need to kill this page I'm going to close this page and go back and look so let's let's debug this together ah this should be K by the way because I'm looking at all the tokens and where could the loop have gone wrong ah here we go that should be that used to be a less than I had equals I had an infinite loop stuck in there I'm gonna kill this page okay let's come back run this again but still got a problem let's look at this some more infinite loop anywhere okay I'm back I think there was actually a caching issue when I had that infinite loop so I now I just restarted the the webserver that I'm running and so I now have this running again but I still don't know if I fix the problem okay can I tree property split of undefined line forty why don't I have undef' is uh why is that Oh J this should be J that's really important that has to be J it's part of that loop cough hat okay things are still good and now this is where I'm doing the quick little concordance things are still working I don't know what went wrong there but I think I just had the infinite loop and the infinite loop the browser couldn't like not get out of the cache of that or whatever so let me look at counting now and we can see look at this a is in one document with a term frequency of 8 above is in one document the term frequency of 1 now this doesn't really seem right to me I expect that I should have some document frequencies of more than one okay so let's see if we can figure out if the document frequency isn't working properly so one thing is I should definitely add this to lowercase thing let's see if that helps and I'm going to say counts again and you know a we should definitely a should be in more than one document and just to be sure about that let's go to the files eclipse and I'm going to add a a a just to be sure about that I'm going to add a couple instances of a just in case it it wasn't and now I'm going to refresh this and look at the counts and then we go to a and the document frequency is still 1 so there is a mistake here let's see if we can figure out what I did wrong in the actual code so now I'm going back to sketch J s so let's look at let's look at all these temp counts so as I run this let's look at all these little mini hash tables I created for each other document 1 true - true 1 true - so this this looks right a is true so I'm definitely getting why I put the value is true oh I said true and then I checked if it's equal to 1 so that has to match of course I want to know why I put true to keep track of it's in the document and so I could just check if that evaluates to true or false so that that's clear the problem and I can I can take out this console dot log and I can look at the counts again and I can now say a document frequency is still 1 okay so that doesn't seem like it makes sense oh boy I see the problem it's very obvious which is that I used word to keep track of the word that I'm currently trying to determine the document frequency and then later in the inner loop I kind of like I made a different word variable and in JavaScript you can't really have these like local variables they clash everything and this is this is a total mess I can't have declare a new variable in a sort of like local or more local scope than the then another variable with the same name so actually I think I just want to get rid of the use of I'm just going to I'm going to call this W that'll solve this problem so I'm just going to use W down here so that when I'm down here I'm referring to the correct word now come on everybody are you with me let's look at the counts and let's look at a and we've got a document frequency is four this is excellent news so now we have term frequency being calculated and document frequency being calculated now let's let's make an improvement here this is really I hate to make this video any longer than it is but this is really bothering me that I have I'm redoing the concordance of all those documents for every word which is pretty much ridiculous if I don't say so myself so I'm going to take this out this this loop out and do it once in advance right I want to look through all the other files I'm going to say other Phi other counts is an array right of all the other counts I'm going to create this temporary counts dictionary and then when I'm done creating that temporary counts dictionary I'm going to put it in other counts okay so now I've done it just once I have a nice array of all of the all of the dictionaries all the unique words in the other documents so now when I go through this list of keys I don't have to realize still have to look at all the other concordance --is the other counts length but I don't have to redo each one which was kind of absurd and I can say now and I hope I can say now temp counts equals other counts index J so now for every word I can see if it's in the other one so I'm just doing all of these concordance as once and then checking them here so hopefully that fixes this still works up I got a line 77 syntax error which is Oh an extra I'm out of I did too many brackets yeah I Clos into too many brackets and now let's look at counts and let's hope that I'm still seeing the right yes so I just made this enormous ly more efficient which is a really good thing thankfully okay now we can actually now I can actually do the the tf-idf score so the next thing I can do is once again look at all the keys and I can calculate the score so how do I do that I can say for I can say give me my I can say give me the TF the the word object and trying to think of giv I'm just gonna say value is a counts and this should really be TF i should change this variable name from counts counts word actually I'm just gonna write oh this is so hard to think of good variable names on the fly but Val is like the worst right so I'm just going to say word up word data that's not much better but the word date is all this stuff the term frequency term frequency and document frequency that's what's in there right now so now term frequency inverse document frequency equals term frequency document frequency TF right I want the term frequency from that object times I need to make it inverse so what are the total number of files I want to say files dot length divided by term the document frequency now by definition I this would be a problem if I could have a document frequency of zero but I started all document frequencies at one so I this shouldn't be a problem so I can put this in parenthesis and then take the logarithm of that and I've done the formula and then I can say why is my text wrapping I don't like that and then I can say T fdf tf-idf equals that score and really what I want to do is just this I want to put the score in that object and I don't like the name of this variable I'm just going to call it word object because it stores all the stuff even though it makes me lose my wraps that line again so here we go I want it at tf-idf score the tf-idf score is the term frequency times the logarithm of the total files divided by the work that the document frequency and now what do I need to do I can just go back down here and say hey guess what remember when I sorted that array I don't want to sort by the term frequency I want to sort by that inverse document frequency score and I want to display that tf-idf score and let's see what happens with a rainbow's txt if I run this you can see look at this perfect arc rainbow light cause so we're getting higher numbers for words that are associated with rainbow and a little hack that I can do just to test something else is I can you know let's try sports txt instead if whatever I put first in this array will be checked or conduct unsportsmanlike games such so again I'm not getting or is a bit of an anomaly here because I have such a small data set so or appears very frequently here but if I had a larger data set I would get more accurate results okay so what can you do with this this is a bit of a mess of a video but I have now working code awfully you learn something or you got something out of it couple things I'll mention one is I have a more thoughtfully designed example which again I'll link to in this video but you can find on the A to Z website where if I go here I can actually drag a whole bunch of files in drag and drop them and it'll do tf-idf for all of them you can see what it does it actually makes a bunch of clickable links so you know I could click on let's say if I click on Fish txt I can see now I'm getting I think it's like I don't know what the weird let's look at test at txt or rainbow txt we got we getting the same results us I did something different with the math formula of the exam and that but you can see I'm getting basically the same result so you can look at this example if you want a more thoughtfully put together example that allows you to try a bunch of different files and drag and drop them but ultimately I would say to you is number one what's your corpus of text what's a unique and interesting corpus of text that's meaningful to you that you might want to try and number two is how might you present these results in a more creative fashion you know what happens if you did this to all of your emails certain emails one email versus all of your other emails that's a corpus or and how might you visualize that information with color with shapes with drawing or simply in some interactive way to allow a user to play around this idea so if you make anything try anything let me know and I would love love to love to see it so write in the comments or share with me on Twitter at Schiffman and see you soon you
Info
Channel: The Coding Train
Views: 53,594
Rating: undefined out of 5
Keywords: counter, word counter regex, patreon, creative coding, coding challenge, javascript (programming language), daniel shiffman, tutorial, programming challenge, text analysis, programming from a to z, data and apis, coding, word counting, word counter processing, programming, word, data, string object, introduction regex, intro regex, regular expressions, tf-idf, Term Frequency–Inverse Document Frequency, tfidf, term frequency, inverse document frequency, keyword generation
Id: RPMYV-eb6lI
Channel Id: undefined
Length: 32min 2sec (1922 seconds)
Published: Wed Oct 12 2016
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.