Intro to NLP with spaCy (1): Detecting programming languages | Episode 1: Data exploration

Captions
Hi, my name is Vincent. I'm an algorithm guy from Amsterdam, and in this series of videos my goal is to explain what it's like to use spaCy. It's a fun tool, but it might be helpful to have a couple of videos showing you what it's like to learn and to use it. Now, to do that, I could talk about the API and that sort of thing, but I think it's more useful if I take a problem and see if I can hack at it. A fun one, I think, is to find the use of programming languages in sentences. For example, if I have the sentence "I like to program in Python", then I would like to say: Python, that's a programming language, I want to detect that. Another example could be "I'd like to use concurrency in Go", where Go is the programming language. You can imagine this is a relevant problem: if you're, say, a recruitment agency, automatically grabbing the different programming languages from text is useful, and if you're Stack Overflow, detecting them is also super useful. So it's a cute problem that's also relevant. What I'll do in this video is get started with spaCy, but to show you the power of it I'll first grab a dataset, then use just basic Python to see how far we get and what the problem is that spaCy wants to solve. Then I'll use spaCy, we'll try a couple of approaches, benchmark those approaches, and set the stage for the next video, because this video is part of a series.

So let's fetch something; let's start with that. The cool thing in this particular case is that I don't have to scrape any websites: this data is readily available. The kind people over at Stack Overflow have open-sourced some of their datasets, and they're hosted on Kaggle, so it's pretty easy to download. There are a few datasets you can have, and it's pretty fun to have a go at them, but the dataset that seems most interesting to me is this "stacksample" dataset. From three years ago, I believe, it's about ten percent of all the questions they had, and you get three files: the tags associated with a question, the answers to a question, and the actual question itself. It's a pretty big download; if you press the download button you'll get a zip file. This is the dataset I'll be using in this video, so if you want to program along you can just go to the Kaggle website and find it.

So what I've done is download the dataset and unzip it, which resulted in these CSV files: my answers file, my questions file, and my tags file. As a getting-started bit, I've loaded this library called pandas. Pandas is a pretty useful tool for parsing these CSV files; I won't be using it that much in this video, but for merging and joining and that sort of thing it's quite useful. To get started, I've said: hey pandas, just read the CSV, and for now we'll only concern ourselves with the first 1 million rows. I'm only interested in two columns initially: I might use this Id column later to do some joining, and for now I'm mainly interested in the title of the question. When you're loading this in, you do want to be careful: please use this encoding, because without it you're going to get weird errors once in a while. And what I've done here is turn the data frame, once it's been read in, into a list, and we'll use that list from now on to explore what we have.

Let's just use the random library to get some impressions of what we have here. I always like to at least peek around before I actually do something, so let's get 20 random titles. There are a few things in here: I see that there are some tools around, I see WebKit, I see Android, I see jQuery.
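The loading step just described can be sketched as follows. This is a minimal sketch, assuming the stacksample dump's Questions.csv with its Id and Title columns; a tiny in-memory CSV stands in for the real file so the snippet runs anywhere.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the Kaggle stacksample "Questions.csv"; with the
# real file you would also pass nrows=1_000_000 to cap the read.
csv_data = StringIO(
    "Id,Title\n"
    "1,How to sort a list in Python\n"
    "2,Concurrency patterns in Go\n"
)

# encoding="ISO-8859-1" avoids the occasional decoding errors that the
# real dump produces; usecols keeps only the two columns we care about.
df = pd.read_csv(csv_data, usecols=["Id", "Title"], encoding="ISO-8859-1")

# Turn the titles into a plain Python list for exploration.
titles = list(df["Title"])
print(titles)
```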
But I also see, for example, JavaScript here, and this is pretty cool, right? JavaScript is fine, but that's exactly the kind of thing I eventually want to detect. For now, though, I think it's okay if I constrain myself just a little bit: instead of going for all the programming languages out there, how about I constrain myself to looking for one of them? I think a fun one to look for is Go; there are a couple of reasons for that, and we'll see why in a moment. So I'll try to write some Python code now and see if I can detect whether a question is about the Go language.

With that said, let's clean up a bit. We had our titles, and I guess a simple way of doing this is to just use basic string magic. So I'll write a function has_golang: does the text have something about Go in it? Some text goes in, and all I want to do is type something like: if "go" in text, return True. I can write it as a one-liner. And what I'll do now is a small generator trick: I'll make a generator over all of these documents, (title for title in titles if has_golang(title)), and the nice thing about this trick is that I can call next on that generator to get, say, the first two of them. This way I don't have to go over all the titles, because it will just give me two instances. When I run this, I immediately see a bit of a problem. "Go", of course, is the text I'm interested in, but "go" is unfortunately a substring of lots and lots of words; the substring "go" is also in "good", so that's no good. One way to maybe circumvent this is to recognize that you might be able to do something more: maybe I can add a space after it, so that at least I know there's a word boundary, that it might be a word that follows after.
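A sketch of that substring approach and the generator trick, with a few hypothetical titles chosen to show exactly how it breaks:

```python
from itertools import islice

# Hypothetical titles illustrating the failure modes discussed above.
titles = [
    "Why does Go compile so fast",        # the match we actually want
    "Best algorithm for sorting",         # "go" hides inside "algorithm"
    "Django queryset filtering",          # ...and inside "Django"
    "Where does console.writeline go",    # "go" as a verb
]

def has_golang(text):
    # Naive substring check -- the approach that breaks down.
    return "go" in text.lower()

# Generator trick: lazily filter, then pull just the first two matches
# with islice instead of scanning every title up front.
g = (title for title in titles if has_golang(title))
print(list(islice(g, 2)))
```

Every one of these titles matches, which is precisely the problem.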
Even that is not great, because then you get stuff like Django. Django is of course a very popular programming framework, nothing wrong with Django, but "go " is still a substring of it. I could go a step further and plug in another space in front of it, but then you get this other natural problem: the word "go" is very often a verb. For example, "where does console.writeline go" would also be invalid for the programming language Go. I could go a step further still: I could maybe accommodate for capitalization, and I could write a very proper regex, which would probably be a fine thing to try. But this is one of those instances where you can clearly say: basic string matching is just not going to help us. There is some inherent meaning to "go" as a token, and that's not being accounted for when I'm doing mere string matching. So I need to go one step further.

What I'll do now is demonstrate what spaCy might be able to do. I'll take a small detour to explain spaCy a bit, and once we've done that we can go back to our original problem. Let's first start by importing spacy; I've downloaded it beforehand. What I can do is type nlp = spacy.load(...), and the idea here is that I give it the name of a model I would like to preload. spaCy comes with pretrained language models, and I'll load one of them and save it into this nlp variable. The name of the model I'm going to use is en_core_web_sm, which is short for "English core web small": the "sm" stands for small, and "web" stands for the corpus it was trained on. If you run this and get some sort of error, it might make sense to make sure you've run the download command first, so just make sure you run that command before running this code, otherwise you might get an error. But if you've downloaded it beforehand, everything should be fine.

So now I've got this nlp object, and there's some stuff in here; let's see. I'll give it a sentence: "my name is vincent". When I read the output, it almost looks as if a string is being printed, but that's not the whole story. If I check the type, I can confirm that the object I'm seeing here is indeed a document, a spaCy Doc, so it's a special type of object. One way to confirm that this object is special is by realizing that you can loop over it: if I make a small list comprehension here, you'll notice that this nlp output is a collection, and every word in the text seems to have an associated member in it. Let's zoom in on that a little bit. I'll make a variable doc = nlp("my name is vincent"), and then I can query the first element, which is "my", and assign it to a variable. Just to confirm that this document contains tokens, we can check the type, and we indeed see that these are Token objects. So this Doc is indeed a collection of sorts, but it is a collection of tokens.

The cool thing about these tokens is that they have lots and lots of properties. If I press dot-tab here and have a bit of a look, you're going to see all of these properties being estimated on our behalf. Some of these properties are the result of a statistical language model, but they're all properties that we don't have to generate ourselves; the language model gives them to us. Another way of experiencing what spaCy does, beyond the properties on a token, is to look at some of its visualization tools. One of those tools is displaCy.
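As a sketch of these object types: the video loads the pretrained en_core_web_sm model, which first needs `python -m spacy download en_core_web_sm`. Below, a blank English pipeline is used instead, purely so the snippet runs without that download; it tokenizes the same way and is enough to show the Doc/Token structure, though it won't fill in part-of-speech tags.

```python
import spacy
from spacy.tokens import Doc, Token

# The video uses the pretrained model:
#   nlp = spacy.load("en_core_web_sm")
# A blank English pipeline skips the statistical components, so it
# needs no download but still demonstrates the types:
nlp = spacy.blank("en")

doc = nlp("my name is vincent")
print(type(doc))              # it prints like a string, but it's a Doc

tokens = [t for t in doc]     # a Doc is a collection of Token objects
print(tokens)

token = doc[0]                # the first token: "my"
print(type(token))
```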
The way this works is you can just type displacy.render and pass it a document, so I could just grab this doc right here. What you're looking at is the grammatical dependency graph of the document we've given it. You can see, for example, "my name is Vincent" in the sentence, and it's able to detect whether a word is a noun or a verb; it's even able to recognize whether something is possessive. Now, you might come across lots of abbreviations that you may or may not be familiar with, especially if English is your second language; I can imagine you might need to be reminded what "attr" or "poss" means. Just as a helper, know that spaCy comes with this spacy.explain command, where you can put in something that you see here, and in this case you can see that "poss" indeed stands for a possession modifier, which makes sense here because "my" is possessive. The properties you can see here aren't just available in a visualization; they're also available from the spaCy library. If I make a very simple for loop, for t in doc, and print the token along with some of these properties, whether something is a noun or a verb (that is known as part of speech, t.pos_) and the dependency structure (which you can access by calling t.dep_), you can see that these properties are also available to us in this code as well.

Let's now consider why this might be useful, and to do that I'll grab an instance of a problem that went wrong. I'll type nlp(...) and put in the title of a question where we saw a mistake. With the basic string matching we might recognize "go" here as a programming language, although it's a verb; but what spaCy now offers us is a pretty good estimation. There might be some errors here and there.
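spacy.explain is just a lookup table, so this part works even without a model loaded; a small sketch:

```python
import spacy

# Decode the abbreviated labels that displacy.render shows; this is a
# plain dictionary lookup, no model required.
print(spacy.explain("poss"))   # the label on "my" in "my name is ..."
print(spacy.explain("pobj"))   # a dependency label we meet again later

# With a pretrained model loaded, the same labels appear on tokens:
#   for t in nlp("my name is vincent"):
#       print(t, t.pos_, t.dep_)
```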
But for the word "go" here I do seem to be getting a property that says it's being used as a verb. I can already imagine that if I want to detect Go the programming language, knowing that it's not a verb can already be quite useful, so I can use some of these properties to write logic, and that is what I'll do now. One thing I do want to briefly mention about these properties: if you go to the spaCy documentation, then to Usage and then Linguistic Features, you'll get to a page that gives a lot more detail and also demonstrates lots and lots of properties that I will not discuss in this video. The documentation is great, it's pretty clear, and I can highly recommend having a look if you're interested in knowing more. So, noting that the documentation is out there, let's now go back and write code to attempt to detect the Go language.

Back to the problem at hand: what I'll do now is improve the code we had before, making sure it makes use of spaCy. One extra thing I did is mainly a performance thing: I added a little bit of pandas logic to make sure I'm only looking at titles that actually have the string "go" in them. This is mainly for performance; pandas is just a little bit faster than plain Python here because of vectorization, among other reasons. Note that I've changed the code a little, but I'll focus now on how to change this function to make this bit of code work for us again. The first thing I have to do: text goes in, and if I want spaCy to do anything I first have to change it into a document, so I'll say nlp(text), which turns it into a document. Next I loop over all the tokens in that document, and there I can perform some logic. The first property I'd like to use is called .lower_, which basically takes the token and makes its text lowercase, and I would argue I'd like that to be either "go" or "golang"; golang is also a common way of referring to the Go language, so I'll just go for both. Another thing I'm interested in: I want to know for sure that the part of speech is not a verb. When these things hold, I'd be pretty comfortable returning True, and I'd like to get False in other situations.

I've changed this logic; let's now see what comes out. It's better: the verbs definitely seem to be gone, and I actually get something I would argue is a hit here, "embedding instead of inheritance in go". That's definitely about the programming language, so I like to see this. But it's not quite perfect yet, and I see two main reasons. There's one example here with a "go button", where "go" is sort of describing this noun, so one thought I have is that maybe I should be more strict and require "go" or "golang" to be a noun in the sentence. But another thing I see happening is "in one go"; you know, that's the saying, so I'm wondering what is happening there and why it's going wrong. A common thing, at least something that I like to do, is to take a couple of examples that work well and a couple that don't, and just chuck them into displaCy, the display tool; maybe I can see some sort of pattern by doing that. So let's take an example that works well: displacy.render on this one. Okay, so "go" is indeed correctly detected, it's not a verb, and it seems to have a relationship with the previous word. Let's peek with spacy.explain: the dependency structure of this token is that it's an "object of preposition". I'll check this against an example where it's not going well, just to confirm; you know, I'm not the biggest grammar
expert, but I can imagine that might be a beneficial thing to check. All right, so there's "go", and okay, it seems to have an auxiliary relationship there. So maybe, if I change this logic a little such that it also takes the dependency into account, I might be able to get more accurate things coming out of this. Let's build that. Building it should be pretty easy, because the only thing I have to do is make sure that "object of preposition" is also the dependency structure, and only then return True. Let's see what pops out of this. Okay, this seems to be working: the first five examples I get are all correct, so this is a good position to be in.

Now, just as a mental habit: it's very tempting right now to just immediately push through, but personally I always find that a moment where I'm definitely on to something is a great point in time to start formalizing. This code works right now, but what I probably want to do next is fiddle around a bit, do some benchmarks, try to figure out what works well, and see if I can quantify that. So it might be good to change this code up a little, such that it's a bit more performant and we have some nice usage patterns. The first thing I want to make use of is a pattern in spaCy called nlp.pipe. Here's the idea: if you pass a big list of texts to this .pipe method, the documents can be handled a bit more efficiently, because the internals of spaCy can help you out there. Let's do a benchmark to show that this indeed helps. What I'll do is just add a timer here; not the best benchmark, right, but good enough.
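The predicate as it now stands, token text "go" or "golang", part of speech not VERB, dependency pobj, can be sketched in isolation. Real spaCy tokens carry .lower_, .pos_ and .dep_; a tiny stand-in token with hand-written, hypothetical parses is used here so the logic can be exercised without loading a model.

```python
from collections import namedtuple

# Stand-in for a spaCy Token; a real doc from en_core_web_sm provides
# the same attributes.
Tok = namedtuple("Tok", ["lower_", "pos_", "dep_"])

def has_golang(doc):
    for t in doc:
        if (t.lower_ in ("go", "golang")
                and t.pos_ != "VERB"
                and t.dep_ == "pobj"):       # object of preposition
            return True
    return False

# Hand-written, hypothetical parses for illustration:
hit = [Tok("embedding", "NOUN", "nsubj"),
       Tok("in", "ADP", "prep"),
       Tok("go", "PROPN", "pobj")]           # "embedding ... in go"
miss = [Tok("where", "ADV", "advmod"),
        Tok("does", "AUX", "aux"),
        Tok("go", "VERB", "ROOT")]           # "where does ... go"

print(has_golang(hit), has_golang(miss))
```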
Let's just run this for the first batch of titles. I'll run the timing without nlp.pipe first, and then we can compare. I got some results out of this, and it seems to take about 25 seconds. So let's now use nlp.pipe, remembering it was 25 seconds before. I'll just put nlp.pipe here; the idea behind nlp.pipe is that we should get more performance out of it, because the internals of spaCy are able to do work on our behalf. The one thing we have to remind ourselves of, though, is that this function will now no longer receive a title: it will receive a document. So let's be consistent with that; I'll take this out and make sure we pass it a proper document. Again, we were at about 25 seconds; let's now run this. This was done in about five and a half seconds, so that's definitely a substantial improvement.

There's one extra thing we could do, and it has to do with this model we've got here, because it's a bit under the hood. What this model is doing is not just deciding whether or not something is a verb: it's also doing tokenization, and lemmatization, and all sorts of stuff we might be interested in. That's great if you actually use all of that, but at the moment there are parts we're not using, in particular the "ner" component, which is for named entity recognition. We'll get to that in a later video, but for now I just want to show that if you turn it off, by saying this is a part of the pipeline I want to disable, the whole thing runs a lot faster as well. Let's run this and see what the difference is. Okay, so now it's less than four seconds. I'll gladly take that; definitely another improvement in terms of speed. As a next step, I'll write a little bit of code that lets me do some logging and play around.
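The batching pattern can be sketched like this. A blank pipeline stands in for en_core_web_sm so the snippet runs without the model download; with the real model you would load it as `spacy.load("en_core_web_sm", disable=["ner"])` to skip named entity recognition, as in the video.

```python
import time

import spacy

nlp = spacy.blank("en")  # stand-in; the video uses en_core_web_sm

titles = ["why does go compile so fast"] * 1000

# One at a time: call nlp(text) per title.
tic = time.time()
docs_slow = [nlp(t) for t in titles]
print(f"one by one: {time.time() - tic:.3f}s")

# Batched: nlp.pipe lets spaCy process the texts more efficiently.
# Note that downstream code must now accept a Doc, not a string.
tic = time.time()
docs_fast = list(nlp.pipe(titles))
print(f"nlp.pipe:   {time.time() - tic:.3f}s")
```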
I don't think I'll ever change this predicate much, but I'm definitely interested in calculating, say, the precision and the recall as I fiddle around with this and with that. I just want to get an idea: if this were a proper proof of concept, could I quantify my results such that I'd be able to argue for keeping on with this idea, or maybe switching gears and trying something else? I think I'm on to something, but let's now use this faster code to do some benchmarks. The way I think we should do these benchmarks, and it's not going to be perfect, but we do want numbers that mean something, is as follows. From the datasets I downloaded originally, I have this Tags.csv that I can join with Questions.csv, and the idea is that the tag is able to confirm to me that a question is about Go. Now, of course, a question can be about Go without the programming language being mentioned in the title. But imagine that I take all the questions that are tagged as Go, and I look at all the instances where "go" appears in the text: if I can quite accurately pick out that overlap from the entire set, then I have a pretty good rule. What I don't want is things like "in one go" coming out. If I just query for whether the substring "go" appears in the text, I get lots and lots of candidates that I'm not interested in; but if the Tags.csv file confirms that a title with "go" in the text is actually about Go, then I'm willing to say, as a proxy, that's something I can use to do some benchmarks. So I'll go and write this now.

All right, so again there's some pandas coding here, and I apologize if you're a bit unfamiliar with it, but I'll just go through the logic I've written. It's not perfect, but it's a pretty good proxy. What I've got here are the tags: a data frame that says, okay, here's a question Id, here's an associated tag. What I'm able to do is say, okay, here are the Ids that would be a match, the Ids that match the Go language. Then I can go ahead and fetch all the Go sentences; these would be all the titles that actually are about Go, though you have to imagine that some of these mention tools more than the language itself. Anything that I will argue is "detectable" goes through this NLP pipeline, where at least I can confirm that there's a "go" or "golang" token in there: if the tags file says it has a Go tag and I can detect "go" or "golang" in it, that's not going to be perfect, but I do believe it's something I should be able to detect. There's also a set that's not detectable. I'm doing a sort of performance trick in here, but the main thing that's important is that we recognize these are all strings that do contain "go" as a token but are not tagged Go; these would be questions that might have the token, but should not be of the class that is actually about the Go language. When I run this entire thing, I get some nice numbers that help me understand what we're talking about. It seems I should be able to detect about 1,200 titles where Go has been confirmed both by the tag and as a token; that's a nice number, because if this were in the hundreds the number might not have been so convincing. There are also a few titles where "go" is in there but the Go tag is not, and there are also some questions, about 600 I'd argue, that are associated with the Go language but don't contain the programming language in the question title. So this is good enough: I have a set of titles I'm interested in.
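The set construction just described can be sketched with toy stand-ins for Tags.csv and Questions.csv, using the same Id/Tag/Title column names as the real dump but hypothetical rows:

```python
import pandas as pd

# Toy stand-ins for the Kaggle stacksample files.
tags = pd.DataFrame({"Id": [1, 2, 3], "Tag": ["go", "python", "go"]})
questions = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": ["Concurrency in Go",
              "Sorting a list in Python",
              "Embedding instead of inheritance in Go",
              "Submit a form in one go"],
})

# Ids that the tag file confirms are about the Go language.
go_ids = set(tags.loc[tags["Tag"] == "go", "Id"])

# Candidates: every title containing the substring "go" at all...
cand = questions[questions["Title"].str.lower().str.contains("go")]

# ...split into what should be detectable (tagged go) and what should
# NOT be matched (has the substring, but is not about the language).
detectable = set(cand.loc[cand["Id"].isin(go_ids), "Title"])
not_go = set(cand.loc[~cand["Id"].isin(go_ids), "Title"])
print(detectable)
print(not_go)
```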
And I've got the set of the ones I'm afraid of. With these sets defined, the code I whipped up real quick to do the logging looks like this. I have this function called has_go_token; it takes a document, which will be supplied by this model, and I've got some sort of method name here. The idea is that in this part of the code block I'm defining a bunch of stuff, and below it I'm calculating a bunch of stuff, where the final line is the thing that gets logged. What I'm calculating is: I take the model I've defined up here; I have the set of things I deemed detectable, and I can tally how many of those come out True, those are the correct ones; and I do the same for the other set, counting where it goes wrong. From that I can calculate the precision, the recall, and the accuracy. I'm aware that these are of course heuristic-based, but these ballpark numbers should help me understand whether or not I'm on to something, and that's the goal at this stage. And of course, with all of this set up, I could use a different spaCy model, say the medium-sized one or the large one, and I could change the rule, for example saying it doesn't just have to be not-a-verb but has to be exactly equal to a noun. Every time I do this, the method and the model I'm using get logged. So the goal is to just run a big benchmark now and see what results come out.

You can imagine this took a while, but now I have this very cool little table with results. One thing I did just for fun and kicks, which to be honest is actually kind of a useful thing to do, is run a benchmark to confirm that if you do something stupid, something stupid
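The bookkeeping that gets logged, precision, recall, and accuracy over the two sets, can be sketched as plain Python. The predicate and data below are hypothetical stand-ins for the spaCy-based has_golang and the real benchmark sets.

```python
def score(predicate, detectable, not_go):
    # detectable: titles the tags file confirms are about Go.
    # not_go:     titles with a "go" substring that are NOT about Go.
    tp = sum(1 for t in detectable if predicate(t))   # correctly found
    fn = len(detectable) - tp                         # missed
    fp = sum(1 for t in not_go if predicate(t))       # wrongly flagged
    tn = len(not_go) - fp                             # correctly skipped
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy

# Hypothetical rule and data, standing in for the real pipeline:
pred = lambda title: "in go" in title.lower()
detectable = {"Concurrency in Go", "Error handling in Go", "Why use golang"}
not_go = {"Submit a form in one go", "Go to definition shortcut"}

print(score(pred, detectable, not_go))
```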
comes out. Usually you want "go" not to be a verb, so I figured: how about I run one instance where "go" actually must be a verb? Those are these three rows here, going over the different models, and one thing you can clearly see is that precision, recall, and accuracy are all really, really low. Then there are some other approaches. One thing I tried is: I do want this "object of preposition" dependency to be in there, and I don't want "go" to be a verb. That gives really, really high precision: if I say that you are the Go language, I'm really certain of it, but I only get about 35% of all the possible ones. So the precision is really good, but the recall is kind of abysmal. Then I tried: what happens if I don't care too much about the dependency? Then you see that the precision does take a bit of a hit, but at least for the small model the recall seems to be a bit better. And if I compare that to being more precise, saying it has to be a noun, it's not the worst, but I essentially get about the same. So you can compare lots of numbers, but the one that stands out for me right now, and I should stress this is a manual rule we wrote, is this one: about 90% accuracy and about 70% recall or so. It does depend on the use case a little, but this is looking promising. As far as a proof of concept goes, I would argue that spaCy is indeed doing something useful here: it's something I programmed in maybe a few hours, and I can already see some places where, if I iterate on this just a bit further, I might get something that would be useful in a production setting.
So, what have we done? We started by wondering about a problem, and then we got ourselves a dataset to see if it was solvable. We very quickly came to the conclusion that language is more than just a string of text, and with that in mind I was able to show some good use cases for spaCy: we demonstrated some basic usage, got some results, and benchmarked them. And that's the point where I am now. It should be mentioned, though, that we're of course not done; this is just the beginning. What I want to do next is use my current approach to aid in labeling for programming languages and tools in general. I would genuinely be interested in, say, taking a resume and being able to parse the tools that people are using. Once I have those labels, and I still have to do some labeling there, I might train a model to recognize a programming language as an entity. And once I'm there, I would like to benchmark this against data that isn't from Stack Overflow, to further stress-test my application; I should of course remember that "Python" can also be a snake, and "Java" can also be an island, or coffee. So you will see another video where I take the approach I currently have and expand on it.

The takeaways, at least from my side: do recognize that language is more than just a string of text. Also, even though spaCy is not 100 percent accurate, it still allows me to easily build rule-based systems that I would not be able to write with simple text matching; it really feels nice to be able to build a regex not on top of text but on top of the meanings of tokens, and that's quite liberating. And whenever you're doing something, it's always good to remind yourself that you're trying to solve a problem first and foremost, so it makes sense to get a pipeline working first and take your time to clean up your code once you've hit a milestone. With that settled: definitely give spaCy a try, and I will see you next time.
Info
Channel: Explosion
Views: 49,318
Rating: 4.9389977 out of 5
Keywords: artificial intelligence, ai, machine learning, spacy, natural language processing, nlp, data science, big data, named entity recognition, ner, neural networks, deep learning, python, parsing
Id: WnGPv6HnBok
Length: 32min 27sec (1947 seconds)
Published: Wed Aug 21 2019