Identify Stocks on Reddit with SpaCy (NER in Python)

Captions
Hi, welcome to this video on named entity recognition with spaCy. I'm going to take you through how we can use spaCy for named entity recognition, or NER. First, we're going to take the text that you can see here, and what we're aiming to do is extract all of the organizations mentioned in it; in our case, that means extracting ARK. We're going to look at how we do that, and I'm also going to show you how we can visualize the process using displaCy, a visualization package included with spaCy, which is super cool. Then I'm going to show you how we extract entities programmatically; visualization is great, but we do want to pull those entities out in a more programmatic fashion. Once we've done that with a single example, we'll want to scale it to a ton of examples. What I have is a sample of, I think, 900 posts from the investing subreddit, and we're going to build a process that takes all of those, pulls out all of the entities being mentioned, prunes out a few that we don't really want, and then gives us the most frequently mentioned organizations within that dataset.

So let's jump straight into it. We have our text data here, which is a single extract from the investing subreddit, and I'm going to use it to show you how spaCy works and how we can do named entity recognition on a single piece of text. We want to start by importing spaCy, and if you don't already have it, it's very easy to install: you just run pip install spacy, and that will install the module for you. We'll be using both spaCy itself and something called displaCy, a visualization package that comes with spaCy, so that is: from spacy import displacy. One final part of the setup: we also want to load in our model.

spaCy comes with a lot of options in terms of models, and we can see those at spacy.io/models. If we come down here, we see the model that we will actually be using, and I want to quickly cover what we are looking at. Here we have the naming conventions: a model's name is built from the language and the name. The language for us is of course English, and the last part, the name, consists of the model type, genre, and size. The type here is "core", a general-purpose model which includes vocabulary, syntax, entities, and word vectors; we're interested in the entities for the NER task we're working on. The genre, "web", is the type of data the pipeline has been trained on; the two examples they give are web and news. Web includes things like blogs, and Reddit fits pretty well with that, so we're going to use web. Then we just have the model size, and we're going to go with small.

To download the model, we go back into our command-line interface and type python -m spacy download followed by the model name, which for us is the English core web small model, en_core_web_sm. I'm not going to run this because I have already downloaded it, but that is what you need to do. Once it's downloaded, we can load it into our notebook: we load it into this nlp variable with spacy.load and the model name again, and there we go, that is our model. Now, actually processing the data is super easy: we assign the output to this doc variable, and we just call nlp on the text. The whole setup is sketched below.
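Pieced together, that setup looks something like this; a minimal sketch, where the sample text is an illustrative stand-in for the r/investing post shown in the video:

```python
# One-time setup, run in a shell:
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy
from spacy import displacy  # spaCy's built-in visualizer, used shortly

# load the small English general-purpose pipeline
nlp = spacy.load('en_core_web_sm')

# illustrative stand-in for the post used in the video
text = (
    "Given the recent volatility in tech stocks, I thought it would be "
    "prudent to share the risks of investing in ARK ETFs, written up "
    "nicely by The Bear Cave."
)

# processing the text returns a spaCy Doc object
doc = nlp(text)
print(doc)
```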
Now print that out. Okay, we can see that we have something which just looks like the text we passed in, so it looks like nothing has actually happened, but that is not the case: this is actually a spaCy document object. If we call help on doc here, we see that we have the Doc object, with all of these different methods and attributes, so it has worked. That's good, because it means we can use doc.ents to access the entities that spaCy has identified within this document object, or within this text. We can see here, although it doesn't tell us what type these labels are, that we have ARK, ARK, ETFs, and another one, Bear Cave. It doesn't show us the label information, but the information is there.

I want to quickly show you displaCy, because it's pretty cool, and visualize what is actually happening. We do displacy.render, pass in our document object, and set the style for the visualization; there are a few different styles, and we're going to use the entity style. This is pretty cool: it shows us the text with labels on top, and we see that ARK and ETF are identified as organizations. We don't really want ETF in there, as an ETF is an exchange-traded fund and not really what we're looking for in terms of organizations; nonetheless, it is identifying ARK correctly three times, which is pretty good.

Now, "work of art": when I first saw this label, I had no idea what it meant. To me, that sounds like a Picasso painting or a statue by Michelangelo. What we can do to get a short description of a label we don't recognize is type spacy.explain, and we'll do that for work of art. We see that it covers titles of books, songs, and so on, which makes a lot more sense than what I was initially thinking, and it also fits quite well here: this Bear Cave item is actually an article, and while it's not quite a book, it is something that someone has written, just like a book or a song, so in my opinion it fits that category.

So that's great for visualizing the entities in the text, but we obviously want to process this in a more efficient way; we can't just visualize it. This is where we go back to our doc.ents. What we want to do is work through each one of these in a for loop, and although they look like they're just text or something along those lines, they're not: they're actually entity objects. Let me show you how we deal with that. We write: for entity in doc.ents, and we can print out the label, the ORG or WORK_OF_ART, by accessing the entity object's label_ attribute; just notice the underscore at the end of that attribute name. That gives us the label, and then we can also get the entity text by accessing the text attribute. And then we see, okay, that's pretty cool, because now we have the organization or work of art label, and next to it the part of the text that is actually being extracted for us. That is really useful, and it's actually all we need to start extracting and processing the data. Here's what that looks like.
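The visualization, the label lookup, and the entity loop from this part might look like the following; a sketch continuing from the setup above:

```python
# render the text with entity labels highlighted
# (displays inline in a Jupyter notebook)
displacy.render(doc, style='ent')

# get a short description of an unfamiliar label
print(spacy.explain('WORK_OF_ART'))  # -> 'Titles of books, songs, etc.'

# each item in doc.ents is an entity object: label_ holds the entity
# type, text holds the matched string
for entity in doc.ents:
    print(entity.label_, entity.text)
```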
So if we come down here and take this loop, we're going to modify it a little bit and extract the organizations into a list. We initialize an org_list, and then we add some logic that says: if the label is equal to ORG, append the entity text to org_list, so org_list.append(entity.text); note that it's the entity text we append, not the label. Let's view our org_list at the bottom here, and we get our list of all the organizations. It has excluded Bear Cave, because The Bear Cave is not an ORG, it's a WORK_OF_ART, so that's pretty cool. But ideally, from my perspective, we don't need ARK popping up three times; we just want to know which organizations have been mentioned, and we don't care how frequently they've been mentioned within a specific item. To do that, we convert the list to a set, which removes any duplicates, and then convert it back into a list: org_list = list(set(org_list)). Now we just have ETF and ARK, and that's exactly where I wanted this to be.

Okay, so we've applied this to a single piece of text, but we want to apply it to a full dataframe, and the first thing we need to do is actually import the text. I've pulled this from Reddit; this is the data that we're going to be using. We're pulling it from the investing subreddit, and we're using the Reddit API to do that. If you haven't used the Reddit API before, I do have a video on that, so I will leave a link in the description; otherwise, you can get this data directly if you don't want to go through the whole Reddit API process, and I will leave a link to that in the description as well.

Now we just want to import pandas and read in our data with pd.read_csv. For me, it's in the data directory as reddit_investing.csv, and the separator we're using here is the pipe delimiter. Let's make sure we've read that in correctly, and there we go, we have our data. The column we really focus on here is this selftext column: in here we have 836 posts, and we'll apply our NER to all of them and see what people are talking about.

We need to convert what we did above into a function that we can apply to our dataframe, so let's take that code and turn it into a function called get_entities, which takes a single string. Inside, we create our document object with nlp(text) (we've already loaded the model into the nlp variable above), initialize our organization list, work through each entity, appending the organizations to the list, and then return that list; we also want to remove any duplicates, so we return the set-then-list version. Now we can run that and apply it to our dataframe: we create a new column called organizations, take the selftext column, and apply our get_entities function to it, as in the sketch below.
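As a function applied across the dataframe, that might look like this; a sketch where the file path follows the one described in the video, and selftext is assumed to be the Reddit API's name for the post body column:

```python
import pandas as pd

# pipe-delimited dump of posts pulled from r/investing via the Reddit API
df = pd.read_csv('./data/reddit_investing.csv', sep='|')

def get_entities(text):
    """Return the unique organizations spaCy finds in a string."""
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # set() drops duplicate mentions within a single post
    return list(set(org_list))

# new column holding the organizations mentioned in each post
df['organizations'] = df['selftext'].apply(get_entities)
```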
Let's see what we get. This will take a little bit of time, because we're processing a lot of entries here; obviously, if you're doing this for a larger dataset, you're probably going to want to batch it a little: keep the data on file somewhere, read maybe up to a thousand samples at once, apply this, save the results back to file, and work through it like that. For us, we can see straight away that we have some things we probably don't really want in there. I'm not sure what some of these are, and then we also have things like S&P 500 and PE; loads of things that aren't really what we want in there, because we just want actual company names.

What we can do is create a blacklist, by which I mean a list of anything that we don't want to be included, for example these here; we really don't want those. We don't necessarily need to do this for everything, though, because what we will find with a lot of the items we don't want to include (actually, I think I'll keep these two in as an example) is that they only appear maybe once or twice in the whole dataset, so we can filter those out by only keeping organizations that appear at least three or four times within the dataset; that filters out all of the rubbish we get from those. But other terms, like SEC, will appear quite a lot, and we don't necessarily want to be finding the SEC everywhere. In some cases maybe you do, but in this case I'm going to remove it, and I'm going to remove the S&P 500 as well, and maybe leave it at that. Actually, I assume Lemonade isn't a company here, so I'm going to put that in as well, and there are a few others I've noticed that come up quite a lot: the FDA, Treasury, and Fed appear all the time, CNBC always appears, EU always appears. I think that covers a fair few of the ones we don't want, so we'll include those.

To exclude them from our search, we just add another "and" condition on the entity text. You'll see that everything in the blacklist is in lower case, so we apply lower() here too, which means we don't have to type out Fed in capitals and fed in lower case: we write entity.text.lower() not in blacklist, and that will drop any entities that are included in the blacklist. We can keep updating it as we go along. So let's rerun that, rerun this as well, and start writing out the next part of what we're doing here.

What I want to create is essentially a frequency table: we want each one of these companies, and we want to see how often, or how frequently, they are mentioned. To do that, we can use a Counter object from the collections library. We simply pass it a list, for example, and it goes through and counts all the instances of each value, organizing them into the Counter object, which gives us a few useful methods for viewing that data, for example viewing the most common values in the dataset. That's pretty useful, and it's what we're going to use, so we need to import it from the collections library; it's the Counter object.

Like I said before, Counter needs a list, and at the moment we have a column in the dataframe, so it's not really in the right format; we need to transform it. What we need is a simple flat list, so the first thing we can do is take that column and convert it into a list with to_list(). You can see here that we do have lists, but it's actually a list of lists: we've got a list, and within that list we have all these other lists, and we don't want that for our Counter object; we just want a plain, flat list. So we need to add another step to the process, which is flattening that list. We'll call it orgs_flat, and here we're just going to use a list comprehension to loop through each list within the list and pull each item out into our new list, as sketched below.
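With the blacklist added, the updated function and the flattening step might look like this; a sketch continuing from the one above, where the blacklist entries are illustrative, based on the terms called out in the video:

```python
from collections import Counter  # used for the frequency table next

# lower-case terms we don't want counted as companies; extend as you go
blacklist = ['sec', 's&p 500', 'lemonade', 'fda', 'treasury', 'fed',
             'cnbc', 'eu']

def get_entities(text):
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        # keep organizations only, skipping blacklisted terms; lower()
        # means we don't need every capitalization variant in the list
        if entity.label_ == 'ORG' and entity.text.lower() not in blacklist:
            org_list.append(entity.text)
    return list(set(org_list))

df['organizations'] = df['selftext'].apply(get_entities)

# the column is a list of lists; flatten it into one list of mentions
orgs = df['organizations'].to_list()
orgs_flat = [org for sublist in orgs for org in sublist]
```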
What I mean by that is that org here is a single item within a sublist; if I just view the first two sublists here, org is an item like the S&P 500 or PE, and that becomes an item in the new list we're making. The items come from a sublist, the sublists are these lists here, and we need to iterate through each sublist within our orgs list, which is the full thing. At the end we're just saying: go through each org, so each item, in the sublist. It's a slightly confusing syntax, but it works, and it's something you get used to if you're not already. Then let's view the first five entries in orgs_flat, and there we go, we have a few companies.

Now we can pass this into our Counter object, so we do frequency = Counter(orgs_flat), and then we can view the most frequent entries using the most_common method. Here we just pass the number of the most common items we'd like to see, so if we'd like to see the top ten, we pass 10, as sketched below. Then we can see the most frequently mentioned organizations from the investing subreddit data that we have. There are a few things in here that we probably want to get rid of, like EV, ETF, Covid, and we've got Stock Exchange and SPAC; there are a few items in there that we can definitely prune out with the blacklist. But overall, I think that looks pretty good, and it very quickly shows how easy it is to apply named entity recognition to a dataset to extract what the text within that dataset is actually talking about. If you start pairing this with things like sentiment analysis, it can get pretty cool, and that's definitely something I think we will cover soon, but for this video I'm going to leave it at NER. I hope this has been useful; I really appreciate you watching, and I will see you again next time.
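Finally, the frequency table itself; continuing from the sketches above:

```python
# quick sanity check on the flattened list
print(orgs_flat[:5])

# count how many times each organization is mentioned
frequency = Counter(orgs_flat)

# view the ten most frequently mentioned organizations
# as (name, count) pairs
print(frequency.most_common(10))
```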
Info
Channel: James Briggs
Views: 911
Rating: 5 out of 5
Id: TCZgXFPNnbc
Length: 21min 47sec (1307 seconds)
Published: Wed Mar 03 2021