How to Separate Sentences in SpaCy (SpaCy and Python Tutorials for DH - 03)

Video Statistics and Information

Captions
Hello, and welcome back to this series on spaCy for the purposes of DH. If you haven't been following along, spaCy is an NLP module that allows us to analyze texts, and we're working here with the full text of Alice in Wonderland, the entire book. In the last video we separated it out so we could analyze it chapter by chapter, and we loaded in a spaCy model, a trained neural-network pipeline, so that we can analyze our texts.

In this video we're going to start working with some of the basics. All we're going to do is create the spaCy object known in the spaCy documentation as a Doc, and then use that Doc object to run some simple but very powerful spaCy functions: we're going to separate everything out into sentences. This is important to know how to do, because human speech follows the pattern of sentences. You also need to know how to break a text down into what are known as tokens, which is tokenization, but I'll get to that in the next video.

For right now, let's figure out how to create a spaCy object. We're going to create an object called doc and make it equal to nlp, calling in our model, and the one argument we pass it is the text we want to analyze. Here I have chapter one selected; if you remember from the last video, we extracted all of chapter one and loaded it into a single object, because we separated everything by chapter and we're going to analyze the book chapter by chapter. If we run this, nothing visible happens: on the back end it's loading the model and creating the object. If we print off doc, it doesn't really show us anything, because we haven't done anything with the Doc yet; we're doing nothing different from printing the ordinary chapter-one text.

To start interrogating this data, we can create a new object called sentences and set it equal to doc.sents, which separates the whole Doc out into sentences. If I try to print off, say, sentence one, I get an error: "generator object is not subscriptable." Ah, that was my fault: doc.sents returns a generator, so I have to wrap it in a list first. There we go; now we have an object that can be subscripted, and we get "I. Down the Rabbit-Hole." Obviously that's not a first sentence, it's a title. If you want to, you can clean up your data and get rid of it, or you can recognize that the first item will always be the title and separate it out as its own object. Let's jump to our actual first sentence, which is "Alice was beginning to get…" and so on, and if we look down at the output, it has grabbed the entire first sentence.
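A minimal sketch of the steps so far, assuming the small model name, a plain-text file name, and a chapter-splitting approach carried over from the previous video (none of these details appear verbatim in the captions):

```python
import spacy

# Load the small English pipeline, as set up in the previous video.
nlp = spacy.load("en_core_web_sm")

# Assumed: the book was saved as a plain-text file and split into chapters
# earlier in the series; the file name and split string here are guesses.
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    text = f.read()
chapter_one = text.split("CHAPTER ")[1]

# Pass the chapter text through the pipeline to create a Doc object.
doc = nlp(chapter_one)

# doc.sents is a generator ("'generator' object is not subscriptable"),
# so wrap it in a list before indexing into it.
sentences = list(doc.sents)
print(sentences[0])  # in the video this is the chapter heading, "I. Down the Rabbit-Hole"
print(sentences[1])  # roughly: "Alice was beginning to get very tired..."
```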
Now, I know you're probably thinking: why couldn't I just split the whole document up using the split function and separate everything out by periods, or any other kind of punctuation, like the question mark in this case? Well, the answer is that the split function doesn't catch everything. It can't account for inconsistencies in the text, so if a punctuation mark is missing for whatever reason, it misses that sentence boundary; and it can't account for punctuation that occurs in the middle of a sentence. A period after a person's middle initial, for example, would be treated as the end of a sentence, even though it belongs to a name in the middle of one. That's why you wouldn't want to use the split function for this. (For those who aren't familiar, split lets you break a string at whatever punctuation mark you choose.) Another problem is how you account for the various punctuation marks: exclamation point, question mark, period. You might be thinking to yourself, I can just use regex for that, I can write a long regex formula and account for all of those, and you could, but regex doesn't catch everything either. The reason you want to use an NLP library like spaCy here is that its trained models can account for these inconsistencies in the text, for the various punctuation marks, and for the periods that legitimately appear in the middle of a sentence, like the period after an initial.

I'm going to show you one more thing, and it demonstrates something you should really consider when you're cleaning up your data. If we look at this text, we can see that every line of text, not every sentence, starts on a new line, and this presents our simple small model with a problem. If we print off sentences[2] and watch what happens, the output down here gets cut off. It took me a second to figure out why this was happening, but once I did it was very obvious. I can fix the problem, if I want to, by loading in the much more expensive, nearly one-gigabyte large model. If I print off sentences[2] with it, I believe the index is actually going to be off, because the large model recognizes that the heading is a title. Let me run it; this is the large model running now, much more expensive and much slower at the exact same task because it's much more complicated. The second sentence here is this one, because this model has separated out the "I." of the heading, so we need to print off sentence number three instead. When we print off sentence number three, you can see that the large model has grabbed all of it.
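The captions mention swapping in the larger English model as one way around the problem. A rough sketch of that alternative, reusing the assumed file name and chapter splitting from the earlier sketch (the download command is the standard spaCy one):

```python
import spacy

# The larger English model (install first with:
#   python -m spacy download en_core_web_lg)
# is a much bigger download and slower to run, but in the video it segments
# the raw, line-broken chapter text more reliably than the small model.
nlp_lg = spacy.load("en_core_web_lg")

# Assumed: same file name and chapter splitting as in the earlier sketch.
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    chapter_one = f.read().split("CHAPTER ")[1]

doc_lg = nlp_lg(chapter_one)
for sent in list(doc_lg.sents)[:4]:
    print(sent.text)
```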
But why is the small model having such a problem grabbing this? The answer might not be obvious. I'm going to go back to the small model and show you what we can do to account for it. What's occurring here in the text, and what we can't see, is a series of line breaks that in the actual text file look like "\n": there's a single line break at the end of each printed line and a double line break, "\n\n", at the end of each paragraph. What we can do to fix this is, when we read everything in, replace (not split on, sorry) all of those double line breaks with a space; there's a sketch of this cleanup below. This lets our document absorb the line breaks and get rid of them: they're replaced with simple spaces, so the text doesn't come in broken across lines like this, it reads like that, like a normal sentence.

Let me save this, and now that I've written that in, you'll see that the small model can actually account for it. Here you go, it's already done: the small model has grabbed this entire sentence. Why? Because it's now reading it like a sentence. So far we've only removed the double line breaks between paragraphs; we can also get rid of the single line breaks at the ends of the individual lines, and that should help even more. There we go: now we have one long, normal sentence read in, and it isn't broken up the way you'd expect with line breaks.

So that's how you can read in the data from the text file and clean it up a little with the replace function if it has line breaks. That's why I picked this Alice in Wonderland text: a lot of the time your texts are going to have these line breaks, and they will cause NLP models to go crazy. This is how you fix it: we read the text in, we pass it through a model, we create an object from that pass through the model, and then we can start doing some cool things like breaking the text down into sentences. Once you have a text broken down by sentence, you can start doing some really complex things: you can analyze individual words and noun chunks in a sentence, and you can analyze relationships in a sentence between nouns and verbs. That's what we're going to be doing over the next few videos. Thank you for watching this video; if you've enjoyed it, please like and subscribe down below, and visit us at pythonhumanities.com.
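Putting the cleanup step from the end of the video together, a rough sketch of what it might look like (the stand-in text below is a shortened paraphrase of the opening sentence with the same kind of line breaks as the real file, and the variable names are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Stand-in for the raw chapter text: a shortened paraphrase of the opening
# sentence, with the same kind of "\n" and "\n\n" breaks as the real file.
chapter_one = (
    "Alice was beginning to get very tired of sitting by her sister on the\n"
    "bank, and of having nothing to do: once or twice she had peeped into\n"
    "the book her sister was reading.\n\n"
)

# Replace the double line breaks (paragraph ends) and the single line breaks
# (ends of printed lines) with spaces before running the model.
cleaned = chapter_one.replace("\n\n", " ").replace("\n", " ")

doc = nlp(cleaned)
print(list(doc.sents)[0])  # the whole sentence now comes back in one piece
```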
Info
Channel: Python Tutorials for Digital Humanities
Views: 2,398
Rating: 5 out of 5
Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, pythonista, WJB Mattingly, Mattingly, Mattingly Python, wjb and python, wjb python, wjb mattingly python, spacy, separate sentences, sentences in spacy, spacy and sentences, spacy doc object
Id: ytAyCO-n8tY
Length: 8min 32sec (512 seconds)
Published: Mon May 18 2020