Analysing Books and Visualising Book Data with Python and Jupyter Notebook

Captions
Welcome, everyone. My name is Sanjin, and with me is Mohamed, who will be helping out with any questions that come up in the chat. This is a webinar all about book analysis using Python and Jupyter Notebook.

The webinar is based on a unit of work I published on the Digital Technologies Hub, called "Book Analysis with AI Techniques". Mohamed, you can look that up and drop the link in the chat. It was written for high school students and is being used by hundreds of schools in Australia; it's a collection of video tutorials with resources and quizzes, and it's completely free because it's funded by our kind government down here. Today we'll build on that unit with some more advanced analysis, plus some data visualisation on top of the data analysis.

So, natural language processing: what is it? It's the field of computer science concerned with making sense of language, getting computers to both analyse language for meaning and synthesise language themselves. The Oxford dictionary definition is "the application of computational techniques to the analysis and synthesis of natural language and speech". The domain of NLP includes both natural language understanding and natural language generation. Until recently, generation was the very hard part, and something changed in July; we'll get a bit of an insight into that later. In short, NLP is the intersection of computer science, artificial intelligence and linguistics.

As for the components, you can think of analysing text as a linear process with these steps: first you clean the text, removing punctuation and other stray symbols; tokenization breaks the text into words, sentences, paragraphs or chapters; stop-word removal is a further cleaning step that strips out extremely common words with little meaning; frequency distribution is an analysis of how often certain words appear; sentiment analysis is a study of positive versus negative emotion in text, as well as bias; and tagging is the formal categorisation of words into nouns, pronouns, verbs, adjectives and the other parts of speech.

Let's go through these step by step. First up, removing punctuation. Imagine a variable that contains all the punctuation symbols; with a simple string.replace() you can replace each of them with nothing. That lets you remove the punctuation from a whole book's worth of text in about four or five lines of Python, without even using an external library. Run that process and you go from text with commas, exclamation marks and question marks to text with no punctuation at all.
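A minimal sketch of that idea (the symbol list and names here are illustrative, not necessarily what the notebook uses):

    # Strip punctuation from a text using only built-in string methods.
    def remove_punctuation(text):
        symbols = "!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'"
        for symbol in symbols:
            text = text.replace(symbol, "")
        return text

    print(remove_punctuation("Mr. Dursley, of number four, Privet Drive!"))
    # -> Mr Dursley of number four Privet Drive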
Next up, tokenization. You take your original text and, with a couple of lines of Python, break it into words, sentences, paragraphs or chapters using the split() method: split on double spaces for paragraphs, on single spaces for words, on full stops for sentences, and so on. (A couple of things from the chat: Paul, that words.json file gets generated on the fly when you run the function, so it may not actually be missing. Good question.)

What are stop words? Stop words are words that are either extremely common or have little meaning. Built into one of the libraries we'll use is a list of them: "i", "me", "myself", "we", "our", "ours", "ourselves" and so on. As you can see, they carry no connotation in terms of actions being done or identities, so if you take them out, the text becomes much easier to analyse.

Next up, frequency distribution. If we have a book, for instance Harry Potter, we can study how often certain words occur. Once you get rid of all the stop words, and perhaps narrow things down to proper nouns, you get a ranking that is remarkably informative about what's going on in the book. I've done that with Harry Potter: after just removing the stop words, the top of the list reads Harry, Dudley, Dursley, Dumbledore, Professor, McGonagall. These are key characters, and they become evident simply by putting the whole word list into a frequency dictionary and printing it out.

Sentiment analysis has two aspects. First there's polarity: how emotionally positive or negative a piece of text is, from -1 (very negative) to 1 (very positive). Then there's subjectivity, which runs from neutral to biased: the more descriptive adjectives and nouns you use, the more subjective your text. Poetry could be very positive or very negative, but it will definitely be highly subjective; a manual for a lawnmower should have subjectivity close to zero, while an article about the US election would probably score closer to one.

Finally, POS tagging, which is part-of-speech tagging. This is something that makes me feel like I don't actually speak English, because it categorises every word in a text into a part of speech. Take "And now for something completely different": "and" is a coordinating conjunction, "now" is an adverb, "for" is a preposition, "something" is a noun, and so on.
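NLTK, which we'll introduce properly in a moment, can produce exactly those tags. A quick sketch, assuming its tokenizer and tagger data have been downloaded:

    import nltk
    # One-time downloads:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    print(nltk.pos_tag(nltk.word_tokenize("And now for something completely different")))
    # [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
    #  ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]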
In terms of libraries, TextBlob is the one we're going to use for sentiment analysis, because it's pretty much the easiest out there. It's a collection of tools, containing functions sourced from NLTK and pattern for the most part, and it works in Repl.it, making it one of the few libraries that runs in web-based Python IDEs. It can also do tokenization and POS tagging, but I only use it for sentiment analysis. We'll use this library throughout: if you're running Python on your own computer you can just pip install textblob, or you can install it through your Jupyter notebook.

It does something very interesting: it effectively tags every adjective and emotive noun in the English language. You can follow the link in the resources, which actually lists every word. The way that was achieved was through a study of restaurant reviews from Yelp: they correlated adjectives with star ratings, removed words that have double and triple meanings, and came up with something that is about eighty percent accurate. Some examples from TextBlob: the word "happy" has a polarity of 0.8. Put it in a larger sentence where words like "I" and "today" carry no polarity, and the score stays 0.8. Add an intensifier like "very" and the amount of happiness increases; write "not happy" and the score turns negative. So it has some sort of grammatical intelligence, but essentially it works from emotion-based tags on individual words.
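A short demo of those examples (a sketch; the 0.8 figure comes from TextBlob's own lexicon, and the other comments describe the direction of the change rather than exact values):

    from textblob import TextBlob  # pip install textblob

    # Polarity runs from -1 (very negative) to +1 (very positive).
    print(TextBlob("happy").sentiment.polarity)             # 0.8
    print(TextBlob("I am happy today").sentiment.polarity)  # still 0.8: "I", "today" are neutral
    print(TextBlob("very happy").sentiment.polarity)        # the intensifier raises the score
    print(TextBlob("not happy").sentiment.polarity)         # negation flips it negative
    print(TextBlob("happy").sentiment.subjectivity)         # subjectivity runs from 0 to 1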
We're also going to use NLTK, which is basically the leading platform for building Python programs that work with human language data; it's the leading natural language processing library in Python. We'll use it for tokenizing and tagging our text. It can also give you visual outputs and display parse trees, which are quite complex.

Beyond NLTK, we're going to visualise our results using pandas and matplotlib. These libraries are pretty hot in data science, and in science in general. pandas can do all sorts of mathematical calculations; I'm going to use it to create DataFrames, which can then be represented as graphs. It's a high-level data manipulation tool developed by Wes McKinney, built on the NumPy package, and its key data structure is called the DataFrame. matplotlib is basically a plotting library: it creates pie charts, line charts and bar graphs, as well as various three-dimensional visual representations.

So I've covered what natural language processing is and the tools we're going to use. Now, a couple of really brief examples of where this is used in modern technology. This video is taken from a Google demonstration of Google Duplex. By the way, if you have any of the smart assistants (Alexa, Siri, Google Assistant), they use natural language processing to analyse what you say: everything you say is converted to text, and that text is then analysed. This is what it looks like when it works really well. In the demo, Google Assistant phones a salon: "Hi, I'm calling to book a women's haircut for a client. I'm looking for something on May 3rd." "Sure, give me one second... what time are you looking for?" "Around 12 p.m." "We do not have a 12 p.m. available; the closest we have to that is a 1:15." "Do you have anything between 10 a.m. and 12 p.m.?" "Depending on what service she would like. What service is she looking for?" "Just a women's haircut for now." "Okay, we have a 10 o'clock." "10 a.m. is fine." "Okay, what's your first name?" "The first name is Lisa." "Okay, perfect. So I will see Lisa at 10 o'clock on May 3rd." To make it clear: that is a chatbot with a mostly synthetic voice, and that's an actual human being. Google Assistant is making a reservation, and the person thinks they're talking to somebody's human assistant. If you have Google Assistant on your phone, it can already do that, the Duplex reservation feature; I'm not sure if it's still on with the COVID restrictions, but it's pretty cool.

Something else to show you, and this has sort of thrown a wrench into the whole study of natural language processing: currently the most advanced tool for natural language generation doesn't use many of these techniques at all. It's a neural network that has read basically unimaginable volumes of text, and it's a word predictor: you give it some words and it starts predicting the next word, then the next, then the next. It has gone through two iterations: GPT-2 last year, GPT-3 this year. Here's a short video about GPT-2's text completion. The prompt, written by a human, reads: "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English." You give it that, and it starts predicting word after word: "The scientists named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Perez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains..." It reads like a story, completely computer generated, and somehow all it does is read everything up to the current word and then predict the next one, again and again. GPT-3 is so promising that they're actually not releasing it: OpenAI is an institute that generally puts all of their work out for free, but they're saying the potential for abuse of GPT-3 is far too big.

Let me show you one application you can play with, called AI Dungeon, which uses GPT-3 to generate stories. Mohamed, you can put that link in the chat.
It's more fun if Mohamed chooses. New game, single player; you pick a setting: fantasy, mystery, apocalyptic, zombies, cyberpunk. Mohamed picked cyberpunk; I've never had cyberpunk before, so this should be interesting. Do you want to be a cyborg, a cop, an android or a punk? A cyborg. Enter your character's name: let's call him Neo, even though Neo is very much a human. "A cyborg living in a futuristic city, you have a bionic arm and a holo-band. You're walking down the dark streets, and neon lights flash brightly as you pass. You see a person sitting on the ground. He looks at you and nods." So you speak to him: "Hey." "Hi," he responds, "can I have some money please?" You can type "Sure, here is 200, don't spend it on drugs please", and it generates the continuation: he thanks you and runs into the alley; a few minutes later a loud explosion is heard; the man is gone. What happened to him? Basically, this thing has read thousands of cyberpunk novels and is predicting what comes next. You walk to your car and drive down the streets; there's a small explosion, then another. "Go to the explosion": you drive into a large entrance in the park and park your car. "Go to the top of the building": "You walk into the building and head up to the roof." Notice how it associates the top of the building with the roof. "When you arrive you find a man sitting in a lounger. 'Hello,' you say. 'Hey,' he responds. 'Did you hear those explosions?' He doesn't seem to be in a rush to do anything. You sit there and relax while you wait for something to happen." This is actually one of the less impressive examples of AI Dungeon, but you should know that the most impressive language generation tool has absolutely no rules: as I said, it's a neural network with millions of parameters that just predicts the next word, word by word.

So, back to our topic. This is the collection of programs, and it's on GitHub. (From the chat: can we access all of these? Yes, there's a GitHub repo with all of these programs; it's linked from the resource page at the start of the presentation. Mohamed, the path is Head Start Academy, then Free Resources, then Book Analysis, and the link has gone out to all panellists and attendees.)

Now I'm going to talk about tokenizing the text, and I think there was a question from the audience about the source of the JSON files, so I'll explain that in a second. In this case our book is Harry Potter, and the first cell just opens the book and prints it. When I run the cell, the entire Harry Potter book appears, saved in a variable called text.

Here are the functions that tokenize this text. The first is the approach from the presentation: a string contains all the punctuation characters, and you replace each of them with nothing, which removes all the punctuation and returns the new text. For the word list I actually used word_tokenize from the NLTK library. To be honest, I wrote this about three months ago and I've forgotten exactly why I chose it, since text.split() would also give you a word list; one consideration is that tokenizers handle multi-part tokens like "Mr. Dursley" differently, but for the life of me I can't remember the specific case. For the sentence list, though, there is a very good reason for this particular tokenizer: if you split the text on full stops, then an abbreviation like "Mr." or "Dr." becomes its own sentence, so splitting on full stops is not going to make good sentences.
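A quick comparison of the two approaches (a sketch, assuming NLTK's punkt tokenizer data has been downloaded):

    import nltk
    # nltk.download('punkt')  # one-time download of the tokenizer models

    sample = "Mr. Dursley was the director of a firm called Grunnings. He was a big man."

    # Naive sentence splitting breaks on the abbreviation:
    print(sample.split("."))
    # ['Mr', ' Dursley was the director of a firm called Grunnings', ' He was a big man', '']

    # NLTK's sentence tokenizer knows "Mr." is not a sentence boundary:
    print(nltk.sent_tokenize(sample))
    # ['Mr. Dursley was the director of a firm called Grunnings.', 'He was a big man.']

    print(nltk.word_tokenize(sample)[:3])  # ['Mr.', 'Dursley', 'was']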
The paragraph list, meanwhile, does the job with a plain split: you split the text on two spaces, and that gives you a list of paragraphs. For chapters, you split on the word "chapter"; if the word doesn't appear in the text, you can physically add it at the start of each chapter. Generally speaking, when you're going to analyse a book, you'll need to double-check whether there are any punctuation symbols outside the usual set, and if you want to separate out chapters you have to think about how you cut the text; splitting on the word "chapter" is probably the best way, and if that word is capitalised, it's pretty much foolproof.

So now we have four variables: words, sentences, paragraphs and chapters. As we embark on messing around, trying to find out who the heroes and villains are and how the emotions flow through the text, we'll use these four variables. But to be able to use them in future programs, I've stored them permanently in JSON files, and those files get generated when I run this cell. (The last time I ran it was two months ago, when I first gave this webinar, so I apologise if I'm a little rusty on some of this.) The reason I store them is speed: saving the lists to the server, hard disk or some local file makes the analysis faster than re-running the tokenizing functions, which take 20 to 30 seconds each.

Then there's some test code. The first 20 sentences in the sentence list: you'll notice the sentence tokenizer left the full stops in, which is exactly why I used the sentence tokenizer from NLTK. The first 20 words in the word list work fine. For the first 20 paragraphs I've printed dividing lines in between; a paragraph here is anything with two spaces around it, and I'm not sure every item is exactly a paragraph, so this part isn't super accurate. (Paul, let us know in the chat if you're running this through Jupyter Notebook; the open call is what generates the lists, and depending on what IDE you're using, you may need to create a blank words.json first.) Last but not least, the chapters, which are quite easy to tokenize: that was chapter one, and there was chapter two.
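Condensed, the tokenize-and-save step looks something like this (a sketch; the file name and exact splitting rules are assumptions based on the description above):

    import json
    import nltk

    with open("harry_potter.txt") as f:   # assumed file name
        text = f.read()

    words = nltk.word_tokenize(text)
    sentences = nltk.sent_tokenize(text)
    paragraphs = text.split("  ")         # paragraphs are separated by two spaces
    chapters = text.split("CHAPTER")      # works when chapter headings are capitalised

    # Store each list permanently so later cells can load it instantly.
    for name, data in [("words", words), ("sentences", sentences),
                       ("paragraphs", paragraphs), ("chapters", chapters)]:
        with open(name + ".json", "w") as f:
            json.dump(data, f)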
Okay, now into the first interesting thing. We import the following from NLTK: the word tokenizer, the sentence tokenizer, FreqDist for frequency distributions, the stop words corpus, and pos_tag, the part-of-speech tagger. (I'm pretty sure a couple of these imports aren't used; they're just left over from a previous version.) And as you can see, I'm loading the words from the JSON file saved in the previous step instead of re-running the function.

The first thing I'm going to do is create a frequency distribution: a mapping that has every word in the book and how many times it repeats. You use the FreqDist function for that, and it will tell you how many times "Harry" repeats, how many times "to" repeats, that "say" repeats 72 times, and so on. Now my goal is to rank this, because if I have all the words in the text ranked, that should give me something interesting. You can use Python's sorted() function: it takes the dictionary's items and, with reverse=True, gives you a list ordered with the most repeated at the top (the default would put the least repeated first). Then you cycle through that list and print out the key and the value: "the", "to", "and", "a", "of". These words are the most common, followed by "Harry", "wasn't", "it", "this", "he", "said", and so on. That's not very interesting, so we have to start filtering out some words.

In this next cell we eliminate the stop words. You can get them from NLTK; I'm pretty sure the code here was me testing that the download function can fetch particular files within packages, though I don't remember exactly how it works. Nonetheless, stop words are those common English words that carry very little meaning, and here we define a variable holding that list. Then we go through all the words in the ranked list and remove everything that is in the stop words; you can also manually remove words that didn't make much sense. (From the chat: "Sorry, I didn't receive the link till quite late; any links or info?" Yep, there's this one.) As you can see, the result is a lot better: we got rid of the stop words, but the top is still "said", "one", "got", "didn't", "like", "could", "no", "get". So this isn't exactly super useful yet, though it's already a lot better than it was without the stop-word filtering.

What's next? Next we tag the words, which means every word in the word list gets tagged as noun, pronoun, verb, adjective, etc. Using a Python list comprehension over the tagged list, each word becomes a tuple of the word and its tag, and we isolate the proper nouns, meaning the words with the NNP tag. So now, if a word in the ranked list is a proper noun and it's not in the stop words, we keep it: we're filtering out the stop words and taking only proper nouns. Printing that ranking gives Harry, Ron, Hagrid, Hermione, Professor, Snape, Dumbledore, Dudley and so on.
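A sketch of that pipeline end to end (variable names are illustrative; note that tagging a flat word list loses sentence context, a trade-off the notebook's approach also accepts):

    import json
    import nltk
    from nltk.probability import FreqDist
    from nltk.corpus import stopwords
    # nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger')

    with open("words.json") as f:
        words = json.load(f)

    freq = FreqDist(words)                                       # word -> count
    ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

    stops = set(stopwords.words("english"))
    tags = dict(nltk.pos_tag(list(set(words))))                  # word -> POS tag

    # Keep proper nouns (NNP) that are not stop words.
    characters = [(w, n) for w, n in ranked
                  if tags.get(w) == "NNP" and w.lower() not in stops]
    print(characters[:10])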
All of a sudden, what you really have is the characters within the book. You might also get Gryffindor or Hogwarts, since proper nouns cover names of people and names of places, but at the very least you're finding out who is in this book and where it takes place. And just as we ranked the proper nouns, you can rank all the verbs: to be, to have, to get. You would think magical stuff like casting a spell would be the most common, but no, it's really just common English verbs, the only odd one out, weirdly, being "dragon". Which points to one of the common themes here: whether you are tokenizing or categorising text, it never approaches 99% accuracy. The best you're looking for, in terms of how clean your data is (that's a technical term), is around the 90th percentile.

Now, with the dictionary of these common proper nouns, we can use pandas and matplotlib. Once you import those two libraries, it's a matter of two lines. The first creates the DataFrame, which takes in the filtered dictionary (the list of proper nouns that aren't stop words, created just above). In the cell we print the DataFrame to see what it looks like, and then we plot it as a bar chart with the title "Harry Potter Protagonists": Harry, Ron, Hagrid, Hermione, Professor...

And there's a problem I ran into right there, which took a bit of research: if you have "Professor", that could be Professor Snape, but it could also be Professor Dumbledore; one is the good guy and the other the quasi bad guy. So can you use these tools to figure out multi-word names? This is again part of the repo, so it's in the resources. It's a function I'm not going to go through in detail, mostly because I don't exactly remember how it works; I basically got it off a thread on Stack Overflow. But it demonstrates how you can get multi-word human names: it uses a built-in parser that actually knows about human names, so it's not doing anything from first principles. (It's currently crunching all the text; the asterisk means the cell is still processing.) And there you go: Madam Malkin, Harry Potter, Fantastic Beasts, Albus Dumbledore First Class Grand Sorcerer. There are some wrong ones too, like "Make Harry" and "Make Dudley" ("make" is not a name) and "Bonfire Night Albus Dumbledore", so again you're seeing that pattern of roughly 90% accuracy. We can then rank these multi-word names and compare, and in fact I have ranked them here: Uncle Vernon, Professor McGonagall, Petunia, Mr. Dursley. All of a sudden, once you start ranking, the goofy stuff like "Make Harry" and "Make Dudley" goes away.
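The Stack Overflow function isn't reproduced here, but a minimal equivalent using NLTK's built-in named-entity chunker looks something like this (a sketch; the notebook's version differs in detail, and the chunker models need a one-time download):

    import nltk
    from nltk import word_tokenize, pos_tag, ne_chunk
    # nltk.download('maxent_ne_chunker'); nltk.download('words')

    def extract_person_names(text):
        """Collect (possibly multi-word) PERSON entities, sentence by sentence."""
        names = []
        for sent in nltk.sent_tokenize(text):
            tree = ne_chunk(pos_tag(word_tokenize(sent)))
            for subtree in tree.subtrees():
                if subtree.label() == "PERSON":
                    names.append(" ".join(leaf[0] for leaf in subtree.leaves()))
        return names

    print(extract_person_names("Albus Dumbledore spoke quietly to Professor McGonagall."))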
Now, going back to the tokenization: we had chapters, so let's look at sentiment analysis, which is remarkably simple in terms of the amount of code. The import is from textblob import TextBlob. I load the chapters file, which was saved in that JSON earlier (I could also have regenerated the chapters with the split function, if you remember), and I create a dictionary where the chapter number is the key and the chapter text is the value. Then I print out the polarity, meaning the emotion, of each chapter. Chapter zero is whatever the split leaves above the first occurrence of the word "chapter", so it's fine to ignore. Chapter one is slightly positive, chapter two is slightly positive, chapter three is slightly negative, and so on: for each chapter number, a value between -1 and 1 measuring how emotionally positive or negative that chapter is.

Now I put that in a bar graph, and there you go: that's the emotional arc of Harry Potter. Happy, then a crisis, happy again, some crises right before the end, and it finishes happy. You can see the emotional flow of the book just by taking the sentiment of each chapter. And just for fun you can print the line graph too; again this is remarkably easy. You create the pandas DataFrame from the dictionary, pass the orientation, and the iloc just says go from 1 to 18, because I wanted to remove chapter 0, since there's nothing in it.
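A sketch of that cell (file and title names follow the walkthrough above; the exact DataFrame options in the notebook may differ):

    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    from textblob import TextBlob

    with open("chapters.json") as f:
        chapters = json.load(f)

    # Chapter number -> polarity in [-1, 1]; skip chapter 0 (front matter).
    sentiment = {i: TextBlob(ch).sentiment.polarity
                 for i, ch in enumerate(chapters) if i > 0}

    df = pd.DataFrame.from_dict(sentiment, orient="index", columns=["polarity"])
    df.plot(kind="bar", title="Emotional arc, chapter by chapter", legend=False)
    plt.show()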
So now, heroes and villains. We're going to use the paragraphs, words and sentences, plus a fun function called protagonist_score. The way it works: you pass in a candidate, which is a bit of text like a name, and you also give it the book in tokenized form, for example the list of all paragraphs. It then measures the emotion associated with that candidate, paragraph by paragraph, giving a higher score when the emotion is strongly positive and a lower or negative score when it isn't, and returns the total.

Here are the four houses of Harry Potter, so you can see how the protagonist score works. Gryffindor is Harry Potter's house, he's the hero, and it scores the most positive. Slytherin are the evil ones, but funnily enough, Ravenclaw, who I think are the cold-hearted intellectuals, seem to come out even worse than Slytherin, who are the selfish, career-oriented people in Harry Potter, or I guess the servants of the Dark Lord Voldemort; Hufflepuffs are really nice. You can put anything in as the parameter, just for fun: Hagrid, who was a helper of Harry Potter and a good guy, and Ron, who is Harry's friend, and find out whether they're heroes or villains. Running the whole thing again, they both get very high scores, much higher than most of the houses, so both Hagrid and Ron are correctly assessed to be positive characters on this protagonist (hero versus villain) score.

In the next cell I form a dictionary called protagonist_index: for each candidate in the ranked list, the ranked proper nouns from before, I compute the protagonist score of that character, that is, how positive or negative each one of these people is. Printing it as a dictionary gives each name and its protagonist score, in the order they're ranked by popularity, so Harry is at the top, then Ron, Hagrid, Hermione, and so on. In the next cell I print it as a bar graph. The most positive person is actually Dumbledore, the kind-hearted professor who takes care of Harry throughout the book, which is no surprise, and Snape, being the villain, is correctly assessed. The only character it gets completely wrong is Neville Longbottom, a good kid that bad things happen to, and even that isn't really completely wrong: if negative emotion surrounds a character, it could mean the character is a villain, but it could also be that bad things happen to a good person. Still, this is about 90% accurate at identifying heroes and villains in Harry Potter, and maybe even if you hadn't read the book, you could talk to someone who had and make some sound predictions about it.
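A minimal sketch of the protagonist-score idea: sum the sentiment of every paragraph that mentions the candidate (the notebook's exact weighting of extreme emotions may differ):

    import json
    from textblob import TextBlob

    with open("paragraphs.json") as f:
        paragraphs = json.load(f)

    def protagonist_score(candidate, tokenized_book):
        """Total polarity of every chunk of the book mentioning the candidate."""
        score = 0.0
        for chunk in tokenized_book:
            if candidate in chunk:
                score += TextBlob(chunk).sentiment.polarity
        return score

    for name in ["Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff", "Hagrid", "Ron"]:
        print(name, round(protagonist_score(name, paragraphs), 2))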
Finally, the character journey. You can think of this as taking any character in the book and plotting their emotional arc: what happens to Harry as the book goes on. I recall that when I built this I had a problem, because a lot of characters go unheard from for a couple of chapters at a time; they appear in chapter one, then five, then eight and nine. So instead of running through the book chapter by chapter, I create a new giant list, as long as it needs to be, of every sentence that mentions the character, in the sequence of the book. After that I have a function called character_journey, where the sentiment is just the value for each set of sentences (the sample_size parameter chooses how many sentences go into each sample), and character_journey_cumulative, which is exactly the same function except it adds up all the emotion as it goes. Running that cell just defines the functions, so nothing happens yet.

Now I pick Hermione. First this prints some samples from the sentences that are about Hermione, which I just wanted to read, and then comes the actual character journey. You could put any character from the book in here; we've set the protagonist to Hermione, and we can rerun it with a different character. Hermione starts off neutral, then becomes negative, then positive, and she comes out really good at the end of the book; she keeps annoying Harry early on. (It's been so long since I read it. Mohamed, have you read Harry Potter? Seen the films, right. If you've grown up in Australia, the States or England, you've almost certainly come across it.) I hope I've explained this sufficiently well: instead of going through chapters, you go through all the sentences the particular character is in, and you see their arc.

If we say Harry and run this: whoa. For Harry, it's more interesting to change the sample size from 10 sentences to 50, and to look at the cumulative version too. So that's Harry's journey; it's kind of sad at the end, and to be honest I don't remember why. Changing the function to the cumulative one shows a different picture: overall, Harry experiences more good things than bad throughout. When the graph goes down he's had a rough patch, but Harry is consistently associated with positive emotions. Put Snape in and the graph goes in the opposite direction: it actually rises, but it stays negative, so cumulatively Snape is still negative. Snape doesn't appear all that much, so I'd probably made the sample size way too big; making it 10 gives a more meaningful graph of Snape, who starts off super bad, then kind of plateaus in his badness and becomes less bad. Generally speaking, this is a way to data-science your way around the hero's journey: the way heroes' journeys typically work in books is that things are good, then something bad happens, they go down, and they wrap up back in the positive domain. (That last DataFrame, by the way, was just both graphs, the per-sample one and the cumulative one; I don't know why I was switching between them when I had both.) There you have it: a good set of tools to play with.
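A sketch of the character-journey idea: sentiment over consecutive batches of the sentences that mention a character (the function and file names follow the webinar, but the body is a reconstruction):

    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    from textblob import TextBlob

    with open("sentences.json") as f:
        sentences = json.load(f)

    def character_journey(character, sample_size=10, cumulative=False):
        """Polarity of each consecutive batch of sentences mentioning the character."""
        mentions = [s for s in sentences if character in s]
        scores, total = [], 0.0
        for i in range(0, len(mentions), sample_size):
            batch = " ".join(mentions[i:i + sample_size])
            polarity = TextBlob(batch).sentiment.polarity
            total += polarity
            scores.append(total if cumulative else polarity)
        return scores

    pd.DataFrame(character_journey("Hermione")).plot(title="Hermione's journey", legend=False)
    plt.show()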
Let me show you one more thing: Project Gutenberg, a place where you can get all the books you want as text files. The homepage has changed a bit, but if you type in "Alice in Wonderland" there will probably be five published editions; just pick one that has a lot of downloads and grab the plain-text version (not the EPUB, and watch out for the French edition). You can download the files, plug different ones into the algorithm, and analyse a different book. Conveniently, most of these text files have the word CHAPTER written in all caps, so you can split chapters exactly as we did. The link is gutenberg.org. Weirdly, last time I looked, Project Gutenberg didn't have the Harry Potter books; I don't know if their status has changed, but there you have it. (A sketch of fetching and splitting a Gutenberg text appears at the very end of this transcript.)

So that's where we're going to wrap up. A question from the chat: do you have to clean Gutenberg book files before loading them? Usually there are three or four versions of everything, Paul, and you just find the best-looking one. One thing that I've had to do: when I run the remove-punctuation function and find some other weird symbol has survived, I paste that symbol into the symbols string and clean it that way. There is a remove-punctuation function built into NLTK, but I prefer to do it manually, because books tend to have just the odd weird symbol lingering in them. Also, Paul, you've sorted yourself out with the JSON files, right? Got it, sweet; Paul looks like he's on top of his Python and data analysis. We did have to download some NLTK data along the way. I'd also suggest trying Google Colab, which is probably becoming even more popular than Jupyter Notebook; it has a few more automatic imports than Jupyter, so you could check that out as well.

That's going to be everything. I'll hang around for another couple of minutes in case more questions come in; I'm of course happy to answer them. One question to everybody out there: if there are any teachers, let me know in the chat, because what I do for a business is ICT extension, basically coding extension: teachers select small groups of students, and I mentor and teach them Python throughout the year. I also teach part-time myself, and I've been doing a lot of this since COVID hit. Mohamed, do you have any questions? No? Sorry for putting you on the spot like that. The one thing I'll tell people here that I regret not finessing: the imports at the top of the notebook are pretty much copied and pasted from the commonly used ones, so the ones at the top are not necessarily all used in the actual notebook; be aware of that. In case I don't get to say it later: thanks for the kind words about sharing the materials and organising the session; you're welcome. You guys take care. I'm just going to stop the YouTube broadcast, and we're going to wrap up now. Thanks, everybody. Bye!
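As a closing sketch, here's one way to fetch a plain-text book from Project Gutenberg and split it into chapters (the URL pattern and book ID are assumptions for illustration; check the "Plain Text UTF-8" link for your chosen edition on gutenberg.org):

    import urllib.request

    # Alice's Adventures in Wonderland is Gutenberg book #11 (example ID).
    url = "https://www.gutenberg.org/files/11/11-0.txt"
    text = urllib.request.urlopen(url).read().decode("utf-8")

    chapters = text.split("CHAPTER")[1:]   # headings are usually in all caps
    print(len(chapters), "chapters found")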
Info
Channel: Sanjin Dedic
Views: 464
Id: NCpQqPQq1M8
Length: 57min 31sec (3451 seconds)
Published: Sat Aug 29 2020