Emil Hvitfeldt - Text Preprocessing in R

Captions
Hello everybody, welcome, come on in. We're going to get started in just a little bit while the room fills in. I hope everyone's having a great start to their hot vax summer, which I think is the thing everyone's saying right now. I hope everyone's doing well, starting to feel a little bit better after the past year, and getting ready to slowly get back to normal, for those of you in countries that are highly vaccinated. I know people join from all over the world, so if you're in a country that is not as highly vaccinated, I hope you stay safe until that happens. From an American perspective it looks like the light at the end of the tunnel is getting closer, but I know for other people it might be farther off, so good luck, folks; I hope it ends soon for everybody.

First up, as we always do at this meetup, I like to ask who is hiring. As we've learned over the past year-plus, you can't all jump up and shout and tell me who's hiring, so if you want to hire somebody, go to the Slack. The link for the R Slack is nyhacker.org; I'll put it right here in the chat. (Oh man, I screwed it up already, let me do it again, folks. Okay, that looks better; I had two slashes before.) If you go to nyhacker.org/slack.html there's a link to join the meetup Slack, which has lots of other stuff besides job announcements. If you're looking to hire somebody, post it in the job-postings channel; hopefully you'll find a good employee. If you're looking to get a job, go in there and look for the postings.

Speaking of hiring, my company, Lander Analytics, is hiring. I know not everyone can announce it, but since I'm on screen I'll tell you that I'm hiring. We're looking for a few roles: a technical sales role, or a full-on sales role too (we have a few sales roles), and a data scientist / Linux administrator role. If you're looking for those jobs and want to come work for what I think is a fun company with what I think is an awesome boss, come talk to me. You all know how to find me: on Slack, email, Twitter; I'm findable in many, many ways. So come find me if you want to do sales, data science, or Linux administrator roles. I think it'll be a really fun company to work for, and I'm not biased in the slightest.

I know we all can't be together, and it's really sad, but keeping up tradition, I got my pizza. This one comes from Joe's Pizza. I'll do the cheese-lock test: the cheese is holding on very nicely, and there's some good charring on the back side. I hope everyone is enjoying their pizza. Let me know in the chat where you are, where your pizza's coming from, or whatever snacks you want to have tonight. Selena had a cheese croissant; that sounds tasty, that sounds great. Daniel Chen, you miss New York City pizza, I know you do. Look, we're getting you back up to New York soon. Anybody else, let me know what you're eating or drinking or whatever you're doing. We can't do this for real, so let's at least try to pretend we're being convivial together. All right, while I wait for everyone else to tell me where they've eaten: pizza from Buffalo; Boston, your pizza looks like pork chops and rice, which is very interesting-looking pizza, I like it. And we have Nicholas from Buffalo.
Anyway, I hope everyone's enjoying their food. I know you don't come here for the food; you come for the stats and stay for the food, that's what we'll say. A few things you can be doing virtually. Oh, Tommy's getting his pizza, and that looks like a beer. We have a steak bowl, that sounds great, and a pomegranate white tea in the DC area; pomegranate white tea sounds very refreshing. Devin, I'm with you on that, I can dig it; that sounds very impressive. I've got mine right here.

Some things coming up, other events you can attend. This week, I believe starting Wednesday (it actually might have already started; I'm not quite sure, so check the email I sent out to everybody last week), there's a conference for Neo4j, the graph database. I think their workshops were last week, but the conference itself is this week, so you can sign up. More information is available at meetup.com in our events channel; I think Nicole will post it there, or she already has, so just go there and click that link. Dan Chen, who is here missing New York City pizza, will be moderating some of the sessions, so you can check him out and check out this great, great event.

Speaking of conferences: we announced it formally, and tickets are on sale for the New York R Conference. We are very, very excited to be hosting the New York R Conference both in person and virtually in September. We had a calendar snafu, so we changed a few things on the schedule: the workshops will now be September 1st, and the conference itself will be September 9th through 10th. Yes, I know they're a week apart, which makes it hard if you're traveling to New York City, but everything will also be offered virtually. So if you want to do the workshop virtually and then attend the conference in person, or flip it around, or do both, one way or the other you can. Remember, folks: September 1st and September 9th through 10th, the New York R Conference. And Nicole just posted the link, because I am apparently a little challenged at posting links; she posted it in the chat: rstats.ai/nyr. Stephen, you're coming to New York? That's exciting, we're glad to have you there.

And since you're all members of the meetup: this conference grew out of the meetup. Those who've been around for a long time know that I said, we do a meetup every month, let's do a conference, a two-day event. So for you members of this meetup there is a discount code, NYHACKR, same as our website; that's our slogan, our logo, our catchphrase, whatever the right word is. Use code NYHACKR to get your 20% discount. We are very excited about that.

And I see, oh, Selena has told me she's coming; she's a former speaker. And Selena is giving artwork that will be auctioned off. Not sure she knew that yet, but she did it last time, so Selena, we want to auction your art again; please get it to us. Every year at this conference, for the past few years since Thomas Levine started it, we auction off art made by the R community, and all the money goes to the R Foundation (and I believe one other foundation whose name I can't remember right now, but definitely the R Foundation). Some examples of art we've auctioned in the past: the piece by Jacqueline Nolis behind me, "Greetings from Statistics," which I later got commissioned as a painting; a painting by Vivian Peng;
and I believe the piece behind Emil, the neural network layers, we've also auctioned off in the past. That's such a perfect speaker to have here, with the art already behind him; thank you, Emil, for having it.

All right, other events we have coming up. We have a Stan workshop; every year we've done a Stan workshop with Jonah Gabry, Andrew Gelman, Rob Trangucci, and a bunch of other rotating people coming through. We're very excited to do this again this year. It's going to be virtual, because we're not quite ready to open up in person yet: July 14th through 16th. I will hopefully successfully post the link inside the chat; let's see if this works. All right, that's the link to learn more about the Stan workshop. We do it every year, and every year it's completely different content. My team and I attend every year, and every year we're like, "wow, we did not know any of that," because they keep changing the content, and we've actually seen attendees come back repeatedly because they keep learning new stuff. So if you want to learn about Stan, which is Bayesian Markov chain Monte Carlo, come to this; it's always a great time. Jonah Gabry is going to be leading it, I believe Rob Trangucci will be assisting, and Andrew Gelman will do his normal pop-in like he does every year. We are very excited about that.

The day before the Stan workshop, we have the July meetup with Sean Taylor, on July 13th; that'll be announced in the coming days. I don't even know what he'll be talking about, but it's Sean Taylor, so we're all going to check it out, because it'll be awesome. Sean has spoken a few times at the meetup before, and he's spoken at the conference, and he'll be speaking virtually at the meetup on July 13th, so you'll all get that announcement very soon. Then in August we will have Ian Cook speaking virtually at the meetup. So we have the next two months planned; it's so nice when things are planned in advance. August will be Ian Cook, and we are really looking forward to that.

And then, I'm really excited about this: we don't have any details yet, but we will be returning to an in-person-plus-virtual meetup in September. We always like to have a meetup the same week as the conference; again, due to our calendar screw-up, it'll be the same week as the workshop this time. So September 1st or 2nd we're going to have an in-person meetup, and we'll be doing it virtually and in person, both, to make sure everyone can still participate. We're very excited about returning to in-person meetups to coincide with the in-person conference.

Now, we're not going in person until September. I know a lot of things are opening up, especially in New York City (I think they're having fireworks tonight to celebrate the city reopening), but we're not quite ready to do it yet. One of the main reasons is we can't get space to host us yet: most of the companies that usually host us are not comfortable hosting until September, and even then we don't have it fully planned out yet. We do fully intend to have the meetup coincide with the conference in September, but if you work for a company and you have office space, please let me know if you would like to host the meetup physically and let us into your office. We need at least a hundred seats, more I think, because more people are going to be coming than usual. We used to be able to fill a room with 300 seats, but we don't have that much space anymore. So if anyone has a company that would like to host the meetup, let me know: be in touch with me on Twitter, on Slack, by email, by any way you want. Let us know if you have office space somewhere, preferably in Midtown or below; that's to make it easy for people to get to, and to get to the train or subway afterwards.
If you can host us, that would be really great, because we're always in search of space. Until then we are still virtual, and we will continue to be hybrid going forward. I'd like to thank EcoHealth Alliance for providing the Zoom. So thank you, Noam Ross and Emma — I'm forgetting your last name, Mendelson I think, I'm so sorry — thank you, Noam and Emma, for providing the Zoom link for us this entire past 14 months; it's been really, really helpful.

As we get started: if we have questions for our speaker, we obviously can't ask the questions directly, so use the chat right here in Zoom, or send a question to the Slack channel called monthly-meetup-chat. There is a channel called monthly-meetup-chat where, if you folks just want to talk about what's going on, or ask questions, you can do it in there. If there's a question for Emil, I will collate all the questions and ask them at the very end on behalf of everybody, because we can't all jump right in. So again, if you want to ask a question, post it here in Zoom or post it in monthly-meetup-chat in Slack.

So, that's a lot of announcements. Key takeaways: you want to hire, post the jobs; eat your pizza; Neo4j conference this week; our conference in September; Stan workshop next month; meetup with Sean Taylor next month; meetup with Ian Cook in August; returning to in-person-plus-virtual for almost everything, at least for this meetup group, in September. That's a lot of stuff. With that, I hope everyone's had a good time, and I hope you're going to have a good time, as I bring on the next speaker, who is very appropriate to speak on this topic given his history with textrecipes. Please everyone give a warm virtual welcome to Emil.

Thank you so much, Jared, and thanks for having me. It is wonderful; this is such a flawlessly organized meetup, continuing for years and years. Thank you. I will be talking about text preprocessing. My name is Emil Hvitfeldt; if you have a hard time pronouncing it, remember that both the h and the d are silent. I know, it just makes life a little bit more exciting when names aren't pronounced the way they're written.

Just a little bit about me: I am a clinical data analyst at Cellular Health, and I'm also spending a little bit of my own time teaching at American University. I'm teaching statistical machine learning and trying to do useful modeling; it's very exciting. I'm also a maintainer of almost a dozen different packages on CRAN; they tend to revolve around color palettes, working with text, tidymodels, and interactions. And Julia and I are working on the book Supervised Machine Learning for Text Analysis in R — quite a mouthful, but we couldn't find a shorter title that really explained what it's about — and that one's coming out very soon; wait for the end of the talk for a little more information about that. I'm located in Southern California, where I've been with my wife since 2017, and I'm living here with my three cats. I'm sure everybody wants pictures, so here we have Presto, Orion, and Whittles. They're siblings, around two years old, and they're very happy here — a little hot right now, because we're having a heat wave. All right, interesting aside over; let's step into it.
I thought it was appropriate to start with this quote from Hadley, which states that most of data science is counting, and sometimes dividing. I find this rings true quite a lot, but I want to modify it a little, because I also think that most of text preprocessing is counting, and sometimes dividing. That is really what I want to talk about today.

So the first question we really have is: what are we counting? We have some text coming in, so what are we actually counting? Let's start with an example. I have this lovely animals dataset — it never made it onto CRAN, but it has a lot of really long-form descriptions of different animals. If we look at this example of what a beaver is, we see a lot of text: tabs, all sorts of things. I stated earlier that we want to count something in this text, so looking at it, we might think we can chop it up a little and count the pieces.

One thought might be that it's a good idea to chop it up sentence by sentence. But what you really run into if you chop everything into sentences is that most of them tend to be unique. The first sentence, that the beaver is most well known for the distinctive dams it builds in rivers and streams, is very likely a completely unique sentence, unless it's been copy-pasted from wherever I got this dataset, and any sentences that do repeat might not carry that much information. So we want to find a smaller unit that is easily countable.

Another thing we could do is count individual characters: letters, and in this case also punctuation and numbers. We could certainly do this; however, we lose so much information. We get a distribution of how often different characters appear, but it doesn't really tell us that much. It might help us a little with language identification, since different languages use different words, which in turn use different letters, so we get slightly different letter distributions. But I just said it: different words are used differently across different languages and domains. So maybe the idea of a word is something we can count, and depending on how we define words, and depending on how we tokenize, we may be able to use that information later down the line.
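To make these countable units concrete, here is a minimal sketch using the tokenizers package, which ships helpers for counting each kind of unit; the beaver sentence is paraphrased from the dataset shown on the slide, not copied from it.

    library(tokenizers)

    beaver <- "The beaver is most well known for the distinctive dams it builds in rivers and streams."

    count_characters(beaver)   # characters: countable, but low information
    count_words(beaver)        # words: the most common unit to count
    count_sentences(beaver)    # sentences: mostly unique, so counts rarely help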
Before I move on, I think it's very important that I give a disclaimer: everything I'll be talking about today will use examples in English, and I think it's very important, any time you're working with text, to follow the Bender Rule, which states that you should always say what language you're working with, even if it's English — because the statement "we solved question answering in text" is not equivalent to "we solved question answering using English text." Even though English can appear to be everything that happens in NLP, it's very important that we state up front what we're working with. Also, the difficulty of different tasks related to NLP and working with text will differ depending on the language: some tasks are easier in English and some are really hard, so it's very hard to say anything in a fully generalized sense.

And lastly, a small reminder that language is very much different from text. If you text with someone significantly younger than you, you might have noticed disagreements about how to text. I'm in that little middle spot between people older than me, who use ellipses as a natural pause, and people younger than me, who read an ellipsis in a text message quite differently. A lot of things differ, so just because we say we're solving something, it matters that we're working with text and not language. And, drawing outside of English again, there are many languages that don't have a written form: languages most likely start as spoken languages, and then only some of them gain writing.

So, the goal of this process of text preprocessing — and we say preprocessing because it's typically something we have to do before we apply our statistical and machine learning models — is the task of turning our text (imagine that big paragraph we saw before) into numbers. More specifically, it needs to become numbers because we need something machine readable: most statistical and machine learning methods require the input to be numeric one way or another, and even the methods that can accept things like factor variables will actually turn them into numbers down the line; it's just hidden from you as a user-interface nicety. What I like to point out about this text-to-numbers step is that there will be some kind of loss along the way. We lose some of the information, so it becomes a non-reversible transformation. Much like we lose information going from speech to text — we don't have the mannerisms, we don't have tone — we lose even more going from text into numbers.

I also want to point out that a lot of the things we're talking about today are language- and implementation-dependent, and here I mean language both in terms of the spoken language we're working with and the programming language. I'll be showing examples of how to do this in textrecipes, because — selfishly — I think it's one of the best implementations that can handle these things.

These are some of the existing packages for dealing with text in R, and I want to highlight two of them, tidytext and quanteda, because they deal with slightly different things. tidytext is a smaller package. It does a really, really great job at EDA, what is otherwise called text mining, and it also has some tools for topic modeling. tidytext works because the tidyverse already exists: it leverages a lot of tools from dplyr and the wider tidyverse, which lets us work with text seamlessly in data frames. So tidytext doesn't stand alone; it extends the existing tidyverse ecosystem. quanteda, by contrast, is more of a whole-ecosystem approach: it can do almost everything, starting all the way from ingestion and data preprocessing through different kinds of modeling, with everything happening in one framework. A tiny sketch of the tidytext counting style is below.
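This sketch builds a toy stand-in for the talk's animals data (the column names animal and text are assumptions; the real dataset isn't reproduced in the captions) and does the classic tidytext count:

    library(dplyr)
    library(tidytext)

    # hypothetical stand-in for the talk's animals dataset
    animals <- tibble(
      animal = c("beaver", "otter"),
      text   = c("The beaver builds distinctive dams in rivers and streams.",
                 "The otter swims in those same rivers and streams.")
    )

    animals %>%
      unnest_tokens(word, text) %>%   # long format: one row per token
      count(word, sort = TRUE)        # counting, and sometimes dividing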
This is where textrecipes comes in. textrecipes tries to handle a couple of different things. It deliberately limits itself to only doing preprocessing, or feature engineering: it only covers the step of taking the data and turning it into numbers. Everything after that, it doesn't want to do, and everything before, it doesn't touch either. It is part of the tidymodels framework in the sense that it's an add-on package to the recipes package, leveraging a lot of the hard work that has been happening in recipes; it literally latches on and uses all the great engineering that happens in there. Another thing: it doesn't expose any custom objects, so if you can use recipes, or learn how to, you can more or less seamlessly use textrecipes as well. It doesn't require a lot of knowledge of strange objects; it technically has custom objects, but they're all used internally.

And by being part of recipes, it doesn't restrict us to only using text as features. One thing I've noticed, especially in beginner material around the web: if you look at something like text classification, tutorials will only show you how to use text as features in some model later on, without mentioning that you can also use non-text features alongside. This is where textrecipes starts to shine. You might think, "oh, there are a lot of competing frameworks and he's developing yet another one," and I can't dispute that claim, but I'm hoping that what I have here is a worthwhile companion in the text-preprocessing ecosystem.

Just a couple of notes on why the existing systems didn't work for this. tidytext works really, really well when you're doing text mining, but if you want to repeat the same transformations later on, you run into issues: since the data is in a long format, you need to take a lot of steps to make sure you don't lose information going back to a wide format, which is something textrecipes handles. quanteda also brings its whole ecosystem with it, which would make it a little harder to work with tidymodels; you'd get a clash of the two systems. I have plans to integrate some of it later down the line, but for me it made more sense to build a smaller building block. And the most important thing these text steps do is that they can learn a transformation on some data and apply the same transformation to new data, and this comes for free by using recipes.

And just to limit everything so the talk doesn't end up being five hours, I'm limiting myself, and textrecipes, to working with tabular data: everything starts in tibbles and continues from there. It comes a little bit from my personal development philosophy of building a good foundation rather than always working at the cutting edge, especially because I feel that some of these ideas I'm talking about here aren't explained very in-depth in other places; that is what I'm trying to remedy.

If you look right here, we have an example of a recipe using the animals dataset. I just load it in, and you'll notice there are a lot of different steps. We start with some steps for the other variables — we have a factor variable, lifestyle, and a numeric variable, mean_weight, and we do some transformations on those — and then afterwards we do some transformations on the text. Everything works together, which is really the power of textrecipes.
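The slide's exact code isn't preserved in the captions, so here is a minimal sketch of a recipe in that spirit, assuming columns named text, lifestyle, and mean_weight; the particular steps and thresholds are illustrative, not Emil's exact choices.

    library(recipes)
    library(textrecipes)

    rec <- recipe(~ text + lifestyle + mean_weight, data = animals) %>%
      step_dummy(lifestyle) %>%                      # factor -> indicator columns
      step_normalize(mean_weight) %>%                # center and scale the numeric column
      step_tokenize(text) %>%                        # text -> tokens
      step_tokenfilter(text, max_tokens = 100) %>%   # keep only the most frequent tokens
      step_tf(text)                                  # tokens -> term-frequency counts

    baked <- rec %>% prep() %>% bake(new_data = NULL)

The prep()/bake() pair is the "learn once, apply again" property mentioned above: the same prepped recipe can be applied to new data with bake(new_data = ...).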
So what we're really doing is four steps, more or less, when we do text preprocessing. The first step is turning the text into smaller units; we'll call those tokens. Then we can modify the tokens, then remove some of the tokens we don't want, and lastly we turn the tokens into counts. These are the four steps text preprocessing is more or less about, and I'll go through them one by one.

First, tokenization. That is almost always the first thing you will do when you're working with text; really the only other thing to do first is to make sure that all your text encoding is handled correctly, because that can be a very big problem down the line. Tokenization is the task of taking a big blob of text and turning it into smaller units, something we can count. This is important because when we're working with text we generally don't have a fixed width to rely on. We have a little bit of that in something like Twitter, where we're limited to 280 characters, but what we want to count, which is typically words or other units, doesn't have a limit on how many there can be. So we want to find a way to count something, and this is where tokenization comes in.

The most common token is typically the word, so we want a way to go through our text and find out where the words are. But there are many different choices to take into consideration when you take your text and split it into words. We're very fortunate as English-speaking people, because the idea of what a word is is quite clear, and we can actually get quite far just by splitting our string on white space. You will still run into questions like: is "fire truck" one word or two? What about "afternoon"? There are a couple of words that feel like they might be one word but are technically split, and so on and so forth. But generally, we can split English very easily, and this is not the case for all languages. As Selena notes in the chat, things are a little bit different in Turkish; and in Chinese everything is different, because the notion of a word can be one or more glyphs. So tokenization in English is quite an easy task where the defaults work quite well; it can be a very different case in other languages.

Here I'm taking my beaver sample and just splitting it by white space, using strsplit with a white space regex, and we see that it works fairly well: we get things that look like words. There are some weird artifacts, like punctuation being kept — this token includes a trailing period — but more or less we've already done pretty well just by using spaces. If you've spent any time in this space, you've probably come across the tokenizers package. It provides a wide variety of tokenizers, and the default, tokenize_words, works really well: it produces things that really look like words and does a lot of the splitting for us. It is genuinely hard to define these rules by hand. tokenize_words uses the word-boundary algorithm from the stringi package, and here is a short outline of what it tries to achieve; the full documentation is many dozens of pages explaining all the little decisions about why it works the way it does, because there are a lot of edge cases to deal with. This tokenizer works really well, but it is a vastly more complex method than splitting on white space.
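The slide's comparison likely looked something like this minimal sketch (the beaver string is again paraphrased):

    library(tokenizers)

    beaver <- "The beaver is most well known for the distinctive dams it builds in rivers and streams."

    strsplit(beaver, "\\s+")[[1]]   # naive: punctuation stays attached ("streams.")
    tokenize_words(beaver)[[1]]     # stringi word boundaries; lowercases and strips punctuation by default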
With that in mind, there are a couple of choices to take into consideration. First of all: do we need to turn uppercase letters into lowercase? We saw before that when we just split on white space, "Beaver" stays uppercase in the first token, and when we use tokenize_words it is turned into lowercase. Sometimes that's a sensible default, but the general question to ask is: what information are we losing when we turn uppercase letters into lowercase ones? That will depend on your text. In English, an uppercase first letter usually means a proper noun or the beginning of a sentence. Does that matter? If you're working with text from people who really like Apple products but also really like to eat fruit, you might need the distinction between uppercase Apple and lowercase apples; sometimes you don't. That is a judgment call for you to make, because it decides which tokens do or don't get bucketed together.

Another thing is punctuation: how do you deal with it? A common default is just to remove punctuation, but sometimes it can be important as well, and sometimes removing punctuation can even split up words that were otherwise hyphenated. There are a bunch of different choices right there. Related to that: how do you deal with non-word characters inside words, characters that aren't part of the alphabet? In a word like "don't", the apostrophe marks a contraction, but should we keep it as one word or two? There are a bunch of different possibilities right there too.

And what do you do about compound words and multi-word expressions? This can be very important for things like proper names: "the White House" is, for many people, one unit — it's not quite a house, it is a specific entity — and if we're not careful about how we tokenize, that doesn't pop out easily. Those are all problems as well, and I don't have the answer to most of them; a lot of the choices here will be directly related to the data you're working with and what domain you're in. But I want to highlight that these are things that should be considered.

Another one: here we have a short vector of strings, "flowers", "bush", and "flowers", and we would imagine that if we count these, we get two flowers and one bush. But when I run it, we see two different "flowers". What is happening is that the first "flowers" contains a ligature, combining the f and the l into one character. Ligatures can look really, really nice in some kinds of typesetting because they're a bit more sleek, but it means that, by default, the computer doesn't know these are the same word, so these two "flowers" will not be counted as the same word. There are ways of dealing with this as well; one is sketched after this section.

A couple more things. How do you even begin to deal with slang and domain knowledge? To really emphasize what domain knowledge is: everything you see on Wikipedia will be vastly different from everything you see on Twitter. The way we speak differs; we don't have a word-count limit on Wikipedia; and if you use slang, punctuation starts mattering a lot more in social media posts and personal messages. So how do we lump things together? This is probably more of an after-tokenization step, where we say all these "wow"s — "wow", "wooow", "woooooow" — should be turned into one token. But do the different vowel counts matter? Sometimes the length matters. And typically there's only one person writing a given super-long "wooooooow", so if we don't lump them together, the signal we would otherwise have gets diluted, because all the slightly different elongations are counted separately.
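For the ligature case specifically, Unicode normalization is one common fix; here is a minimal sketch with stringi, where NFKC compatibility normalization folds the "fl" ligature (U+FB02) back into plain "fl":

    library(stringi)

    flowers <- c("\ufb02owers", "bush", "flowers")   # first element starts with the ligature

    unique(flowers)                    # the two "flowers" are distinct strings
    stri_trans_nfkc(flowers)           # NFKC normalization expands the ligature
    unique(stri_trans_nfkc(flowers))   # now just "flowers" and "bush"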
And lastly, to really show why text is a pain to work with all the time: how do we deal with emojis? Are emojis words? Are they modifiers? Here we have "let's get some tattoos", where we just take the word "tattoo" and replace it with the tattoo emoji. Most people can easily understand what that means; it's a simple substitution. The text "I love you" followed by a little red heart, everybody also understands, but the role that emoji plays is quite different: no one says "I love you heart" out loud. It becomes a modifier to the sentence, and it can change things quite a lot. The phrase "you're out of your mind" followed by an angry face is quite a bit different from "you're out of your mind" followed by a laughing face. A lot of the time, I think, emojis are used to inject the emotion and tone that we lost when something was turned into text.

What I'm really trying to say is that the domain you're working in matters. A text preprocessing pipeline that works in one framework is not going to work as well when you apply it to social media posts, or to job postings, or anything else. It just depends, and you need to know where the differences are; have someone on your team who understands what you're doing so you can adequately adjust.

The way textrecipes tries to deal with this is to acknowledge that there are many, many ways to do it: it gives you a couple of different defaults and helpful options, but it doesn't tie you down. So step_tokenize defaults to the tokenizers package; you can pass in your own tokenizers; and we also have bindings to other packages and even other languages. The default tokenizer gives you words. You can also pass along arguments: if you don't want to strip punctuation, or you don't want lowercasing, you can modify the underlying tokenizer. If you wrote your own tokenizer — say I wrote a function called my_tokenizer — you pass it in, that tokenizer will be used, and it will be saved in the recipe, so when you apply the recipe later, it will know what to do. We can also use the spaCy tokenization engine: spaCy is a wonderful Python library for working with text, and the spacyr package provides bindings in R, so we can use that underneath as well; it gives much richer tokenization, with some loss in speed sometimes. There's also the tokenization idea of byte pair encoding, which has a different idea of what a word is — it finds common strings of characters — and we have that as well. And we even have support for methods with trained models that tokenize: we can pass a udpipe model into the recipe, and that is the one that gets used. If there's any other tokenizer you want supported, please reach out; we're happy to add as many as possible.
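A minimal sketch of those tokenization options; my_tokenizer is the hypothetical custom function named in the talk, and animals is the assumed data frame from earlier.

    library(textrecipes)

    # keep punctuation and casing by passing options through to tokenizers
    rec_opts <- recipe(~ text, data = animals) %>%
      step_tokenize(text, options = list(strip_punct = FALSE, lowercase = FALSE))

    # a custom tokenizer is saved in the recipe and reused when it is applied later
    my_tokenizer <- function(x) strsplit(x, "\\s+")
    rec_custom <- recipe(~ text, data = animals) %>%
      step_tokenize(text, custom_token = my_tokenizer)

    # richer (but slower) tokenization through spaCy, via the spacyr package
    rec_spacy <- recipe(~ text, data = animals) %>%
      step_tokenize(text, engine = "spacyr")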
Stemming. So, "stemming" is the title of this section, but what I actually want to talk about is the broader step of modifying tokens once we have them. We started with a tokenizer that splits the text into many smaller things — typically words, but it can be other units as well, like syllables — and now we want to modify them in some way. The most common reason to modify them is to reduce cardinality: we have a bunch of different tokens, and some of them are kind of saying the same thing, like "house" and "houses". Does your model need a distinction between those? If you don't want the distinction, stemming is one option: counting a "house" and counting "houses" into the same bucket.

In English we have a couple of main approaches. The Porter stemmer is an algorithm outlining a series of steps to remove endings; most stemming takes a word and removes things from the end, which is a very English-centric thing to do, because in other languages beginnings can carry a lot of meaning. An even simpler stemmer is just removing s's at the end: a very simple algorithm, but it can also work fairly well. The way I like to think of it is that we have a bunch of different tokens, and each has a bucket we're counting it into; stemming is more or less the act of combining buckets that should be the same.

There are a bunch of different ways of doing this. Here I'm showing a few columns: in the first we see the original word; in the second we remove a trailing s; in the third we collapse plural endings; and in the last we apply the Porter stemmer. We see that there are differences in what happens to these words. Notably, the Porter stemmer doesn't always produce valid words in the end — it does this to help combine things; we see that a trailing y often turns into i, and the Porter stemmer can remove many characters — but it's all a way of combining tokens back down, and different methods do slightly different things. The remove-trailing-s algorithm is very English-specific, so it probably doesn't work for other languages, and the Porter stemmer is tailor-made for English, though similar algorithms have been written for other languages. This is just another example of why it's really important to have someone on your team who knows the language you're working with: if you applied this remove-endings algorithm to some other language, you might lose a lot of information, because that ending could be the kind of difference that changes everything for certain words.

These are very easy to apply as well. textrecipes defaults to the Porter stemmer as implemented in the SnowballC package. Fun factoid: it's called Snowball because originally they wanted to call it something like "stripper" (it strips endings), and someone told them that might not have been the best idea, so they found a potentially less offensive name and went with Snowball. So step_stem defaults to applying the Porter stemmer; you can also pass in your own stemming function, and if you just want to remove trailing s's, you can do that easily as well.

There's also the idea of lemmatization. Most stemmers are some kind of hard-coded algorithm, a cascade of if-else rules: very simplistic and, more importantly, very fast, because a lot of them can be written as a regular expression or two. Lemmatization is the next step up, where we use a trained model that has learned the structure and patterns of the language and reduces words to their dictionary form. We have a couple of implementations of this: spaCy ships a trained model, and udpipe is also a trained model. The way we do this: if we set the engine to "spacyr" in step_tokenize, we can later use step_lemma, and it will pull out the lemmas of those words. So the lemmatization actually happens at tokenization time when using spaCy; it's just stored and extracted this way, and you can do this for all our lemmatization methods — so far that's spaCy and udpipe, and I'm trying to add more as we go along.
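A minimal sketch of both approaches: SnowballC directly, and the textrecipes steps. The lemmatization variant assumes spacyr is installed and configured, and animals is the assumed data frame from earlier.

    library(SnowballC)
    wordStem(c("house", "houses", "happily"))   # Porter stemmer; stems need not be real words

    library(textrecipes)

    rec_stem <- recipe(~ text, data = animals) %>%
      step_tokenize(text) %>%
      step_stem(text)            # defaults to the SnowballC / Porter stemmer

    rec_lemma <- recipe(~ text, data = animals) %>%
      step_tokenize(text, engine = "spacyr") %>%   # lemmas are computed at tokenization time
      step_lemma(text)                             # then extracted here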
And as with the other steps, you can pass custom functions to step_stem to do any kind of token transformation you want.

The next section is called "stop words", but I'd say removing words is a controversial topic, mostly because there's a lot of ambiguity about what it really means, and that is really my motivation for talking about it. We want to remove words; so what are stop words? To explain, let's just go to the internet, because that's where everybody gets all their information all the time. The first definition: in natural language processing, useless words are referred to as stop words. The second: in computing, stop words are words that are filtered out before or after the natural language data are processed. The third: stop words are words in any language which do not add much meaning to a sentence; they can safely be ignored without sacrificing the meaning of the sentence. This is how I tend to look when I read those statements, because they give the illusion that working with stop words is very easy — that you just remove the words and there's no problem whatsoever. And some of these excerpts were from larger articles or even books, where that is most of the discussion about stop words you'll get. In Julia's and my book, we dedicated a whole chapter just to stop words, because we felt they really needed that kind of attention.

So what is a stop word, in my view? I think stop words are low-information words that contribute little value to the task at hand, and I think the information a word carries lives on a continuum. There are two things I want to pull out of that sentence. The first is "low information": I'm deliberately not saying no information; it's low, not zero. The second is that the value added depends on the task. Let me show what I mean with some pictures.

Look at this text on the right side. I'm doing a visualization where each rectangle represents a word: high-information words will be bright, and low-information words will not be bright. The first thing to think about is whether all the words contain the same information. They don't — some words definitely contain more — and it doesn't appear to be random which ones do. There's no single way to characterize it, but one possibility is a high-variance distribution of information, diamonds in the rough: most of the words carry little information, with the occasional word here and there carrying a lot. Another possibility is a low-variance kind of distribution, where we get little clusters of high information. These are just different ways of thinking about it.

Ideally, the way we imagine it: we have these words, some have a lot of information and some have less and less, and removing stop words draws a clean line — we remove the low-information words and keep everything else. In reality, you'll probably end up with something like this:
you're choosing a cutoff, and you're catching most of the low-information words, but unless you're very, very careful, you will also pick up some high-information words. You can always slide the cutoff around to try to hit different regions, but it's hard to find the right threshold.

We handle this in two ways, or a combination of them: pre-made lists, where you find a stop word list out there and pull it in; or hand-made lists, where you go through your words, make the list of all the words you don't want to include, and remove those. There are problems with both approaches. Hand-made lists are, in general, well constructed with very little ambiguity, but they take work. And there are quite a lot of different English stop word lists out there; this slide shows just some of them. Another thing that can become very important is that stop word lists are sensitive to the way you tokenized, the way you capitalize words, and whether you did stemming before or after. So if you're using a pre-made stop word list, make sure it matches everything that happened in your pipeline up to that point. And if you ever need to work with a non-English stop word list: basically, hire someone who knows that language. That goes generally — if you work with text, have at least one person on the team who actually knows the language, otherwise you get into a whole host of trouble.

What it comes down to is: I want you to look at your stop word list. If it weren't for time, I would let this slide repeat five more times with the words jumping out at you, because this is really, really important. You need to look at your stop word list, and I have a little quiz to explain why: the stop word quiz. I'm giving you half a minute to type in the chat which one you think is the odd one out in this list of words — which of these four words doesn't appear to have a relation to the other three. (Oh wait, I started the wrong timer... there, now mine is going.) The odd one out here is "she's", and it sticks out because it doesn't appear on the SMART stop word list, whereas the other three do. That already seems weird: the words on this slide feel equally removable, yet they don't appear together on the list.

Next we have another set of four words; again you have half a minute if you want to join. I'm looking for the one word that doesn't appear to match the others, and there's no way of actually knowing which one it is unless you've seen this talk before, because none of them feels like a no-information word. The answer here is "fify" — a misspelling of "fifty" — which was part of the stop word list included inside scikit-learn, a typo that went undetected for three years. So everybody who used that stop word list through scikit-learn removed "fify" but never "fifty".

And for the last one we have these long words: "substantially", "successfully", "sufficiently", and "statistically". I would say that all of these appear to be highly informational; it seems weird that any of them would be anywhere close to a stop word list, but nevertheless, that's why this question is on the quiz. Which one is the odd one out? It's "statistically", as was correctly found in the chat, because it doesn't appear in the stopwords-iso list, whereas the other three do. So if you're using the stopwords-iso list, you're removing quite a substantial list of words.
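"Look at your list" is easy to operationalize; a minimal sketch with the stopwords package, where the point is precisely that sizes and membership vary by source:

    library(stopwords)

    snowball <- stopwords("en", source = "snowball")
    smart    <- stopwords("en", source = "smart")
    iso      <- stopwords("en", source = "stopwords-iso")

    length(snowball); length(smart); length(iso)   # the lists differ widely in size

    "she's" %in% smart         # check for surprising omissions ...
    "statistically" %in% iso   # ... and surprising inclusions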
So the general idea of removing words is: remove the words you don't want. One approach is to remove high-frequency words, which tend not to carry a lot of information — words like "a", "and", "or", "it", "they", "that", "was" appear a lot while doing little work beyond gluing the sentence together. We can also remove low-frequency words: if a word appears only once in a hundred thousand documents, can you really use it as signal for later modeling, or is it noise? And textrecipes can remove stop words: step_stopwords removes the snowball set by default. I was very hesitant about having a default at all, since I don't think removing stop words is always necessary, but the snowball list is the smallest and most commonly known one. You can also specify other lists — this uses the stopwords package, so we have lists for other languages as well — and, as always, I recommend creating your own; it works better.

You can also filter according to properties of the data. Here we say: only keep the tokens that appear at least 10 times, and also remove tokens that appear too many times. These are very heavy-handed methods, so you need to make sure you're not removing things you shouldn't. And for computational reasons we can filter to only keep the most common thousand or two thousand words, simply because when we turn this into a table at the end, if we don't filter the number of tokens, we get a very, very large sparse data set that sometimes doesn't fit in memory.
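A minimal sketch of those removal and filtering steps; the thresholds are the illustrative ones from the talk, not recommendations, and animals is again the assumed data frame.

    library(textrecipes)

    rec_filtered <- recipe(~ text, data = animals) %>%
      step_tokenize(text) %>%
      step_stopwords(text, stopword_source = "snowball") %>%   # the default list; swap via stopword_source
      step_tokenfilter(text,
                       min_times  = 10,      # drop very rare tokens
                       max_tokens = 1000)    # cap the vocabulary for memory's sake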
Now we come to the last step, which I call token embedding — not the same thing as word embeddings — which is basically turning tokens into numbers. We can do this in a couple of different ways; for time, I won't go too much into the math, but hopefully it makes intuitive sense. The first thing we can do is count. Using step_tf (term frequency), we just ask: how many times does each word appear in each observation? You see right here that the word "table" appeared seven times in this observation, zero times in this one, four times in this one, and so on. We can also do a binary count, basically asking: did this word appear or not? Sometimes it doesn't matter whether a word appears once or a hundred times; as long as it appears at all, it's signal. That's one way of doing it.

Another way is something called TF-IDF: term frequency, inverse document frequency. We take how often a word appears and multiply it by a measure of how rarely it appears across documents. If a word appears in basically all the different documents or observations, it gets a really low IDF and doesn't weigh as much; if it appears in only a small subset, and starts appearing a lot there, it gets weighted up. It becomes a kind of normalization. There are a bunch of different ways TF-IDF can be written, and we have most of the variations in this step as well.

There's also the idea of feature hashing. Before, we had our tokens and we counted when they appeared, but we had to exclude a lot of tokens, because otherwise we'd have something like half a million columns, and we can't do that; it would be super, super sparse, because most pieces of text don't contain more than a small part of the possible vocabulary. With feature hashing, we take each token, run a hash function on it that maps it to a number, and then bin it according to that number using a modulo. So we hash all the different words, take them mod 1024, and each word drops into a bucket. By increasing or decreasing the number of buckets, we trade off collisions — multiple different words can end up in the same bucket after the modulo — so it's a balancing act that gives us a medium-sized sparse implementation. The main downside to feature hashing is that you can't go backwards from a bucket to the words.

We also have classical word embeddings; this targets models like word2vec, fastText, and GloVe. The super-simplified view is that we map all tokens into a multi-dimensional space, so a point in space represents one token, and the space is built to satisfy certain conditions on distances; we can then use the vectors that represent each word later down the line. In our framework, since we're working with tabular data rather than sequences, we can't really use the vectors individually per token, so the method in textrecipes finds the embedding vector for each token in a document and combines them: we can sum them, take the mean, and so on. Aggregated embeddings lose a lot by construction, but sometimes they can be helpful. So we have step_word_embeddings: you provide the embedding and say how to aggregate it, though again, this often doesn't work as well as you'd hope.

To round this out: one of the reasons I really like these token-based methods is that we get a great deal of interpretability out of them. There's been a lot of talk lately about algorithmic bias, and a lot of it relates to large language models; one of the issues with large language models is that it's very hard to tease out how they make their decisions. One of the benefits of everything I've shown so far is that most of it is count-based, and we also constrain how many things we count, so it's feasible for a human to go through all the different words we're considering and ask: is this a word we should consider or not? We can inspect the model a little more easily when we use these simple token-based methods. And a lot of the time they work pretty well; my general modeling advice is to always start low, do something simple — even start with a linear regression model — and build up. Sometimes you don't need the multi-headed, multi-layer deep neural network; sometimes something simpler will do it for you.

And everything I've talked about works as a preprocessing step, so everything you do after it is shaped by what was done before. Whether you're doing unsupervised or supervised modeling, it matters how you do this last step of turning the tokens into numbers: you will produce something very sparse or something not sparse at all, and the choice of model later down the line will make a difference.
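A minimal sketch of the output flavors discussed above; glove_tbl is a placeholder for a pre-trained embedding tibble you would load yourself (first column of tokens, then one numeric column per dimension), and the other names are as assumed earlier.

    library(textrecipes)

    base <- recipe(~ text, data = animals) %>%
      step_tokenize(text) %>%
      step_tokenfilter(text, max_tokens = 1000)

    base %>% step_tf(text)                             # raw counts
    base %>% step_tf(text, weight_scheme = "binary")   # appeared or not
    base %>% step_tfidf(text)                          # term frequency * inverse document frequency
    base %>% step_texthash(text, num_terms = 1024)     # feature hashing into 1024 buckets

    # pre-trained vectors aggregated per document
    base %>% step_word_embeddings(text, embeddings = glove_tbl, aggregation = "mean")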
If you want to know more about all of this: as I said at the very beginning, Julia and I wrote this book. It goes much more in depth on everything; what I talked about today covers roughly the first third of the book, more or less. The rest goes into much more detail on all the choices you need to make and how to remedy the problems, and then we go into how to use text in a supervised sense, with both deep learning and non-deep-learning methods. It's available for pre-order now at most any online retailer that sells books. That is all I have. You can find me most everywhere online as Emil Hvitfeldt on all the platforms, and of course all the slides were created with xaringan. Thank you very much.

A big round of applause, virtually, everybody; let's see those clap emojis in the chat, and I'll make sure everyone can see my little golf clap here. I see a few people clapping in there. Unfortunately we can't do what we do in person and give you real applause, but everyone here is clapping for you, so that's very nice. I see we have a few questions, and I need to ask the last question first because it's fresh and top of mind: please say your last name again, slowly, so this person can sound smart saying your name.

All right, so it's Emil. If you're anywhere close to the southern border you've heard the name Emilio; it's that without the "io", that's the easiest way I've explained it to people. And my last name is Hvitfeldt: silent h, silent d.

Can you spell it phonetically in the chat?

No, I cannot. And it's a fairly rare name where I'm from as well, so that doesn't help.

The person who asked that asked privately in the panelists tab, so I'm not going to name who it is, but I hope you got that. Did Ken Brooks get it right, "a veet-felt"?

Yeah, I think if you read it like that, most Americans will get it right.

All right, cool. Now, questions about the talk. Is there a recommended order of operations between normalization tasks, like lowercasing, removing punctuation, and tokenization?

So those two specifically don't matter, because tokenization isn't affected by capitalization in English; for those two it doesn't really matter. But the order does matter when you're working with stemming and stop word removal. Most stop word lists work on unstemmed words, but every once in a while you will have a stop word list that works on stemmed words. Another thing — I think this happened in the scikit-learn list — is that words only two characters long or shorter were automatically removed, which of course you have to know beforehand, because everything else breaks down if you don't. So a lot of it comes back to: really get to know your stop word list. Look at all the words: do they make sense, and do they actually match your pipeline? And look at the tokens you have left after you remove the stop words, to see whether some of them slipped through.

Okay. It's funny you mention looking at the stop word list; I remember the tidytext package had ways to exclude stop words. Am I getting that correct?

Yeah, the tidytext package has a function, I think it's called get_stopwords, that returns a tibble of stop words — I think it includes several sources, SMART and snowball among them — and you just use an anti_join on your tokens. If you're using tidytext's combined stop_words dataset, though, that's a pretty big stop word list compared to the individual ones, so you might remove words that you maybe didn't want to remove. In particular, that combined list will remove basically all the pronouns, so if you need pronouns for some reason, don't use that list.
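A minimal sketch of that tidytext pattern, using the same hypothetical animals tibble as earlier:

    library(dplyr)
    library(tidytext)

    animals %>%
      unnest_tokens(word, text) %>%
      anti_join(get_stopwords(source = "snowball"), by = "word")   # drop list members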
All right, this one is coming from me now: would you say textrecipes supersedes tidytext?

No, it works completely alongside it. I wouldn't recommend textrecipes as an EDA tool, because it's not that transparent about what happens — looking at the intermediate steps is quite hard — but once you know the transformations you want to do, it puts them in a rigorous framework and does them well.

All right. The next question: any plans to support Hugging Face's tokenizers in the future?

I have looked at it; I haven't had time to dig in yet. If someone already has an R package that binds to it, I can add support for it. If I have to add support for the Hugging Face tokenizers myself, it will take proportionally longer, but I would like to have support for them, because they do really great work.

So it sounds like something John Harmon should do — he's in the chat. John Harmon, I feel like Hugging Face is your thing, so I hope you'll make this contribution: you make a package for Hugging Face, then Emil can use it in textrecipes. See, open source working together.

Yeah, and just to hang onto that: I'm trying to rewrite as little code as possible. The tokenizers package, for example, has a lot of great infrastructure I'm not going to touch; I'm just going to use it, to limit my load.

That makes sense; that's the good thing about reuse and open source. In the part where you talked about removing "fify": is there a step for spell checking?

I have been thinking about doing it, and it isn't really that hard — it's just another modification on tokens — so it probably will happen; it's on my list of things I want to add. It's always a little scary, though, because you don't have a human overlooking what the spell checker changes, so it might silently push the wrong text through if a bad correction takes over. That would be scary to put into production, but I will add it eventually, because some people want it.

That's a good point. A question about your book: does the book cover transformers?

No, it does not; it stops just shy of that. It's a much simpler book — we start down low and build up, and we don't go that high.

It sounds like a second-edition thing. Plan ahead: got to plan for the next edition.

You're also limited by what you can do easily enough in R, so that limited some of the choices we had.

Let's see, there's a recent question: if you have new documents that you want to compare to an existing corpus, is there a way to incrementally add them to the embedding, or do you have to do the whole thing all over again?

If you're thinking of something like online learning — updating an existing recipe — that is not a textrecipes limitation; it's a recipes limitation. So if you have more training data, you need to retrain the whole thing. Of course, if you're talking about new data that just needs the same transformations, you can always apply your prepped recipe to the new observations and do the transformation there.
Okay. There was talk about spacyr. Is that a wrapper around the Python library?

Yes, that is spacyr. It's an API wrapper that uses spaCy underneath the hood. It can be a little hard to work with, simply because it's really hard to install Python when you're working from R, but once you get it installed, it works well. I have also spent a couple of hours trying to write a more minimal wrapper around spaCy instead of going through the spacyr wrapper, because spacyr tries to do everything spaCy can do, and I don't need everything, I just need these pre-processing steps. If I can do that, we might have a slightly faster implementation.

Here's another question that just came in: where does part-of-speech tagging come into pre-processing?

So we can do that as well. If you're using spaCy, or any tokenizer that adds part-of-speech tagging, there are steps for that, and they would appear fairly soon after you tokenize. But it's almost an enrichment step, since you're taking some of the tokens and modifying them, so it sits somewhere in the middle, and it shouldn't really be affected too much by what happens around it.

Okay, so somewhere in the middle. That reminds me of a question: there used to be a package I used called MITIE, I believe, an information extractor. I guess my question is: are information extractors, the old-fashioned ones made seven years ago, still relevant today, with embeddings and hashing and everything else?

I don't know if I'm the right person to answer that.

Yeah, I think that's the correct answer. Here it is: it's MITIE, from MIT NLP, the MIT information extractor, and it looks like, oh, there was a commit a month ago to the repo. So maybe it is still being actively developed.

All right. And then there's a question about udpipe, or however you pronounce that: what does that get you versus some of the other engines?

That engine is a trained model, much like spaCy, just a different kind of model, but it gives you, I think, part-of-speech tags and lemmatization as well, so it's a more enriched tokenizer. Because it's a trained model, though, you need the actual trained model object as part of it.

All right, and is that different from "udp model", or is "udp model" a typo?

It might be a typo.

Okay, might be a typo in the question.
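For the spaCy-backed route mentioned above, here is a hedged sketch of what such a pipeline can look like in textrecipes. It is an illustration rather than code from the talk, and it assumes a working spacyr installation, which in turn needs Python and spaCy underneath:

    library(recipes)
    library(textrecipes)

    docs <- data.frame(text = "The quick brown fox jumps over the lazy dog")

    rec <- recipe(~ text, data = docs) %>%
      # the "spacyr" engine attaches part-of-speech tags to each token
      step_tokenize(text, engine = "spacyr") %>%
      # keep only the nouns, using spaCy's universal POS tags
      step_pos_filter(text, keep_tags = "NOUN") %>%
      step_tf(text)

    # prep() will only succeed if spacyr can initialize spaCy
    bake(prep(rec), new_data = NULL)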
We just got another question in: what are some of your favorite feature engineering examples using text?

Well, we have these general steps, which are almost the vanilla ones where we're just counting things, but if you can find some kind of domain knowledge, you can always craft other features: can I count the number of nouns, things like verbs. Basically, you don't need to use only these token counts; you can also add in other functions that do different things, like counting punctuation or calculating proportions of things. In the book we have an example where there's a lot of censoring going on in the documents. Of course, if you knew what the data was, you might be able to work out what was censored, but we had a lot of censored credit card numbers, so they always appeared as four digits, four digits, four digits, four digits. If you did a normal tokenization on that, they would very rarely repeat, but you could write a function that just counts how many times these patterns appear, so instead of having ten to sixteen different credit card tokens, you combine everything into one, and you might get some signal that way. So I really like the handcrafted ones, which are all dependent on whatever you're doing.

The ones that take the most work.

Yeah, they're the hardest ones to write. And that's another thing: right now I'm not lucky enough to work with text on my day-to-day, so I don't have many recent examples.

Makes sense. One day. Anyone who has text, just throw a bucket of text our way; let's get all the text we can and send it to Emil.
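A minimal sketch of that kind of handcrafted feature, with a hypothetical censoring pattern invented for illustration rather than taken from the book's actual data: it collapses many distinct censored card numbers into a single count per document.

    library(stringr)

    docs <- c(
      "payment with XXXX-XXXX-XXXX-1234 failed",
      "cards XXXX-XXXX-XXXX-9876 and XXXX-XXXX-XXXX-1111 on file",
      "no numbers here"
    )

    # hypothetical pattern for a censored card number; adjust to your data
    card_pattern <- "X{4}-X{4}-X{4}-\\d{4}"

    # one numeric feature per document instead of many unique tokens
    str_count(docs, card_pattern)
    # [1] 1 2 0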
All right, I will give everyone else one last chance to ask a question. If you have one, now's your time to slip it in right before the buzzer, because we are going to call it in a little bit and just do some wrap-ups if no one else has a question. I'll give another 30 seconds while we wait.

While we wait out these 30 seconds for a possible question: I want to say thank you, Emil, this was wonderful. And "slide it"... I was trying to say SLICED.

Oh, it's SLICED, yes. If you want to talk about SLICED, I know Julia's been on there and David Robinson has been on there, so yes, go ahead, tell us about SLICED.

So, if you ever watched the show Chopped (I've heard it's a similar format, because I never actually watched Chopped), it's a live two-hour data science competition. They have four contestants, they all get a data set that's too hard to fully do something with in the time, and there are hidden features, so they get special points if they do certain things that they don't know beforehand they need to do, and then they're judged. We go wild in the chat. And I think they're starting in half an hour on Twitch.

Yeah, I forgot that it was tonight, so yeah, folks, go tune in. Is it David Robinson and Julia on it again tonight?

Not today.

I know Josiah was on it not too long ago.

Josiah two weeks ago, Julia a week ago, and next week we have Jesse Mostipak and David Robinson.

A lot of friends. So folks, these are people that have spoken at the meetup and the conference and attended a lot, so even if you can't see each other in person yet, you can still see each other this way. Find them in the chat.

Yes. And there's one more question we'll sneak in, we'll allow this last one: do you have any general thoughts about function words, i.e. pronouns and prepositions, as a signal?

They are sometimes useful, depending on what domain you're in. If for some reason you really need to know someone's gender, which you probably rarely need to do, pronouns are a great feature for that, but for most everything else, pronouns don't mean much. And one of my takes is that you rarely need gender anyway; it's very rarely used at all, so likewise, pronouns don't help us that much. But they can be indicative of the type of text: some domains barely use pronouns, and others use them a lot. Job listings tend not to use pronouns, while a group chat probably uses them a little more, because we're talking about each other. So, as with anything else: it depends.

The favorite lawyer's answer, it depends. Okay. Right, so, once again, a virtual round of applause.

Thank you, thanks for having me.

Thank you for being here. And this all happened, by the way, folks, because we had a virtual meetup a couple of months ago and I saw Emil was in attendance. I said, hey Emil, send me a message, you've got to give a talk. And look at this, it happened.

And I did, I did.

Yes, it actually worked. And this happens a lot: there have been a number of meetups where I saw someone in the virtual crowd, said hey, send me an email and give a talk, and it worked. So thank you, Emil, for sending that email and being bold and following through on it.

Everyone, remember we have a bunch of events over the next few months: the Stan workshop July 14th through 16th, Sean Taylor on July 13th, Ian Cook in August, and then the conference workshops on September 1st and the conference itself September 9th through 10th. So thank you everyone for being here. We'll announce the next meetup soon, and this talk will be posted online very shortly. Hopefully I'll see a bunch of you in person soon, and for those who can't make it, we'll still see you virtually. If anyone wants to give a talk, send me a message. Everyone, once again, another big round of applause, as loud as you can make it even though I'm not going to hear it, for Emil. Thank you very much.

Thank you. Thank you, everyone. Have a good night.
Info
Channel: Lander Analytics
Views: 490
Rating: 5 out of 5
Id: kjA7LwaYYfM
Length: 87min 49sec (5269 seconds)
Published: Wed Jun 16 2021