Kaggle Live-Coding: Setting Up NLP Pipeline | Kaggle

Captions
Starting the stream... no, sorry, I'm having a big problem resizing windows today; they keep going fullscreen instead of getting smaller. I'm real tired. I don't know, maybe you can tell that it's been a long week. We had a very fun week, I enjoyed it a lot, but fun stress still counts as stress, as I discovered in graduate school. What were we doing? We did the SQL Summer Camp this week, which was super duper fun; I really enjoyed it, lots of great discussion. I'm going to try to spend a big chunk of time on the forums today answering people's questions. I was sort of monitoring yesterday, but I didn't get as much time as I wanted to chat with folks, because there's just so much going on. We've got lots of fun surprises in store for you, and I'm looking forward to being able to share all of them, including some technical updates that I think will be really nice and that everyone will get a lot of benefit from.

Also, if you didn't know, next week is a holiday in the United States: the 4th of July, United States Independence Day, which is on Thursday. Pretty much everyone in the United States is going to be unavailable on Thursday and Friday, including me, so there won't be a stream next Friday. I might prerecord something to go up in this time slot; maybe, we'll see, I've got a lot on my docket right now.

All right, let's talk about coding. Let me head over to my kernels; there are going to be a bunch of SQL kernels here because we've been working on those. I never made this one public, so let's make it public so other people can see and access it. Is that option not available if you're zoomed too far in? That seems like a bug. Okay, that is public now, excellent, ten out of ten.

We have been working on a series of projects. Cynthia says hi from Bangalore. I don't know much about Bangalore, but I know it's like the Silicon Valley of India, so that seems like a very reasonable place for someone interested in live coding to be from. Okay, we have been working on a number of kernels, and I want to make sure this is the most recent one I actually wanted... okay, this is just the keyword extractor, so let's see if I can find the one with the clustering; I think that's this one. We've been working in a number of kernels and we've just sort of been futzing about. I believe we wrote more code last week, so let's see; it's possible I didn't actually commit it and it's just sitting in the kernel. Hello, Naman.

We are working on (sorry, this is the most circuitous introduction ever) a way to summarize Kaggle forum posts. If you've been on Kaggle, we have a number of forums under the discussion tab: the Kaggle forum, getting started, product feedback, questions and answers, datasets, and learn. Every day I read all of the forum posts in all of these forums, maybe not the last couple of days because I've been really busy, but the number of forum posts has been increasing very, very quickly. If we look at the Meta Kaggle dataset, which is basically the dataset of Kaggle's own data (we obviously don't include private user data, but if you want to know anything about Kaggle, please use this dataset that we've already cleaned and prepared for you instead of scraping; there's no need to scrape, most of the data should be available), and if we look at the number of forum messages over time, yeah...
Oh yeah, this spike here... okay, that must be a bunch of competition forums that caused it. What was I saying? No, I made the window big; I just want my window to be small and in the place where I want it, and I also want all of the messages. Hello to everybody who's saying hello. So I joined Kaggle around, I think, summer of 2017, which sounds about right, so I would have been around here when I started, and now we're up here, and as you can see the rate of growth is very, very fast. I can't keep reading all of the forum posts myself. I mean, I'd love to, but it's becoming a bigger and bigger part of my day every single day, and it would be really helpful if I still knew what people were saying, but in a way that's faster for me to process and faster to pass on to the team members it's relevant to.

So what we're doing is building a project to take the posts from these forums (we can actually see the text of some of them here; this person is saying "here are some papers that analyze Eurovision voting patterns, you might find some of them helpful," with citations to those papers), and if I could just have a way of saying "hey, a lot of people are talking about Eurovision right now," that would be much more helpful for me going forward. That's the basic idea of the project we're working on, and... yes, excellent, this is the code we wrote last week, I just didn't commit it.

So far we've been working with a bunch of pieces separately. YAKE is a keyword extraction algorithm; "Yet Another Keyword Extractor" is what YAKE stands for, and this is the algorithm we've been using as a sort of data cleaning step and also a little bit of dimensionality reduction. What this does (where is their nice little demo page... demo, there we go) is take a big document and select out the most important keywords, and it does this in an unsupervised way, so you don't need to do training ahead of time, which is really helpful because I don't want to be hand-tagging things and managing training myself. I think there are some example documents here; yeah, this one is actually about Kaggle. We can set the n-gram size to 3, which looks at sort of moving chunks of up to three words each, and that gives us keywords like "Anthony Goldbloom", "CEO", "Hamner CTO", "machine learning platform", "Google Kaggle", "Kaggle Google"; it's about Kaggle being acquired by Google. So that's been working pretty well; we've had pretty good results with that.
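For anyone following along at home, here is a minimal sketch of that keyword-extraction step using the open-source yake package. The parameter names (lan, n, top) are from the library's documentation as I recall it, and the sample text is just paraphrased from the demo snippet mentioned above, so double-check both against the current release.

```python
# pip install yake
import yake

post = ("Google is acquiring Kaggle, a platform that hosts data science and "
        "machine learning competitions, founded by Anthony Goldbloom and Ben Hamner.")

# n=3 considers candidate keywords of up to three words, matching the
# n-gram size used on stream; top=10 keeps the ten best candidates.
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)

# Each result pairs a keyword with a relevance score (lower is better);
# note that very old yake releases returned the pair in the opposite order.
for result in extractor.extract_keywords(post):
    print(result)
```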
We also did a little bit of tokenization: we made all the words lowercase, went through each post, got its keywords, treated all of those keywords as a single sentence, and put that in as input to the Brown clustering algorithm. Brown clustering is, again, an unsupervised clustering algorithm that assigns words to clusters. Here we have the example words that are similar to "kaggle", and this measurement is mutual information, which you might know of if you've done much information theory, or if you were in NLP in, like, the aughts and '90s. Thiago asks whether YAKE works for multiple languages. It should; by default it assumes the text is in English, but I believe that's a parameter you can change. It should be multilingual, which is something we want too, because sometimes people post on Kaggle in languages other than English, and I still want to know what y'all are saying.

All right, so here are the words and here are the mutual information values. Last week we tried to figure out how to actually get the clusters out, and what we found is that one big cluster was being returned as the output of this particular algorithm we're using. We found this code on GitHub, and it is licensed under the MIT license; here is the link and here is the additional information if you're interested. The repo is yangyuan's brown-clustering. It assigns words to clusters, again in an unsupervised way, and it's quite thoroughly unsupervised: it's not even like, say, k-means, where you need to specify the number of clusters you think you're going to have going in. The downside is that it's a little slower and doesn't scale super duper well, which is part of the reason why doing the keyword extraction first helps: if you only keep the keywords, you have fewer total words.

Someone asks about non-English letters, such as the Swedish letters å, ä and ö. I believe you may need to update the language models for Swedish, but if you're using Python 3, and Kaggle only supports Python 3, you shouldn't have a problem with Unicode characters, in theory. Colin says this is the third class. We've been on this project for a while, but I expect people to drop in and out, so I try to do a little review at the beginning; it might even be the fourth one, we've been on this project for a little bit.

All right, let's scroll down a bit so this is a little clearer. The data object we got out was an array with all of the clustered text, given to us as a single object with no way to tell the difference between the clusters. But when we looked at it, we saw that the words were sort of alphabetically sorted, and every so often the alphabetization would reset, and my assumption is that every time it resets, that should be a cluster boundary. So last week we wrote about eight lines of code, here they are, to look through this big cluster we get out and break it into its individual clusters as a list, so that each cluster is a nested list: each cluster is a list of the words in that cluster, inside a bigger list. If you look at the first four clusters, we can see things like "data", "good", "the", "href", "and"; and actually let's look at a couple more: "to", "of", "test", "time", "file". If we look at, say, cluster 50, that's just the word "score". Some of them do get longer. I'm wondering if I've messed something up; I don't know if this is right.

Uh-oh. Sunil mentions something that should help; sorry, that should help with the echo. Colin says, "I've been following along and think I missed a class because I can't follow this; have all the classes been at this time?" Yeah, they're on Fridays at 9 a.m. Pacific, which I think is 9:30 p.m. IST and 4 p.m. GMT.
They happen every week except probably next week, because I'm going to be out of office; my husband and I are going up to the mountains where they don't have Wi-Fi, and I'm really looking forward to it.

Okay, I think we've messed up. Just looking at these, and assuming they're added in order, it looks like the boundaries are shifted by one: "congrats", "gold", and "your" look like they should be in the same cluster, and "congrats", "I", "on" should be in the same cluster. So the boundary for each cluster has been shifted by one, and I think what we need to do is scooch it over by one. I think we might run into issues; we might get an error the very first time through the loop... nope, it worked fine. There we go. Okay, so now if you look at all these words, you can see they're in ascending alphabetical order, although these don't really look like clusters; they look like words extracted from individual posts.

All right, my plan for today was to take this working algorithm, combine the keyword extraction and the clustering into maybe a couple of scripts, set those up in a script kernel, add them to another kernel, and have a little pipeline set up, but I don't know that it's going to be helpful yet. Questions: can YAKE extract non-English keywords? It can. You shouldn't need to change the algorithm, but as someone mentioned, you do need to change... let me see, I'm looking for the... there we go: you can see there's a parameter here where you pass in the language. I don't know what languages it's been tuned on; it is open source, so you may need to retune it a little for your particular source language or mix of source languages, but yes, it should work on non-English data as well. Sunil says to change the range in the for-loop; I think that's the thing we just changed, hopefully. And can I give any info on recent successes in one-shot or zero-shot learning? No, sorry, it's not something I've been paying a whole lot of attention to; maybe I should look into it. Most of the zero-shot learning I'm familiar with (my background is in natural language processing, so I'm much more familiar with language-based work than with computer vision) has been zero-shot learning for multilingual translation: if I have a lot of information about how to translate from, say, English to French, how does that help me translate from Telugu to Japanese? That sort of question.

All right, let's play with the clustering parameters. Right now we have it at four; let's set it to three, because that's the n-gram length the keyword extraction assumed. This one's going to take a minute, and I don't know if having those match will give us better results, but maybe it will, who knows; we're just trying stuff out. Oh, and you can see, I mentioned that we have things other than English; I definitely saw some Chinese and some Korean... yes, we've got some Chinese and a little bit of Korean in here. Like I mentioned, we have languages other than English; a little more Korean here. One of these days I'm going to learn how to read Hangul; it's on my personal to-do list. All right, let's see how that changes our clusters. That still looks pretty good; those are words I would expect to be associated with Kaggle.
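For reference, here is a sketch of that cluster-splitting logic as I understood it from the stream: detect where the alphabetical ordering resets and start a new cluster there, with the reset word itself beginning the new cluster (the shifted-boundary fix just described), and keep the final cluster rather than dropping it. The function name and details are mine, not necessarily what is in the kernel.

```python
def split_mega_cluster(mega_cluster):
    """Split the single flat word list the library returns into separate clusters.

    Assumption from the stream: words within a cluster come out in ascending
    alphabetical order, so a word that sorts *before* the previous word marks
    the start of a new cluster.
    """
    clusters = []
    current = []
    previous = None
    for word in mega_cluster:
        if previous is not None and word < previous:
            clusters.append(current)  # close off the finished cluster
            current = []              # the "reset" word starts the next one
        current.append(word)
        previous = word
    if current:                       # keep the last cluster too
        clusters.append(current)
    return clusters

# split_mega_cluster(["and", "data", "good", "the", "file", "test", "time"])
# -> [["and", "data", "good", "the"], ["file", "test", "time"]]
```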
Oh yeah, questions about SQL: this isn't SQL, sorry; the SQL Summer Camp was a different event, and this is just me working on projects. Yes, there were three days of SQL, and this is a different thing. Hmm, I don't know about this one: "hi", "beautiful", the letter "y", "and", "function"; not the things I most associate with the word "deadline" on Kaggle. I guess "high score", and then "submission", "model", "a lot", "test", "train"; that seems much more reasonable. And we have more clusters; I actually wanted fewer clusters, so that wasn't super helpful. Let's try cutting them up again and see if that helps. Some of these are just stop words; we haven't removed stop words specifically, we've only done keyword extraction, and in theory you probably aren't going to extract stop words as keywords.

Here's a question I haven't considered before: do I want to consider all posts on the forums? I sort of assumed going in that I did, but my thinking is that we may have problems with this measurement, because it's mutual information, which means words have to occur near each other (sorry, I try not to make my gestures offensive) in the same text, and if they never occur in the same text they probably won't ever be assigned to the same cluster, because they're never going to have high mutual information. So I'm wondering whether we want to remove very short texts. That might give us better results, because the texts we keep will be longer, so more words will co-occur, so we'll have more information for calculating the mutual information. Side benefit: the clustering itself won't take as long. Sunil says the for-loop is not considering the last element of the list. It is not, you are correct. I don't care that much... that sounds a little rude; thank you for pointing it out, but that's just the last word in this big (how long is the list, fifteen hundred?) word list, so I'm not super worried about it.

Let's try this again. Okay, so far the sample we've been using has been just some of the posts, if I remember correctly. Yeah, okay. So what I'm going to do is commit, and I might even create a fork. Can you fork your own kernels? Oh, some people have forked it; are any of these public? People have been building on my work... nope, they're all private, okay, that's fine. It looks like I can't fork my own... oh, I can, there we go. So I'm going to create a new version here, and I'm going to call it "cleaned up YAKE and Brown clustering". I'm going to tidy this up a bit, and then I'm going to try changing the input data and see how that affects our results. Muzammil says, "I recently did word clustering using word2vec representations after stop word removal, then clustered them using spectral clustering; it worked pretty well." Yeah, that's definitely the next step for me: a second approach, just to compare the two. I am going to do some sort of vector training; I might actually use fastText, because I believe you can update fastText vectors a little more easily, if I remember correctly. And then we talked about spectral clustering, hierarchical DBSCAN, and UMAP as the three clustering algorithms to try once we have the vectors.
The big difference between those approaches and this approach is that here we never stop using words; we never turn the words into numbers, they stay words throughout. My idea about the benefit of this is that it should help us interpret our clusters better than once we do vectorization. I don't know that that's the case, but I think it might help. I don't know, man, I'm trying stuff out; I'm using the tried-and-true machine learning technique of guessing and testing.

William asks: have you done any coding of a storytelling system using deep learning techniques, for example GPT-2, for natural language storytelling? I think we touched on this earlier: if by stories you mean something like fan fiction, neural networks are fine; Markov chains are also perfectly fine and much cheaper to train. If you're looking for something like news stories, I would not use a neural system, I would use templates. Pretty much everyone I know who is using some sort of machine generation system for news stories, especially around things like sports scores, tends to use templates rather than neural models. So that's my two cents there.

All right, I'm going to get rid of some of this. I don't need the clustering without the keywords, because I tried that and the results are not as good as with the keywords. I don't need this to be bigger again... okay, that's just the size it is now, I give up. I do want to move my pip installs to the top, and I'm going to tidy this up a little, because if we're going to be working with this code for a while... I tried some stuff, it worked okay, but at this point I want to iterate more quickly, and that's easier if my code is better. So we can get rid of these cells; I'm just consolidating cells and sorting things into the order I'd expect. This is the keyword extraction, so I do want that, but I'm going to put it into a function: I'll call it def keywords, and as input it's going to need forum_posts. Oh no, I want that out of there, so I'm going to take this out, sorry. What this code is doing is taking some of the forum posts, and I want to be able to change which forum posts I'm looking at at any given point pretty straightforwardly. Oh my god, why can't I use my mouse... fine, I'll use the keyboard instead. So I'm going to put that up here, and we'll deal with that bit in a minute; just a little refactoring. This should also make it easier if you want to work on this code: I'm going to make it public, and it'll be under, excuse me, Apache 2.0, so you're welcome to play around with it. And then the thing it's going to want is the sample posts tokenizer.

Okay, so we can get rid of these, and this tokenization, because that was from before we went into Brown clustering the first time; let's get rid of all of this, and then let's call this def tokenize, and the input from here is sentences. I forgot something: I do want to return sentences here, so I get output from my function. Right now we're refactoring some code I wrote earlier so it's a little easier to work with and I can iterate a little faster. Then we're going to want to return (I'll leave this for right now, go away) sample_data_tokenized. So I'm just putting the things I'm going to want to do quite a bit into some functions, so I don't have to copy and paste this code every time, and if there's a bug I only have to fix it once.
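As a rough guide to what those two helpers end up doing, here is a hedged sketch. The function names (keywords_yake, tokenize_after_yake) and the simple whitespace tokenizer are stand-ins for whatever the actual kernel uses.

```python
import yake

def keywords_yake(forum_posts, ngram_size=3, top=20):
    """Extract keywords from each post and join them into one keyword 'sentence' per post."""
    extractor = yake.KeywordExtractor(lan="en", n=ngram_size, top=top)
    keyword_sentences = []
    for post in forum_posts:
        # Recent yake releases return (keyword, score) pairs; older ones
        # reverse the order, so check your installed version.
        keywords = [keyword for keyword, _score in extractor.extract_keywords(post)]
        keyword_sentences.append(" ".join(keywords))
    return keyword_sentences

def tokenize_after_yake(sentences):
    """Lowercase and tokenize the keyword sentences before handing them to the clusterer."""
    return [sentence.lower().split() for sentence in sentences]
```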
Then we're going to call this... no, I think this can live outside of a function, I think it's already modular enough. I'm also going to get rid of all the little cells where I was just trying stuff out. Uh-oh, I can't download the notebook, that's no good, I guess I'll refresh; okay, it looks pretty much the same. And then I'm going to call this def get_clusters_maybe, because, again, I still don't know whether this is actually getting the clusters; I'm just assuming it is. We're going to need the clustering (let's actually move this out here) and we're going to need the mega cluster, which is just the list containing all of the clusters. We're assuming that when, in our list of words, the alphabetical order starts over from the top, that's when a new cluster actually begins. Oh right, and you're probably going to want some colons to define it. "NameError: clustering is not defined"; okay, that's because we haven't imported those libraries yet.

Okay, this is a little better. Let's run from the top, and I'm just going to grab the first hundred posts. We're going to need to install the yake library. Right now the output of the pipeline should just be word clusters, and actually we can hide the input and output here, which means it won't show once I compile the notebook; I'm also going to hide the output while I'm working. So the output from this should be words assigned to clusters. What I don't have yet... oh, I'm not actually running any of these functions, that's fine, I should get an error right around cell 12 as I run from the top. Okay, there it happened. What I'm expecting is a list of lists, where each inner list is the words in a cluster, and hopefully the words in each cluster are more informative than before.

A viewer asks: what's the advantage of using words rather than numbers, and might that cause conflicts? Here, each word is assigned to a single cluster, in theory, so it's a vocabulary-based approach. These were super common in NLP up to maybe when I started graduate school, so into the early 2010s they were still the thing people did, and there's a lot about them that's nice and helpful. The thing that's most helpful for me right now is interpretability. And we want to return... I keep forgetting to return things from my functions, hold up. Okay, yeah, it was cluster_list that I wanted to return. At some point I'm going to need to look at these clusters with my own eyes: the output isn't going into some downstream NLP pipeline, it's going to me personally. I mean, there will be a pipeline, but the output is a report that I read every day, and I want it to be very interpretable so I can understand what's going on. So a hoped-for benefit of using these lexicon-based methods instead of embedding-based methods is that I should get more interpretable output; we'll see. All right, I'm going to put these in a cell down here, and once I'm happy with them I'll move them to a script kernel and import them as a little utility library to make my life easier. So with that, our pipeline should look a little bit better.
I should be able to get my sample posts, and then get keywords from, I believe I called it, keywords_yake, passing in my sample posts; there we go. After that I'm going to want to tokenize, so I'll call this keyword_output... actually tokenized_output, and I'm going to use, what was it called, tokenize_after_yake on the keyword output. If I were in R I'd be using pipes right now, but I can never remember the syntax for that in Python, and I think it's fairly convoluted. From there, once they're tokenized, I'm going to use that output to do my clustering, and once I have the clustering I'm going to get the mega cluster out of it. Excuse me, pardon me, I think I'm allergic to something. So I'm going to get this big old cluster that we still need to break into a bunch of little clusters. Again, I wouldn't call this code done yet, but it should be a little less spaghetti at this point. Pardon me. We're only getting one cluster out of the code we're using because getting the individual clusters simply hasn't been implemented in the library yet, so that's a very reasonable thing for the clustering training to return.

All right, we've got that, we can get rid of this, and then we can pass the mega cluster into get_clusters_maybe. So now our whole big multi-cell process only takes a single cell, and we should be able to play around with it more easily. This is going to take a while, but it should all work, hopefully; famous last words. This is the output from the clusters being trained: this looks like the timestamp, and this looks like maybe the number of clusters; I guess we never figured out what this 13103 number was. All right, that's happening, looks good. Is our cell still running? I think it is. So now we should be able to look at clusters_maybe, and look at, say, the first eleven. I don't think these are the clusters; that's the thing I'm noticing here. So let's compare something: let's look at the length of mega_cluster, because if these are the clusters, each word should only occur once, since we're clustering at the vocabulary level, the lexicon level. The length of the mega cluster is 588, okay, and now let's get the length of the set of the mega cluster, which removes any duplicate words. Okay... oh, maybe it is the clusters, but it looks like these are just words that are used together, and the set is all in alphabetical order. Okay, how many clusters do we end up getting out of clusters_maybe? I'm trying to figure out what this data actually is. We get 34, okay, so it does seem like clusters, and we had more than 34 posts: if you remember, we had a hundred posts, and we've put those hundred posts into 34 or so clusters. Nikolas says the previous video had a top-ten list that had more understandable similarities. Yeah, these should be words from within a single cluster, where the words inside the cluster have higher mutual information with each other than with words outside the cluster.
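Putting the pieces together, the single-cell pipeline reads roughly like the sketch below. The Meta Kaggle path and column name, the minimum-length filter (the idea floated earlier about dropping very short posts), and the Corpus/BrownClustering constructors are all assumptions on my part; the clustering classes follow the yangyuan repo as used on stream, so check that repo for the exact import path and signatures.

```python
import pandas as pd
# Assumed import path for the Brown clustering code from the repo used on stream.
from brown_clustering import Corpus, BrownClustering

# Load forum posts from Meta Kaggle (file path and column name assumed).
posts = pd.read_csv("../input/meta-kaggle/ForumMessages.csv")
posts["word_count"] = posts["Message"].fillna("").str.split().str.len()

# Drop very short posts so the mutual-information statistics have more to work with;
# 20 words is an arbitrary threshold to tune.
sample_posts = (posts[posts["word_count"] >= 20]["Message"]
                .sample(100, random_state=0)
                .tolist())

keyword_output = keywords_yake(sample_posts, ngram_size=3)
tokenized_output = tokenize_after_yake(keyword_output)

corpus = Corpus(tokenized_output)        # constructor arguments assumed
clustering = BrownClustering(corpus)     # constructor arguments assumed
mega_cluster = clustering.train()        # returns one flat word list in this repo

clusters_maybe = split_mega_cluster(mega_cluster)
print(len(clusters_maybe), clusters_maybe[:5])
```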
We can also use, let's see, I think we called it clustering.get_similar, and just give it a word like "squeezenet"; we can see that no words have any mutual information with it. "tabular": the words most commonly used with "tabular", meaning the words more likely to be used with "tabular" than in any other context, are "thanks", "pneumonia", "work", "downloaded", "your", and "amazing". For "job" it's "in", "kaggle", "see", "what's", "really", "close", "work", and "amazing"; for "kaggle" it's "sharing", "kernel", "great", "competition", "nice", "congrats". Nicholas asks whether increasing the number of word sets from one hundred to a thousand changed much; that isn't actually the number of words we're changing here, it's the number of forum posts. And to answer the earlier question, we did more or less build the pipeline: this is the pipeline, basically. We made things more functional, we're doing keyword extraction and then Brown clustering, and the output is vocabulary-level, sorry, lexicon-level clusters. These are words that are used much more often with each other than apart; how many do we have, it looked like twenty to thirty-something. So these are words that are used more often with each other than on their own.

Let's try decreasing the size of the n-grams and also increasing the number of forum posts and see what that gets us, and after that let's look at the first 20 clusters. You'll notice I can tell it's running because we've got the little play symbol up here. I think the tokenization is going really well; I don't know how well the Brown clustering is going. I was pretty happy with the outputs of the keyword extraction; the outputs of the clusters, right now, I'm not super happy with. I think what we'll do, not next week since I'll be out of town, but the week after, is start looking at embeddings and distance-based clustering methods instead of information-based clustering methods, because I am not having a lot of joy here. We have clusters such as "data, the, a, href, learning, good, model, code" and "to, competition, of, kaggle, machine, test, time, file, set, lot, a, training, in, train, data, set, find, problem, for, http, deep, python, make, with, is, link"; I'm not very happy with these clusters. We can also look at the keyword output; I remember being much happier with that, though this is a slightly different test. Here we have keywords like "model", "hi kernel", "help", which I think is perfectly reasonable keyword extraction.

Ricardo says: I noticed there were some questions about YAKE; I should say that YAKE has not been trained at all, in the sense that we didn't learn any model or apply a machine learning algorithm, it's unsupervised. I think I would still call YAKE itself a machine learning algorithm. Nikolas asks why there are single-word clusters. The way Brown clustering works is that every word starts in its own cluster, and clusters are only merged if the merged cluster has higher mutual information than each of the words individually. So you only get words in the same cluster if they're more likely to be used together than apart, and if a word is used in a lot of different contexts, it's likely to end up in a cluster by itself. Ricardo also says YAKE is supported by statistics from the text, and yes, the language parameter can be changed to any other language; indeed, it's only used to preload a list of stop words.
Oh, that's helpful; I would say that's the language-specific part of the algorithm, and that's good to know. So if you wanted to apply YAKE to a different language, the thing you'd want to add, or have access to, is a stop word list for your language. Stop words are very common words: in English, things like "and", "the", "of"; in French, maybe "le", "la", "de". Generally things like articles and prepositions tend to be stop words, or, depending on the language, things like gender markers; I think in Chinese people tend to remove the category markers, the measure words (you can tell it's been a while since I've taken Chinese). Helpful, okay.

So I'm pretty happy with this one: "share your valuable insights", topic "share valuable insights". I think that's pretty good keyword extraction; I feel like I know what that particular forum post is about. It doesn't, unfortunately, reduce the total number of forum posts you'd have to look at. Let's get rid of this. I'm not happy with these clusters, not because I think they're bad clusters, but because they're not super helpful for me, and that's the problem that makes Brown clustering not a great pick for this particular project. "You're just showing off right now." No, I'm trying to be helpful, and I've used up all my languages now except for American Sign Language, and that's very rusty. Listen, if you can't use your degree sometimes, why did you bother getting it? My degree, for those of you who don't know, is in linguistics, so knowing about different languages is something I spent time on.

Yeah, so I'm going to change this name to "YAKE pipeline". I do generally try to be useful, and I think I got coffee on myself, so that's how my day is going. Yep, so we do have the whole pipeline set up, and by that I mean we've taken all of the tasks we have to do and put them into a series of functions that make our code much more compact and understandable going forward. I would move these functions into a script kernel, and (this is just my personal preference) I like to read in data in the same cell as the imports; that should make it much easier to iterate and try things out. I think I could probably get there with the Brown clusters with a little more tuning, but I'm not willing to do a lot of tuning for this project, particularly because I want this to be useful for other folks, and not everybody on the Kaggle team is super comfortable working with machine learning, just like I would not be super comfortable handling all of our security stuff. If someone said, "hey Rachel, now you're in charge of encryption," I wouldn't be able to handle that, and I don't want to say, "hey, teammate, I know you don't know anything about hyperparameters, but now you have to tune them for this model that's very brittle and fragile and won't give good results if they're wrong, and also I can't tell you which values are good to begin with." I think that would be frustrating for my colleagues, so I'm trying to avoid it. Also, I don't want to spend time tuning the same model every morning; that makes it less helpful, because this is eventually supposed to be a time-saving thing I'm building for myself.
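Picking up Ricardo's point about stop words, here is a hedged sketch of what pointing YAKE at another language might look like. The lan code and the stopwords argument are from the library's documentation as I remember it, and the tiny Swedish stop word list is purely illustrative; a real run would use a full list from an NLP stop word resource.

```python
import yake

# A handful of common Swedish function words, just for illustration.
swedish_stopwords = ["och", "att", "det", "som", "en", "är", "av", "för", "på", "med"]

extractor = yake.KeywordExtractor(lan="sv", n=3, top=10, stopwords=swedish_stopwords)

# "Thanks for sharing your valuable insights about the competition."
print(extractor.extract_keywords(
    "Tack för att du delar dina värdefulla insikter om tävlingen."))
```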
What does this do? Oh, that's kind of nice: the "K" here goes to all of your kernels, cool. Let me make this public. Oh, did I not actually commit it all the way? Let's see... oh no, I should have a committed version; it probably just needs a second to refresh. Oh, there's an error somewhere. All right, let's try committing it again and see if we get a more helpful error message so we can figure out what the problem is. Yeah, I'm glad I tried Brown clustering; I'm perfectly happy with the keyword extraction, but I don't think Brown clustering is the right choice for this particular project.

Vladimir, I'll get to yours; I'm going to Khan's first. Khan says: "Still a newbie in data science, but having been learning it, would I benefit from this class?" I mean, you're always free to join. I wouldn't say it's really a class; it's just me working on projects with people hanging out with me, and my point is more to show what the working process is like and what I'm working on, and also to get help in real time, because I often run into difficulties. Vladimir says: "I suspect the right way to get clusters from the Brown clustering is clusters = clustering.train()." So that does create the clusters to begin with, there we go. The problem is that in the particular repo we're using, they haven't actually implemented a way, once you've done the training, to get all the clusters out. It sounds like it was written as part of a course, maybe for teaching, and the point is to show how the code that builds the clusters works in the first place, if that makes sense. So unfortunately there's no way already in the code to do it. Okay, where did that go? It must have just been because I cancelled it. So now if we refresh the pipeline... oh, I'm looking at the wrong thing; I was wondering why this was a script when I thought I had a notebook. There's the notebook, there we go. I should definitely have hidden that output; actually, I'm going to rerun it one more time with the output hidden, because that is too much text to look at with your eyes. Nope, nope, nope: cancel, hide output, and then commit.

So yeah, we do have a pipeline, which is nice; going forward we can build on some of this. Next week, or I guess the next time, because I'm probably not going to do anything next week since I'll be out, we will look at distance-based clustering, and I've got high hopes; we'll see if they're unfounded. I think that's about all I'm going to do today. I've got so much to do, and I need to drive somewhere tonight that I'm not super looking forward to, so there's just a lot on my plate right now. Thanks for joining me, everybody. The kernel is available: if you go look at it right now, it's the Brown clustering pipeline, and you'll get this really long output, but when this commit finishes you'll be able to see it without having to scroll a bunch. It is public, so if you want to play around with the Brown clusters, you're perfectly welcome to; I invite you to. And next week, or two weeks from now I guess, we're going to jump in and start by doing some embeddings, taking words and turning them into numbers instead of keeping them as words the whole time, and we'll also start looking at some distance-based clustering methods. Like I mentioned, I'm interested in UMAP, hierarchical DBSCAN, and spectral clustering, and we might just try a couple of those and see which one works well. That's what we've got coming up.
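Since the plan for next time is embeddings plus distance-based clustering, here is a rough sketch of what that route could look like, assuming gensim's fastText implementation and HDBSCAN. Those specific libraries and parameters are my guesses rather than decisions made on stream, and tokenized_output is the keyword-sentence token list from the pipeline sketch above.

```python
import numpy as np
from gensim.models import FastText  # fastText was mentioned as a likely choice
import hdbscan                      # hierarchical DBSCAN

# Train small word vectors on the keyword "sentences" from the pipeline above.
model = FastText(sentences=tokenized_output, vector_size=100, window=5,
                 min_count=2, epochs=10)

vocab = list(model.wv.index_to_key)
vectors = np.array([model.wv[word] for word in vocab])

# Cluster the vocabulary by distance in embedding space; HDBSCAN picks the
# number of clusters itself and labels outliers as -1.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(vectors)

for cluster_id in sorted(set(labels) - {-1}):
    words = [w for w, label in zip(vocab, labels) if label == cluster_id]
    print(cluster_id, words[:10])
```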
All right, I'm going to call it there. I hope everybody has a great day or evening or whatever it is where you are, and enjoy your weekend. I will see everybody on Wednesday for the Kaggle reading group, where we'll have a new paper. I'll open it up to a vote, but my guess is that people are going to want to learn about XLNet, which is a recent model that's outperforming BERT, which we've read about previously. We'll see; I never know, you guys continue to surprise me in the best way. So I'll talk to you on Wednesday. See you then, bye.
Info
Channel: Kaggle
Views: 3,927
Rating: 5 out of 5
Keywords: data science, deep learning, nlp, neural networks, nlu, natural language, python, programming, coding, machine learning, ai, artificial intelligence, kaggle, research, technology, reading group
Id: UnTCwHJsyPE
Length: 59min 21sec (3561 seconds)
Published: Fri Jun 28 2019