Live Breakdown of Common Data Science Interview Questions | Kaggle

Captions
Yeah, sure. So I'm a developer advocate, but unlike a lot of developer advocates, I'm more focused on data scientists and software engineers, hence the "data" and the "scientists," right? And my day-to-day varies wildly. Today I'm doing this: I get to be here with you. I'm also preparing content for an upcoming five-day challenge on data cleaning, and I'm really excited about answering people's questions and helping people out, and doing my own data science and analysis work.

So "developer advocate" means you do the thing that you're advocating for, the data science and software engineering, but you also do a lot of outward-facing stuff: talking to people, helping people out, and developing materials.

Yeah, and I have an awesome job. I'm the interface between the sponsors, the companies or groups that host a competition and provide the data, and our community. A lot of my job, when a company or group wants to host a new competition, is working with them to make sure we have the right amount of data, that the data is in a format that works for Kaggle, trying to find any leakage that might happen to be in the data, and working out the appropriate evaluation metric. There's a ton of upfront work, much more than I realized when I was a competitor on Kaggle. The other fun part for me is not only preparing the data but then working with the community, running the competitions, and being on that end to make sure things go smoothly. So it's a lot of data science and a little bit of what Rachel mentioned, developer advocacy, to make our community successful.

And I should mention, if we haven't already, that Walter is also known as "inversion" in the Kaggle community, so I'm sure a lot of you are familiar with that name. Then, in terms of your own interview processes: I know you interviewed at Google, and you also interviewed for the Kaggle team, which is a little more role-specific. I'd love to hear about your experience interviewing for the role: what types of questions were asked, and what was really important for you to show during that process? Walter, do you want to go first?

Yeah, happy to. It was a day-long interview at the San Francisco offices, and I really didn't know what to expect. There are so many different things that can be asked, especially in data science, so I did a lot of very broad-based preparation. The actual process was: I got to give a 20-minute presentation to the team, with a question-and-answer session after that. One of the things that was fun was that I got to choose whatever topic I wanted, so there was a decision to be made there. After that I had five 45-minute interviews with a manager and various developer advocates, and boy, they asked a very broad range of questions. Some were very open-ended ("talk to me about machine learning") and some were very specific ("I would like you to write an algorithm that does XYZ"). It was a really fun mix. It was a long day, but it was also enjoyable in the sense that the interviewers were very smart and very personable.

Yeah, my experience was very similar. One of the cool but frustrating things about being in a newer field is that the interview process isn't as structured yet. If you go in for a software engineering interview, you know the types of questions you're going to get; for a data science interview there's a lot more range. I was evaluated on my ability to communicate clearly, especially around technical concepts. There was a coding portion, a technical part of the interview sort of like what you'd get with a software engineering role, but a little more focused on code readability and a little less focused on the most optimal, perfect solution to the problem. Then there were also lots of conceptual questions, like Walter mentioned, broad-based machine learning: what can you do, what's a good problem to solve with machine learning, and what should you not solve with machine learning? And I also got some data pipeline and processing questions: how would you fix this particular data set, or how would you approach a specific problem when the data you're working with has these particular issues? So, a wide variety of questions.

So, we're going to get started, but for people tuning in who have seen the other sessions, this is going to be a little different. We're going to try to ask some of your questions in real time. Walter and Rachel are going to be doing kind of a mock interview with some live coding, so you might want to follow up on something they say, and we'll do our best to work those in. Then at the end we'll have time for a Q&A on things we missed or on higher-level, more general questions. With that, I will hand it over to Walter and Rachel.

All right. We're going to start by having Rachel ask me two questions, and before we go into them, one of the things I really want to emphasize is that this is a demonstration of some general principles of how you might approach answering questions within an interview. We don't want to focus on the actual content: what I say may or may not be technically correct, you might have a better answer yourself, and you may want to answer differently depending on your interviewer and their background and knowledge. So pay attention in particular to how, in general, we improve the answer, and how we increasingly demonstrate to the interviewer that we know a broader range of material than is just asked in the question. And, Rachel, if
you want to add anything to that before you ask my first question?

I think that sounds good. I'll also talk a little afterwards about what each question was trying to get at, the idea behind the question. If you can figure that out, and then answer the question behind the question in addition to the question itself, that's really good.

All right, Walter. I've read your resume, it's very impressive, and I have a question for you. Assume that I am a first-year college student. I'm interested in data science, and I've studied calculus but not statistics, so that's my math background, but I don't know a lot about stats. I want you to explain collinearity to me.

Oh yeah, that's actually pretty easy. In statistics, collinearity (sometimes it's called multicollinearity) is a phenomenon in which one or more of the predictor variables in a multiple regression model can be linearly predicted from another predictor variable with a pretty high degree of accuracy. Sometimes you might say that those two predictor variables are highly correlated. So that's collinearity. How did I do, Rachel?

So, a technically correct answer, but remember, I haven't had a lot of statistics background, and you used some jargon I might not be familiar with. "Multiple regression models," that sounds fancy. (A side note: I do have a lot of statistics background; this persona doesn't.) And I would really appreciate a concrete example here. Especially when you're explaining to somebody who doesn't have a lot of background, a concrete example can be super helpful. Could you give one?

Sure, great. Let's say you wanted to build a model that would predict the heating cost of a new building you were going to construct, and you had all sorts of inputs: the area of each floor, the number of floors or stories, maybe the total height of the building, the number and size of windows, weather information, et cetera. In general there's a very strong relationship between the total height of a building and its number of floors, so we would call those two variables collinear: as one goes up, the other goes up. More floors, greater height. It's probably worth noting that collinearity isn't necessarily good or bad; rather, it's something that should be considered as part of the exploratory data analysis and model-building process. For example, you might decide to remove one of those variables because you don't need both, just to simplify your model. On the other hand, there may be reason to keep both: in this example, the ratio of building height to number of floors might differ depending on whether it's a residential, commercial, or multipurpose building, so that ratio might give you additional information. Another reason people might be concerned about collinearity is that it may affect how you interpret the model weights, and if you'd like, I'd be happy to give you an example of how that works in practice.

That's all right, I don't need an example, but have you ever had to deal with it?

Interestingly, in most of the work I do, the models I tend to build are robust to correlated or collinear input variables; these tend to be tree-based methods like random forests or extreme gradient boosting. But there was a research project where we had a sensor with various inputs and we had to predict an output, and the model had to be very simple because it had to fit in a very small programmable logic controller on the sensor. We had potentially 300 inputs, which was far too many for the model. So I systematically went through, looked at all of the correlated variables, and continued to reduce the ones that weren't needed, the redundant ones. I did this recursively, by looking at variable importance and slowly removing the ones that were less important, until we had the most predictive model with the smallest number of input variables. And there are a number of ways I would recommend to look for correlated variables if you have a bunch of data: there are statistical tests, like a correlation coefficient test, or something called the variance inflation factor, but my favorite is just doing scatter plots of all the variables. You can very quickly see which variables are highly correlated.

Great, thank you.

Would you like a little review of your answer?

Yes, please, tell me how I did.

So the first answer you gave was the textbook definition: perfectly fine, technically correct, and it shows that you've studied and have this knowledge. In your expanded answer, you talked through a concrete example that was a really good fit for this audience, the college student. And your specific example, and how you dealt with it, shows me that you've really thought about this and have experience with it. Collinearity is a really common thing that comes up in data science, so I'm getting good evidence that you have depth of knowledge around this topic and around statistics/ML. Great.

Thanks.

In terms of prepping for an interview, would we recommend that people come up with stories like this for common concepts that might come up?

Yeah, absolutely. If someone asks you about a concept and you can say, "yes, I know about that concept, and here's an example of me applying it in a project in a really sophisticated, savvy way," that's a really good way to show that you know what you're talking about and that you have experience.
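An editor's note: the checks Walter lists for finding correlated variables (correlation coefficients, the variance inflation factor, scatter plots) can be sketched in Python. The toy building data below, echoing his height-versus-floors example, is an illustrative assumption rather than anything from the session:

```python
import numpy as np
import pandas as pd

# Toy data echoing the example: building height tracks the number of floors.
rng = np.random.default_rng(0)
floors = rng.integers(1, 30, size=100)
height = floors * 3.2 + rng.normal(0, 1.0, size=100)  # collinear with floors
window_area = rng.normal(50, 10, size=100)            # mostly independent
X = pd.DataFrame({"floors": floors, "height": height, "window_area": window_area})

# 1. Pairwise correlation matrix: values near +/-1 flag collinear pairs.
corr = X.corr()

# 2. Variance inflation factor: regress each column on all the others;
#    VIF = 1 / (1 - R^2), so a column that the others predict well gets a
#    large VIF.
def vif(df, col):
    others = df.drop(columns=[col])
    A = np.column_stack([np.ones(len(df)), others.to_numpy()])
    y = df[col].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = {c: vif(X, c) for c in X.columns}
```

A common rule of thumb treats VIF above 5 or 10 as a sign of problematic collinearity, and Walter's favorite, pairwise scatter plots (e.g. `pandas.plotting.scatter_matrix`), catches the same pairs visually.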
It also backs up the knowledge you claim to have.

Right, and you could kind of get there by looking at the concepts you've used, then coming up with stories, and then having them on hand. Sorry, Walter, were you going to say something?

Yeah. I would say, at least for me, it's really hard to do that on the fly, when somebody says "can you think of an example?" So as part of your interview prep, I highly recommend you systematically take a Google Doc or a piece of paper and write down every project you've worked on, every Kaggle competition, and systematically write down the feature engineering you did, how you did your model reduction, any issues and how you overcame them. That may be a very long list, but when you go into the interview it's easy to keep those things in mind, versus trying to think of them on the fly. Plus, if they're on your resume or CV, people are going to ask about them: "oh, I noticed you modeled marmoset mating in Indonesia two years ago, tell me more about that." So having that information at your fingertips is good.

So you did read my resume very closely. Are you ready for another question? (We'll keep looking for more from the audience. No questions right now? Okay, go for it.)

All right. The last question was sort of getting at your statistics and ML background, your conceptual knowledge of the math behind things. This question is going to be more about communication. I want you to explain the difference between statistics and machine learning, but I want you to assume that I am a manager and I don't have a lot of deep familiarity with either of these fields.

So this is a great example of a question that is very open-ended and allows you to shine, or to trip over yourself. Let me do my best and see what we can do. This is actually a very often hotly debated topic, and you have different camps of people who may say statistics and machine learning are completely different, or that machine learning is just statistics, or a subset of statistics, or that it's built on statistical theory. I don't want to get into that kind of debate, so I'll talk about some things that are fairly general. You can always find an edge case that will break any of these generalities, but I'll try to give you a general feel.

In statistics, quite frequently we use statistics to test hypotheses. An example might be: do right-handed batters hit more home runs than left-handed batters? We can collect the data, do a statistical test, and reject the null hypothesis, which is a fancy way of asking whether there is a difference. In statistics we tend to simplify the data: we might use just right- and left-handed and not add all these other variables, we tend to reduce down to just a few parameters, and a lot of times our models are linear. So that's one aspect of what statistics is. Also, in statistics the treatment of outliers tends to be very important: think of when you've done a linear regression in Excel and you had one outlier and it skewed the whole line.

In machine learning, again as a generality, we tend to focus on prediction. We might ask: what's the probability that this batter, in this inning, is going to hit a home run against this pitcher? In order to answer that we need a lot more inputs, and because we need a lot more inputs we also need a lot more data to build the model. And where in statistics we tend to choose a model, say a linear model, in machine learning we have classes of algorithms with very flexible models, algorithms that will build models around your data. They tend to be much more complex and much more able to handle highly nonlinear, complex data, but there's a downside: where a simple linear model has a couple of parameters, a complex machine learning model might have thousands or even millions of parameters. A lot of people ask which is better, statistics or machine learning. Often we use them for different problems, even though there's some overlap, and as with any tool, we have to understand the pros and cons and when it's appropriate to use each. A lot of that comes with experience, which is one of the ways we improve ourselves: by working through lots of different problems.

Just to give you a little feedback on the answer so far: I don't want to get into a debate about this either, and overall I agree with pretty much everything you said. But putting on my "I'm a manager and I don't know statistics" hat: there was some jargon in there I might not have encountered before. You talked about parameters, which I may or may not have encountered; you talked about linear relationships and linear regression, and maybe I've done a linear regression, but probably I haven't. So do you want to give it a slight tweak and see if you can tailor this answer more closely to your imaginary audience?

Sure. Before I go back into the persona of answering the question: this is actually a very difficult part of the interview process, understanding what the right level of communication is. For example, as part of my interview process I had to give a presentation, and it was very difficult to know what level of complexity the presentation should aim for. Should I go really complex, or very simple? If you go too simple, you might insult the intelligence of the people you're presenting to, but it can also be too complicated. I don't have a simple answer for the audience, other than that it's very important to think through what the right level is.

And how do you think through that, Walter? Do you prep ahead of time by looking into profiles, so you have a good idea going into each session?

I believe strongly in over-preparing. If you know who's going to be doing your interviews, I would definitely look at their profiles and see if they've done presentations on YouTube that you can watch, to get a feel for their personalities and what they know. If they've published seminal papers on the topic you're going to present, that's different from if they don't have any experience with it. So I highly recommend that.

That's a great point. Or you can also just ask. I have a natural language processing background, and if I'm going to talk about something like a text analysis, I'll ask: do you have a background here? Do I need to explain what a DTM is, or do I need to slow down? What's your level of experience? And I can use that to adjust how much background information I give.

So let me give a much higher-level summary, and again this is very general. With statistics we tend to be happy with smaller datasets; with larger datasets, most of the statistical tools start to struggle. With machine learning we absolutely have to have larger data (again, not in all cases, but it loves large datasets), and a lot of things actually break down with small data. So when you think of statistics versus machine learning, a very simple way to differentiate them is just the amount and complexity of the information that you have and want to model. But ultimately, they both come down to modeling data.

Okay. All right, I found that very approachable. Good.

We have a question about the interview process that just came in. I know in this mock interview, Rachel, we're purposely doing a good answer and then a stronger answer, but in your experience, do interviewers usually ask you, or have they ever asked you, to improve upon an answer, or to give a different answer? Yeah, absolutely:
a good interview will have you give an answer and then ask additional questions to help you. If you don't immediately hit it out of the park, they'll usually help you along; my feeling is that getting a follow-up question is not a sign that you haven't done a good job.

Yeah, and that's particularly true if you're asked a coding or algorithm question, because almost certainly, whatever you do, they'll say, "oh, did you think about this edge case?", and there's a good chance, because you're on the spot, that you didn't. Then they'll ask you to improve your algorithm or your function to handle that case. So, to your point, you should expect that and not at all feel like your original answer was wrong. A personal story: when I was interviewing, the interviewer said, "oh, think of this scenario," and I thought, "oh, that completely breaks the function." And he said, "well, no, not necessarily, it depends on how we want it designed," and I felt a little bit better. So you don't have to feel like you messed up.

All right. So now, Rachel, thank you for the wonderful interview questions. The tables have turned: we're going to share a document and I'm going to have you write a function. I'm going to paste the question into the document right now and give you a second to read it, and then you'll write a function that accomplishes what I just pasted in.

(I'm trying to; the document has gone full screen and I can't seem to type in it. Remember that escape trick? Click on the webinar, then press Escape; that should work. There may be a View option at the very top of the screen. Last time I did this I made Walter disappear, so I'm being very careful not to click on random stuff. We're giving people time to come up with their own answer to this question; we built this delay into the presentation. That was my plan all along. Go ahead and read the actual question.)

So it says: write a function which tells you which rows in a column contain an outlier. You can use any programming language you like, but you should be prepared to justify your choice. All right, I'm ready now.

I'm going to answer the question sort of backwards: I'm going to do this in R. Other options would be Python, or I guess maybe you could do it in something like C, but I wouldn't recommend that. The reason I say that is that, reading through the question, I see that I want to know which rows in a column contain an outlier, and to me that says I'm going to be getting the data in a data frame object. In Python you use pandas to interact with data frames; in R, data frames are a first-class object, and personally I like the tooling in R, so I tend to use R if my data is already in a nice table to work with.

So, doing this in R: what should we call this function? How about rows_with_outliers. It is a function, and it's going to take in a column. I just like to get the correct syntax for the function itself down first, and then I'm going to write some pseudocode in here to think through and put my thoughts in order, just comments to myself that I can follow along with. First I want to get the maximum value that's not an outlier; actually I want both the maximum and minimum values, both ends, so everything in the middle would not be an outlier and everything past them would be. (The min and max values that are, or are not; sorry, I don't need to go through and fix the pluralization here, that's not important.) Then I'm going to look in the column and see which rows are below the min or above the max, and then I'm going to return a boolean vector marking those rows. (And that's not how you spell "column." There we go.)

All right. First I'll get the maximum value that's not an outlier (and it's auto-correcting my capitalization): we get the mean of the column and subtract 1.5 times the standard deviation of the column. And then the max would be the same except... no, sorry, that would be the minimum, I'm getting things backwards. And then this would be the maximum: instead of minus, it's plus. (Coding in word-processor documents is the worst, because of all the autocorrect. It's just not nice with code.)

I'm going to use some of the tidyverse functions. This is a pipe: it takes what's on the left side and puts it into whatever is underneath on the right side. I'm going to create a new column, call it "outliers," and it's going to be the things in my column that are less than the minimum (let me not get this backwards) or the things in my column that are greater than the maximum; I think all of this needs to be in parentheses. Then I'm just going to select the column I just made, pull it out, call this object "outliers" (and correct the capitalization), and return that boolean vector I just made. So: I've got this minimum value and this maximum value, I look in the column I'm given, I create a boolean vector that is true if the value in that row of that column is less than the minimum or greater than the maximum, and then I yank that out and return it.

Great. So let me give you some feedback on this. First of all, an absolutely fantastic justification of why you used R; I really appreciated that. And I really liked how you wrote the pseudocode, or not even pseudocode, the placeholders for what you wanted to do; that made it very easy for me to follow and see your logic. But what you did exceptionally well was talking out loud as you code, and this is something people often struggle with. Most of us don't write code while talking to ourselves; maybe you do, Rachel, but most likely you practiced this, which is a key point for those listening: if you're going to have a live coding problem, you want to practice coding while talking out loud and explaining your thought process, because it's awkward for most of us and doesn't come naturally. What it does, Rachel, is it allowed me to understand what you were thinking and why. Maybe you say the right concept but write down the wrong thing; at least then I know you're thinking the right thing and just mistyped. You did an excellent job at that.

Let me mention one area where you can improve this exercise. There were a number of assumptions you made, based on the problem statement, to write this function. I'd like you to think about what some of those assumptions were, maybe clarify them, and improve the function based on what those assumptions may or may not be.

Yeah. One big one is that I assumed that by "outlier" you meant something more than one and a half standard deviations from the mean, which may not be what you want: sometimes you want two standard deviations, sometimes more, and in writing this function I've locked you into one and a half. So one thing I can do is create a new variable so that you can specify how many standard deviations from the mean you want to count as an outlier. Who was it that said the only hard thing in computer science is naming things? Let's call it how_many_sds; that's a good name. I'll give it a default of 1.5 and then refer to it, so you could consider something an outlier only if it's five standard deviations from the mean, or two, and now you can specify that.

Another thing I assumed was that what you wanted was a boolean vector that's true in the rows where there is an outlier and false in the rows where there is not. Maybe you wanted all of the outliers returned as numbers, and the way I've written this function you don't have the option to do that; you're just going to get the boolean vector.

Right, very good, and we don't have time to go into all of that, but those are exactly the kinds of clarifying questions that are good: when you read the question, think about your assumptions, and ideally ask about them before you start coding. There's an additional thing you assumed: how do we deal with missing values? Why don't you think about that and tell me what some changes could be.

Yeah, so I assumed that we don't have any missing values. I also assumed we get numbers, which we may not: we might get a text column that doesn't actually have any numbers in it. One way I could get around that is to write a little test in here that kicks out an error message and doesn't run the function if, whoopsie, there are a bunch of NAs, or you've got some text in here, things that aren't numbers. Another option is to decide what to do whenever I come up against an NA: I could just drop all the NAs, using something like na.omit to get rid of them, or I could guess some value that the NA could be and substitute that in for all of them. So I have some options there.

Very good. I think we're good on this, and it really highlights that such a simple question gives the interviewee so much opportunity to ask the interviewer clarifying questions. And as you did that, Rachel, you showed me: oh wow, she actually does know about all these things and thinks about them. So it's a very good opportunity to let the interviewer know what you know and how you think.
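An editor's note: Rachel's function, with the how_many_sds parameter, the error for non-numeric input, and the handling of NAs discussed above, might translate from her tidyverse R into pandas roughly as follows. This is a sketch of the ideas from the session, not the original code:

```python
import pandas as pd

def rows_with_outliers(column, how_many_sds=1.5):
    """Return a boolean Series: True where the row's value is an outlier.

    Following the session's (stated) assumption, an outlier is any value
    more than `how_many_sds` standard deviations from the column mean.
    Non-numeric columns raise an error; NaN/NA rows are never flagged,
    since comparisons against NaN are False (dropping or imputing them
    first, as discussed, is the other option).
    """
    col = pd.Series(column)
    if not pd.api.types.is_numeric_dtype(col):
        raise TypeError("rows_with_outliers expects a numeric column")
    mean, sd = col.mean(), col.std()       # mean/std skip NaN by default
    minimum = mean - how_many_sds * sd
    maximum = mean + how_many_sds * sd
    return (col < minimum) | (col > maximum)
```

As Rachel notes, returning the outlier values themselves instead of a boolean mask is another reasonable design; with the mask you can still recover them via `column[mask]`.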
how you think so well done and then we have a couple questions coming in just regarding kind of using the correct syntax or if you might have a small typo I mean you know as an interviewer how important is that to you and you know as a person you know maybe coding and something like this you know how should you prioritize that and what you're thinking about I mean try to be right but I certainly don't code not non-id or not with like at least tab complete so like I don't I don't think it's a deal breaker it's just try not to meet guys and I was looking at my kind of like where's my syntax error yeah I would agree I don't think anyone holds a typo against against you in fact you know a lot of you know I've been in the situation where I said well I I think this is the correct syntax let's assume that it is and move on you know if you'd like to challenge it and and it was like no that's fine you know whether or not it worked I don't know one thing if you do have the documents so when I interviewed the word documents that are the Google documents that we're going to use in the live coding were made available I actually formatted it so that it was you know mono like the courier new so it was monospaced font and then made my my tab spaces every two and so I just tried to and also turned off auto cap which made it a little bit easier not the same as an editor but it did help was there another question no I think we're good okay all right so I'm gonna drop in another question to your doc Rachel and I'm just gonna do below here just what you're doing that to be clear at that question was evaluating coding ability coding ability and right that the thought process behind it all right so let's uh Nintendos I'm just gonna I'm just gonna read through this well you do what you're doing so you're working with a veterinarian an animal doctor who's got a data set with information on horses they've treated they want to build a model to help predict whether a horse will need surgery 
or not based on various health metrics, to help owners figure out when they need to call the veterinarian, and they send you their data file. I've got the first few rows down here, and you quickly notice that there are some missing values. I do in fact quickly notice that there are some missing values. How do you proceed? Okay, so one thing I could do here is drop all the rows that have missing values in them and just get rid of them. Sometimes, especially if you only have, like, one or two, that can be a time saver if it's not worth the trouble of figuring out how to resolve those missing values, but there are so many rows with missing values here that that would remove all but one of them. I could also remove the columns with missing values, and then the rows with missing values, but that also would remove a lot of our data and would just be sort of throwing it away. So another thing I could do is imputation, to try and figure out what those missing values should be. So, let's see, one of the columns here is rectal temperature, so for this NA cell that's second from the bottom, I could look at the average of rectal temperature and just assume that that horse had whatever the average rectal temperature is. I don't know that much about horses; I'm not sure what that should be. And for the ones that have not a numeric value but a sort of categorical variable, like this abdominal distension here, which is, like, a swollen tummy (there's an NA three up from the bottom), I could guess that it's gonna be whatever the most common degree of abdominal distension is, which I'm guessing is none, but again, I don't know. Great, yeah. So technically that was fantastic. I love how you explored the different options of what you could do and some of the implications of the different options, so I thought that was really fantastic. You know,
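The options walked through here (dropping rows, dropping columns, mean imputation for numeric columns, mode imputation for categoricals) could be sketched in pandas roughly like this. Note this is a hedged illustration: the column names and values are made up, not taken from the actual horse dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the vet's data; column names and values are illustrative only
df = pd.DataFrame({
    "rectal_temp": [38.5, np.nan, 39.2, 37.8],
    "abdominal_distension": ["none", "slight", np.nan, "none"],
})

# Option 1: drop any row with a missing value (here this discards half the data)
dropped = df.dropna()

# Option 2: impute numeric columns with the column mean
df["rectal_temp"] = df["rectal_temp"].fillna(df["rectal_temp"].mean())

# Option 3: impute categorical columns with the most common value (the mode)
df["abdominal_distension"] = df["abdominal_distension"].fillna(
    df["abdominal_distension"].mode()[0]
)
```

As the discussion goes on to point out, asking the domain expert (here, the vet) whether a sensible default like "normal for a healthy horse" exists usually beats a purely statistical fill.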
one thing I think you could have done better is potentially talking a little bit about what NA means. In some cases NA means that the data could be available but just wasn't collected; in other cases it means it's completely not applicable to this particular instance. So that may have been an improvement. One thing in particular that I love to hear from candidates is when they talk about talking to the customer. So much in data science has to do with interfacing between somebody who owns the data, a marketing group or a research group or whatnot, and you as a data scientist working with that. So talk to me a little bit about how you would go about talking to the customer. Yeah, definitely. So this is a veterinarian, and I can talk to them and ask questions about horses, and they will tell me answers about horses, and I can use that. So I was saying I would just guess that the rectal temperature would be whatever the average is. One thing I could do instead, especially if this is a data set of horses that are already sick, is say that if it's missing, it's gonna be whatever the baseline for a healthy horse is. Again, I don't know what that is, but I'm sure the vet does, so I could just ask them and be like, hey, if there's a missing value, can I just put in whatever is normal for a horse? And another thing: I don't necessarily know what some of these things are. So there's, like, nasogastric reflux, which is, I guess, like, no stomach juice? Horses are weird. And we've got some options here, like the amount. So I can ask: if there wasn't any nasogastric reflux recorded, is that because there was none? Can I just put none there, rather than guessing, like, oh, all of these horses have a liter of snot or whatever (sorry, a little bit gross)? And I could also use their specialized knowledge when saying, okay, there's this pH, and maybe
if there's no nasogastric reflux, there isn't any pH; there's nothing to measure because there's no fluid. So what do you do with that? Should I separate this out as maybe, like, a separate data frame? So we could use a branching decision structure where if there is nasogastric reflux, then we need to worry about the pH, and if there isn't, then we don't. So really working with the veterinarian, I think, would be a good way to make sure that we build the most robust possible model for this data set. Yeah, great. And Rachel, again, I have to say one thing that you just did really well is talk through and help me understand your thinking process. Oftentimes technical people, not always, but oftentimes, really do struggle to give more than the correct, technical, or short, terse answer, and you did a really good job of exploring the possibilities and what you know. And don't ever be afraid to do that; if the interviewer wants to move on, the interviewer will let you know and say it's time to move on, but in general we want to make sure we let them know what we know. All right, I think that's it, Rachel. Anything else you want to add? Oh, not really. I think there are probably gonna be some more questions for us. Yeah, and a couple questions, actually. We didn't prep you for this, but somebody submitted a sample question and wanted to know how you would answer a more soft-skills type question, or just something that wasn't so technical. So I'm going to go ahead and ask the question, and then one of you can answer first, and obviously they'll have a slight advantage. But one thing that I'm gonna do, and this is one of my tips to you, is I'm gonna take notes while the question is being asked, so that I can refer back to them later. Okay, great. So this is a seemingly simple question; it's just: is there a machine learning technique that you feel you need to learn more
about, like, you personally, or, like, in general that one should learn more about, for this role or to achieve your goal? Oh, I've got a slam-dunk answer for this, Walter. Well then, go ahead. So my background is in natural language processing, and as I mentioned, my PhD is in linguistics, so I know a lot about sequences and not very much about images. So I would really like to learn more about a lot of the more image-focused, especially deep learning, techniques. Like, I know about CNNs, and I could build one if I had to, but I don't have a deep knowledge of, like, GANs, you know, or a lot of the things that are really focused more on images. So I think that, as a data scientist, is an area of personal growth for me. And is that something, given that you have very specialized knowledge in this area, as you were interviewing at Google or Kaggle or other roles you might have interviewed for, how do you play that as a strength and not a perceived weakness? Is depth versus breadth considered better? Hmm, if I could jump in on that one: this field is so broad and moving so fast, I don't think it's realistic to expect everyone to know everything. So my view is that it's very important to understand your strengths, also understand your gaps, and be able to speak intelligently enough about them. So, for example, I'm the same with natural language processing: I've done a little bit, but it's basically cut-and-paste coding and making it work. So don't be ashamed if you're not an expert at that, but the thing I always emphasize is having a learning plan and letting them know that this is my learning plan: I want to know more about this; here's what I know; here's what I also want to know. I think somebody that is introspective enough to know their weaknesses and have a plan to fill them is actually a person that I want to have on my team, versus the person
that doesn't know what they don't know. So, in my opinion, don't be afraid to let them know what you don't know and your plan to fix that, or, you know, the remedial plan. Okay, so what don't you know, Walter? What machine learning techniques do you think you should learn? Yeah, I am very, very weak when it comes to anything text-based, NLP. Like I said, I've done it, I've played around with it, but truly understanding the nuances and how it works, and especially things like leakage in a Kaggle competition, I don't know what the standard tools are, you know, like scikit-learn transformers and all that. But yeah, that's probably my biggest opportunity for growth. Awesome. And, you know, this conference is really targeted at people looking to land their first data science job, so I'd love to know if you have any advice for people prepping for this interview process for the first time. Like, are there resources that you would push people towards, in terms of mock interview questions, getting practice? How can you prepare for giving answers like you just gave? Well, I'll start, and then Rachel, if you want to finish on that. So, two things. Personally, I created a private GitHub repo that had markdown notes: every question I had in an interview, if I knew it, fine, but if I didn't know it, I'd make notes, and I just kept building upon that, and if in five years I need to do another interview, I've got that. So I would recommend that you build an asset of notes and your experiences and tips and tricks and things you want to remember, and build that either in a notes app or a GitHub repo, however you want to. But I'll also say it's never-ending, so you have to balance what it makes sense to learn and learn well, and I would recommend learning the core things well and then expanding, versus trying to be so broad that you don't know anything well. Yeah, I have kind of a similar thing to Walter,
but I approach it from a slightly different angle. So I'm a data scientist because I love data, and especially language data, so the thing that keeps me really motivated to keep learning is that I have new projects, and I want to learn the things that will let me do that project. If any of you are knitters, there's this distinction between, like, a process knitter and a project knitter: project knitters are like, I want that scarf right there, and process knitters are like, I love the feeling of the yarn in my hands. And I'm much more of a project scientist, so for me, in my own growth, having a specific project, something that I'm interested in and passionate about and want to learn more about, is really motivating. I'll also say, again, it's a young field. Like, if you're interviewing for a software engineering position, you read Cracking the Coding Interview; it's not all relevant to us, there's like half a page of statistics in there. So the centralized resources aren't as robust. Well, and I wanted to come back to something you just did for the previous question, Rachel, which was pick up your notepad. I think this is something that was surprising to me in interview prep for Google, even outside data science: a lot of people coming into their first interviews, especially if it's straight out of college, but just in general (I mean, I'd interviewed before), hadn't been given the advice that it's okay to take notes and stop and think about a question and say, you know, I want to think about this for a minute. So what's appropriate? What do you usually do in terms of how you field the question, what are your next steps, not even with the answer, but just your process for coming up with what you want to say? Yeah, so I usually take notes as the question is being asked, and then I sort of say back the question: I paraphrase it and I say what I heard them say, which isn't necessarily
always the same thing. And it's not, especially if you're just coming out of college, it's not like a test, right? They expect you to have some knowledge, but they don't necessarily expect you to know the exact answer to this particular question because you've prepared for it, or that they've given you the exact information you need to prepare for it. So really making sure you're on the same page, and also showing that you'd be a person that's good to work with, that you can communicate, and that you're not afraid to ask for help and reach out: that's another part of the interview process, right? It's all secretly soft skills under the hood. Yeah, and I'm gonna answer tangentially, so it's not answering that direct question, but it's a trick that I like to do. When you go through the interview process, they're gonna say, do you have any questions for me? That's another thing that can be hard to think of on the spot, so I make sure I make a list of questions that I'm gonna ask, and I have those in a folder that I have open, not papers scattered all over the desk, so when the question comes, if I can't think of anything, I can just glance down and ask one. So I think that idea of having things on paper, even just brief notes to jog your memory, is very valuable; don't underestimate that in a live interview situation. My favorite question for them is: if you could change one thing about your job, what would it be? Because it gives me an idea of what's gonna be the most frustrating thing in this position, and that's really good to know, because you're interviewing them too; you don't have to take every job offer. Great, well, thank you both so much. I want to give you each a second to plug something exciting that you're working on at Kaggle, because I know you're both deep
into it, and Rachel, you're about to launch something exciting, so Rachel, you want to start? So some of you may be familiar with the five-day challenges that I've been doing, which are, like, five little coding exercises over the course of a work week, and we have the single most requested topic coming up, starting Monday: it's data cleaning in Python. I'm super, super, super excited. Great. And I'm at less liberty to pre-announce competitions that are coming up, but I think in general one of the things we are working on is more competition datasets that are off the beaten path. So we are working on some that appeal to a very broad set of people, but in a more interesting way. We have plenty of competitions coming down the pipeline, so if you like competitions, I recommend you check back regularly.
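For reference, the outlier-flagging function discussed at the start of the session (a boolean vector that's true where a row is an outlier, plus the NA handling and non-numeric error check that came up as clarifying questions) might look like this. This is a hedged sketch, not the code from the interview: the original was in R (hence the na.omit mention), the function name is made up, and the 1.5 × IQR rule is an assumption since the question left the definition of "outlier" open.

```python
import numpy as np

def flag_outliers(values, drop_na=True):
    """Return a boolean array: True where the value is an outlier.

    Uses the common 1.5 * IQR rule (an assumption; the interview
    question left the definition of "outlier" open).
    """
    # Raises ValueError on text input, like the proposed error check
    arr = np.asarray(values, dtype=float)
    # Analogous to R's na.omit: compute the fences without the NAs
    clean = arr[~np.isnan(arr)] if drop_na else arr
    q1, q3 = np.percentile(clean, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    return (arr < lower) | (arr > upper)
```

As noted in the discussion, returning the outlier values themselves instead of a boolean mask is another reasonable interpretation, which is exactly why it's worth asking the interviewer before coding.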
Info
Channel: Kaggle
Views: 58,831
Rating: 4.949646 out of 5
Keywords: Kaggle, Kaggel, coffee chat, live-coding, live, learn, api, cli, python, data, data science, interview, questions, transfer learning, coding, networks, programming, technology, tech, machine learning, AI, artificial intelligence, coders, programmers, help, tutorial, projects, 101, rstats, stats, statistics, what is kaggle, how to, github, developer, kernels, datasets, data visualization, deep learning, sql, challenge, competition, whitehat, code, lesson, CS
Id: aXUsrKPTBvY
Length: 54min 32sec (3272 seconds)
Published: Wed Mar 21 2018