Data Science Job Interview – Full Mock Interview

Captions

This mock interview will show you what a data science interview is like. Keith Galli is an experienced data scientist and interviewer, and in this video he interviews Kylie Ying. Kylie has taught multiple machine learning courses on our channel. This is a great video for anyone currently in the job market for a data-focused role. It's also a solid video for anyone who wants a better understanding of the machine learning process. They cover topics that include building a data set for training and testing purposes, feature vectorization, and model implementation details. Consider pausing after the questions and thinking about how you would answer them.

Hey, what is up everyone, and welcome back to another video. In it, I'm going to conduct a full-length data science interview. I think that this video is great for any of you who are in the process of applying to a data science or data analytics role and want to get a sense of what a strong interview looks like. Even if you're not actively applying for a role, I think that this video is also great for anyone who wants to build a better intuition of the data science process and the steps involved in creating machine learning models. In this video I will be interviewing Kylie Ying. Kylie is a fellow tech YouTuber who makes content on topics that include programming projects, Python tutorials, and lifestyle vlogging. Kylie has both her bachelor's and master's degrees from MIT, and she has a track record of success in the interview process, receiving offers from top-tier companies such as Meta. The format of the video will be roughly a 45- to 60-minute technical interview followed by a breakdown of how that interview went, where Kylie will be giving me feedback as the interviewer and I'll be giving her feedback as the interviewee. Some important context: I conducted dozens of interviews when I was part of the founding team of Posh Technologies. From roughly the size of just five people all the way until we were just under 40
people, I participated in most interviews, so I've seen the good and I've also seen the bad. The questions that we walk through in this video are an adaptation of the style that we used in one of our interviews for data science candidates at Posh, so I tried to make it as real-world as possible.

Welcome. How's it going today? Good, how are you? I'm doing pretty well. The weather's not been too bad, so I can't really complain. Exciting to meet you, and exciting that you're interviewing for this team. Great to meet you too. We've been looking for someone to fill this role for the past couple of weeks, and I'm excited by your resume and excited to learn more about you and walk you through the task that we'll be solving today. So, to start off: I read your resume, but tell me a little bit more about yourself and what you're looking for in your next opportunity, your next role.

Yeah, so I finished my master's and undergrad at MIT. I was doing a lot of electrical engineering and computer science along with physics, and in my next role I'm just looking for something more quantitative, something intellectually stimulating, and somewhere I can learn a lot. So I saw this job and I thought I'd apply, because based on the job description it sounded like everything that I wanted.

Yeah, definitely, and honestly from my experience we have a really hard-working team. We're working on hard problems, exciting problems, so it definitely fits that category. Just to break into that a little bit more: you said a lot of things there, computer science, electrical engineering, and physics. What made you study all these things? What excited you about that spread?

I was very interested in AI and robotics, and I really wanted to be able to someday create Jarvis, to have the tools to
create Jarvis, which by the way is Tony Stark's superhuman AI. And so I thought, with computer science I would be able to have the programming, the AI part of that, but then also with EE I would be able to literally build the Iron Man suit, which would be pretty cool. And then with physics, I've always been really curious about the evolution of the universe: what's out there, black holes, the galaxies, stars, all that good stuff. So when I took a relativity course it just really blew my mind that there are all these things that you don't even know are out there, right? Like, light bends around stars because of gravity. That's so cool.

Yeah, no, definitely, it's fascinating, especially exploring outside of my realm. I don't have that physics experience, that space exploration, all-of-that-fun-stuff experience, but I am fascinated by it as well. And also a fun fact, I'm sure you're aware, but Tony Stark was also an MIT grad, so you're in good company there. Well, I wanted to go to MIT. Smart.

Okay, so before we get into the real bulk of this interview, and I think you hit on it a bit with that, I'm curious to dig into this a bit more. In the past year or two, even just really the past six months, there's been a ton of developments in the artificial intelligence and machine learning space. I'm curious if there's anything that sticks out to you that you're particularly excited about, and if so, what that is and what excites you about it.

Yeah, so I definitely think that what everyone's talking about right now is ChatGPT, right? Because it's super disruptive. Everybody's Googling how to write their essays or how to code up some script, which is awesome. But I would say that with this AI stuff, I'm super interested in it, but more specifically I think I'm really interested in
reinforcement learning, and just getting a computer to beat humans at certain things, or robotics in the sense of self-driving cars. I think that's really interesting, being able to teach a robot how to interact and basically do what a human would do. Also things like superhuman chess AIs or superhuman poker AIs. I think that's all really cool because it's beating humans at a game that you would think is only for humans. Poker is built on psychology, math, and randomness. You would never expect there to be this Nash equilibrium, this optimal solution, right? But there is, which is really cool.

Yeah, it gets really interesting when you combine the psychology element of the human experience with machines, with AI. From my experience that's something that's often overlooked, and it kind of bothers me that it's overlooked, because it's so important to the concept of artificial intelligence if it's going to be that intelligent. So yeah, definitely, I think those things are super exciting. Awesome. Let's switch gears. To give you a little bit of context, my name is Keith Galli. I'm a senior data scientist here on the team, so the question that we're going to present is something that we're actively looking at here. So this is the task: you're working as a data scientist on the team here at Glitter. We're a social media site, maybe similar to something that you might be aware of, and our CEO, Mr. Elong Tusk, has been complaining a lot recently about the serious bot issue that, as he has mentioned, has infiltrated our site and become a serious issue that we want to look at, mitigate, and solve. Some of the issues these bots are causing: they're quickly responding when our CEO posts and, you
know, within a minute there are hundreds of replies promoting some sort of site, some sort of program, some sort of money scheme. So there are bots that are doing these quick postings, and there are coordinated bots all over the ecosystem that are sharing propagandist messages and fake news messages. So that's a little bit of context. Before I dive into some of the specific guided questions, does what I presented make sense? And first off, a question to you. This is not really a data science question, but do you have any thoughts on other issues that bots on this social media site might cause, and things that we should be thinking about when we're investigating this issue?

Yeah, so I definitely think that obviously bots are annoying, right? Nobody wants to be tagged in a million things. Nobody wants their entire notifications to just be bots spamming them. And I also think that bots take away from the real conversations that might happen on this platform. So, to reiterate your question: the current issue that you're trying to solve is that there are all these bot attacks, and immediately after posting, things get really crowded with these coordinated messages and spam messages, and you're asking what else?

Yeah, this is an exploratory question. It's not really data science, it's more platform driven. I'm just curious to pick your brain and see if you have any other ideas about potential issues with having serious bot problems.

Yeah, well, I think bots are there to target a specific group, right? All these money-making schemes: do you really think that they're making money, or is it just trying to scam people? And I know that especially in a lot of the
crypto world there are a lot of bots that might just be links to these fraudulent sites, which end up draining your wallets or stealing your NFTs, or all these different malicious schemes. And so that's definitely one reason why bots are bad, right? Because it's not just that it's crowding things, it's also that your users are literally the ones being scammed, and that does not create a good reputation for your company.

Yeah, I definitely agree, and obviously that's what we're trying to avoid. I'm going to get into more guided questions, but given this context and your background as a data scientist, do you have any initial thoughts about the problem, anything that immediately sticks out that you would want to investigate about this high-level issue? We'll dive into guided stuff, but I also want to gauge your high-level thoughts on the situation.

By thoughts on the situation, do you mean what my first thoughts are on how I would tackle this problem? Exactly, exactly.

Yeah, so with these bots you mentioned a few different things, right? That it's coordinated, that it seems like they're filling up immediately after a post has been posted, and it seems like there might be links to certain things, or certain themes that the attacks are always about. So I would start by investigating a bunch of these things and trying to figure out certain features of these attacks or spam messages that might let us say with certainty whether or not this is a spam bot attack. I also think that there are certain other attributes, such as if they tag a ton of people, or if a single account posts multiple times, or maybe the account follows a ton of people but nobody follows them. I think these could all be certain traits of a
spam bot, and when you have multiple of these things in conjunction with one another, that strengthens the case that this might be spam. So I think that those are all different things to consider, and maybe we can train a model based off of that to classify whether or not a message is spam.

Awesome. Yeah, you're getting ahead of even what I was planning on asking next, but this is exactly where the discussion is going. Because you've already hit on some of these points, I think it would be nice to drill down a bit more on some of these features that we're talking about here. So what I'm going to do, because I think it's helpful for both myself as the interviewer and you in brainstorming and thinking about these things, is share a Google Doc, and let's try to get some of this on paper. We might drill into some of these a bit further, and then I'm going to ask some additional questions about this process and we can dive into those aspects further. Okay, sounds great.

I will share it through the Zoom chat, one sec. Oh no. Can you see the screen? Yep. Not on Zoom anymore. Oh, not on Zoom, interesting. Oh yeah, there we go, let's try that again. I think it's because I switched the screen. I have two screens open, you know how it is. So can you see that on Zoom? Yeah. Cool. All right, so let's just start off. Hello as well. I'm going to delete that quickly.

What are some features we should investigate? I labeled the question as "What are some features we should investigate?" and you can interpret that as you will. Basically, let's drill down and get some bullet points of some of the things you just mentioned, because I think that'll be helpful for reference throughout.

Yeah,
so I definitely think that when you look at bot attacks there are probably a few main things to look at, but the two that stand out to me are the content of the specific posts, and the person or the account that's actually making the post. In terms of content of the post, these are things such as how many people they tag, timing of the post, or words mentioned in the post. What are other things? Maybe links, maybe images. Let's see, what else? When I make a post I type something, send it out, maybe tag... well, okay, by tags I mean accounts tagged, but I think keyword tags could also be another thing to look at.

Okay, so hashtags versus account tags? Yeah, that makes sense.

Yeah, hashtags. So I think that covers a lot of what would be in a post. And then for the account making the post, we might want to see how many followers they have, and maybe the number of accounts following or followed. So followers and following. And I think looking at a ratio between those might be a good idea as well.

Say that one more time, I missed that.

Maybe the ratio between the number of accounts that they follow and how many followers they actually have. I also think that if you want to look at the followers, maybe who the followers are. Because, for example, if we've already labeled a few accounts as spam accounts, then if one account's followers are completely spam accounts that we've already labeled, it's very likely that that account itself is a spam account, right? So, are the followers spam accounts? And something else could be the types of posts that they make, like how repetitive the posts are. So if the account is posting the same thing every time, like "hey, check out my YouTube video" and the same link every single time,
then it's likely that that's spam, right? So, content of account posts. And I think there's one more thing to consider, for both the content of the post and the account. I assume you have a report-spam feature, correct?

Yeah, we do.

Okay, so something that we could use is how many times it has been reported as spam, or how many times something similar has been reported as spam. I do think that is a big indication in terms of both the content and the account itself.

Yeah, definitely. I think this is a good starting list. Anything else that you can think of? Maybe think a bit outside the box here. It doesn't necessarily have to be outside the box, but one thing that I immediately see is that all these things are very public facing. I'm wondering, since we're the system administrators here, is there anything else on a more internal level that we might be able to factor in on the bots?

Well, okay, I do think that one internal thing would be this reported spam. Yeah, definitely. I also think another could be, you mentioned earlier that there were these coordinated attacks. So if you see a spike in the number of posts being posted within a five-minute interval or something like that, and the content of all these posts is somewhat similar, like the document distance is somewhat similar or something, then perhaps that could be a coordinated attack, right?

Yeah, okay. So similarity in amount of posts. If you just see this massive spike, then that might be some sort of indicator, so somehow capture that spike.

Yeah. So maybe I'll add a separate category, and this could be aggregated action, or aggregated things. So maybe one of them could just be similarity of posts in
a small time frame. Something else that Twitter might have access to, or the platform might have access to... okay, let's see. Things that are private to an audience but visible internally to Twitter might be the email addresses of the accounts. So that could go under the account making the post, so email address. I don't know whether or not you require all these emails to be verified, but maybe if it's not a common domain, or if there are a lot of random letters in the prefix of the email, then those could all point to it being more spam-like. And I think that this could be a little bit tricky, because back in the 2000s we all made really dumb email addresses, right? So this could get a little tricky, but I definitely think that there are indications of whether or not something is more spam-like.

Definitely, and I think that you made a really good point there with the domain name. For context here too, you're more likely to be able to trust a Gmail address than a random address you've never heard of, so some sort of weighting scheme could be interesting here.

Right. That's why I was hesitant about that as well, because, for example, you might have your own domain name, right? keithgalli.com or something like that, and so your email might be info@ this custom domain name, in which case we would want to make sure that people using that email address do not get caught up in this spam filter, and that we don't filter so harshly that it's considered spam just because of the email.

Yeah, the way that I would look at it, and I think you're hitting on this, is that a good domain could be an indicator of a valid account, whereas a bad one maybe doesn't say that much. I think that's kind of the point
you're making. Yeah. An interesting thing to note about Gmail, just because I've been through this experience: Gmail very much limits the number of accounts that you can make from a certain IP address, so that definitely would help capture some positive emails, because they could only make so many easily. Cool. I think this is a great list. The only other thing I might add here is post engagement, but I think this is encapsulated in a lot of the other stuff: likes, retweets, all that, as well as looking at the details of who is liking these things. As you mentioned with the followers, if it's always the same spam accounts that we've identified previously, it's probably not good engagement, not organic engagement. So I think this is a great list for this portion. We can move on to some of the other questions about the task, but we'll keep this here for reference, so I'm going to scroll down a bit.

The next thing I want to discuss, and you can maybe start jotting down some thoughts on this: okay, we have this idea of what we're looking for. Let's say our goal as a team was to build a model that could identify and potentially remove these bots. A big portion of training that model is having a data set to work with. So, thinking about the concept of a data set here, what might that data set look like, and how would we go about... one sec, let me just write this out as I speak it... and then how would you approach collecting that data? So kind of an open-ended question, but I think it's a good one to start discussing.

Yeah. So to first answer this question, I would look at the first part: what would a data set to train models look like? Okay, well, what exactly are the models that we're trying to train, right? And
I think that, of a lot of the models out there, many of them would look very similar, and specifically we want some sort of feature vector to be able to feed into these models. And that feature vector should probably summarize all the things that we just mentioned above, right? I think this question also comes down to: are we trying to tag the posts specifically? Do we have a model that is detecting spam posts or spam accounts? Those are two different questions that we're trying to ask, right?

Yeah. In the context of this question, I think it's good to be thinking about both, because one model probably wouldn't work for the other use case, but I was thinking specifically about bot accounts and trying to detect those bot accounts, and then what the data set might look like to help train the model to predict and detect unknown accounts.

Okay, so if we're trying to specifically target these accounts, then I might focus more on the attributes of the account making the post that we had listed above. The bullet points under that are: number of followers and number following, so counts; are the followers also spam accounts; maybe the content of the account, so maybe we can look at the last few posts; how many times this account has been reported as spam; and the email address of that account. I think that the most basic thing that we can do is just feed everything into a neural net, right? I'm going to leave this. For certain attributes I won't mention them yet, but a feature vector for that might look like: number of followers, number of accounts, or I guess number following, that would be a better way to describe it. Those are very black-and-white numbers, right? So there's no question about whether or not we can feed that into
a neural net, or some other model. Now, this next question, are the followers spam accounts, that's almost circular logic, so I will leave that off for now. The content of the account, I will come back to that. And then how many times it's been reported as spam, so number of spam reports on this specific account. And then the email address. Okay, so I think that this is a good start, but obviously it's not enough. So let's go with the email address first. The thing is, in here I would want to maybe... sorry, letting the sirens pass. Can you hear it anymore? I couldn't hear it before at all. Oh really? I never heard any sirens. Oh, okay.

So right now we have three different things: the number of followers, number following, and the number of spam reports. But I also think that maybe there's some value that we can assign to the email address for whether it's more spam-like or whether it's safer, right?

And maybe you're going to get into this, but what might a simple model for that look like?

Right. So a very, very basic model could just be: you have all these domain names that you know of that are popular, that are common out there. And we would want to get many different countries, not just what's popular here in the US. But suppose you have a bunch of domain names that are popular. Then this could be a very simple one/zero for whether or not the email address has that domain name, the opposite of the prefix. So one for Gmail... oh, you can't see that. Okay, but anyway, one for gmail.com, hotmail.com, yahoo.com, etc., and then zero otherwise. So that would be a very basic value for email. Now, to get into something a little bit more complicated, we could even train a classifier on just emails, right? If we had an entire data set of emails that
are good and bad, that are not spam and spam, we could train a classifier on that which would spit out a probability of how likely it is to be spam or not spam. And so maybe certain attributes of that email address could be, in fact, this domain name. So maybe we have some sort of one-hot encoding system for these, so a one-hot encoded domain name, and then maybe...

Can you just quickly explain what a one-hot encoded domain would look like, just to clarify?

Yeah. So basically it uses zeros and ones to encode a certain value. By that what I mean is, let's take these three, Gmail, Hotmail, and Yahoo, and put them into a list like this, right? So position one would be Gmail, position two is Hotmail, and position three is Yahoo. So with one-hot encoding, something@gmail.com would be encoded as one zero zero, something@hotmail.com would be zero one zero, and something@yahoo.com would be zero zero one. Now if I had something@random.abc, then this would actually get encoded as zero zero zero.

Got it.

But basically what one-hot encoding is doing is trying to assign a numerical value to something that is not numerical, because our computers are very good at understanding numerical values, right? And it's trying to make it so there's no association between the values. Something that you might be asking is why can't we do one, two, three, four and call it a day, right? That's because this would be telling us that one is closer to two numerically than it is to three, but that shouldn't matter in our case, because Gmail should not be closer to Yahoo than Hotmail is to Yahoo, right? They're just different categories. They're not more similar or less similar. So that's why we're using this one-hot encoding.

Got it, cool.

And something else that we could
do: I do want to use the characters in the email itself for this classifier. I'm trying to think of how I would do that. I do think that if there's a name or a word in the email, it's probably more likely to not be spam, right? So maybe having a list of common names and common words that we could check against. That could just be a zero/one binary encoding for whether or not common names or words appear in the email. And I think this one's also kind of tricky, because you have to consider different languages, right? This would be something to think about if we were actually building this model: we would need to think about the different languages, and different names in different languages, and stuff like that. So for now what I might do is just this simple encoding scheme where we get a bunch of common domains and then zero or one for that. I think that would suffice for now. Now let's go back to the content that the account is posting. I do think it makes sense to not scrape the entire post history of the account, because that might be excessive. If it's a true spam account, then the first five or ten should suffice, right? So what we could do is have certain features for post number one, post number two, etc., up to post number, let's say we do 10.
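As an editorial illustration of the account-level encoding discussed so far, here is a minimal Python sketch. It assumes the three example domains from the conversation and combines the numeric counts (followers, following, spam reports) with a one-hot encoded email domain. The names KNOWN_DOMAINS, one_hot_domain, and account_features are hypothetical and were not used in the interview; this is a sketch under those assumptions, not the interviewee's actual implementation.

```python
from typing import List

# Assumed list of "known" domains, in the order used in the conversation.
# A real system would need a much longer, international list.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com"]


def one_hot_domain(email: str, known_domains: List[str] = KNOWN_DOMAINS) -> List[int]:
    """Return a one-hot vector for the email's domain (all zeros if unknown)."""
    # Take everything after the last "@" and normalize case.
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return [1 if domain == known else 0 for known in known_domains]


def account_features(followers: int, following: int,
                     spam_reports: int, email: str) -> List[int]:
    """Concatenate the numeric counts with the one-hot encoded domain."""
    return [followers, following, spam_reports] + one_hot_domain(email)


if __name__ == "__main__":
    print(one_hot_domain("something@gmail.com"))   # [1, 0, 0]
    print(one_hot_domain("something@random.abc"))  # [0, 0, 0]
    print(account_features(120, 300, 2, "someone@yahoo.com"))
```

An unknown domain maps to all zeros, matching the something@random.abc example in the conversation. The simpler zero/one "is this a common domain" flag mentioned earlier would be any(one_hot_domain(email)).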
Yep. And now, scrolling down a little bit... indent all of this. Why did it make a bullet list? Was it a bullet list? I guess some of it was. No, but I did that. Nice. Just for organization. Yeah, I appreciate organization.

So this is basic, and something more complex would be this. Okay, so now let's think about the posts. With the posts, that would be a lot of this stuff from up here, which I'm just going to copy and paste down here. Okay, so the first one could be accounts tagged. What are the simple ones? The simple ones are things such as the number of spam reports for that specific post. The timing of posts, I think that can be easily converted to an epoch value or something. The only reason why I want to consider the timing of the post is because if, for example, the last 10 posts were posted within one second of each other, that probably indicates a spam bot, right? So that's the only reason why I want to consider this timing, and I think it would make sense to use some sort of integer representation, so maybe seconds or milliseconds since epoch, like so. And then next I would think about...

Just because with our team we have to communicate these things to business stakeholders who sometimes don't understand some of these details, could you explain in very simple terms what milliseconds since epoch means?

I think it's January 1st at midnight in 1970, or something like that. But you can think of it as literally just the number of seconds or milliseconds since a certain point in time, like 30, no, 50, oh man, 50 years ago. So it's basically discretizing time such that we can measure it as an integer. Okay, so then next I think we have to think about all these tags and the content in the post, right? I do think that one easy thing to do could be, for the words mentioned in
the post, to have some sort of, again, one-hot encoding for very common spam words. So just like "claim", "free"... I can't think of all the ones, but yeah. I'm not a spam bot.

That's good. We're not hiring spam bots right now.

So maybe a one-hot encoding of common spam words. And I think this could be similar for the hashtags as well, but I think this could encompass the hashtags, right? Because if we're just looking for whether or not these words appear in the entire string of words, then that would cover the hashtags as well.

Cool. And honestly, I'm just looking at the time here, and I do want to switch gears a bit. I think this is a good high-level description of the implementation. I do want to ask some follow-up questions on these things. So, circling back first: you really dove into the details of implementation. I think this is a solid approach, using this feature vectorization approach, and I think you did a good job explaining some of these details. But I do want to revisit one of the initial questions, which was: we have this way of building this model with these feature vectors, but at the end of the day we need to train these models and we need a data set to train them on. So I really want to drill down specifically on that: what would the data set that we need for this approach look like? Any thoughts on that? And I'm going to break this apart real quick and label this "implementation approach" or something like that to separate it a little bit visually. Does what I'm asking make sense?

Yeah. So you're just asking how I would collect that data, the data that I just described?

Well, yeah, so what I'm thinking is that the data you just described is, I think, fairly straightforward to collect,
but at the end of the day we need a I guess the two questions I want to drill down on is what makes an account a human and what makes an account a bot like how do you I guess get data on bot accounts and maybe human accounts like because at the end of the day we need that to make this a meaningful model right so I do think that um well okay so at the end of the day who is the person who can distinguish the bot right it's humans it's humans because when we get annoyed at seeing spam or something like that like that's when that's that's what makes that's what literally makes this a Spam post it's once some human decides I don't want to see this content and this is spam so I actually think that one of these things that I kept mentioning which was how many times has this been reported as spam I actually think that that could be our label right for whether or not the account is Spam or not yeah so um this this is how we can label spam because at the end of the day this is a this is a classification problem it's is it spam or is it not spam um and what our data is collecting is we're collecting all of these features but then we need a label for those features right and that label we want two different categories is it spam or is it not spam yeah um and obviously the more times something has been reported spam the more likely it is to be under category one yeah right so I think one of the most basic ways to do this uh could just be if it's reported spam over blank number of times then make that like then label that as spam so naive labeling system could be reported spam X number of times then label spam and this X this is really up to like what you're like how much data do you want to collect right because you don't really want to kind of fake that data because you know as I was trying to come up with these things down here it's it's pretty hard to fake being spam if you're not spam yourself yeah um and then for the non-spam I would just do like if no reports of spam 
then uh label as not spam and also I think that these no reports and these reports of spam like I think that you should put some sort of time frame on this so like since account creation uh for longer than a week or something like that if it's never been reported spam or if it has been reported spam then I think that can give you a good um indication or maybe not a week you would have to set the threshold right yeah but you don't want like a brand new account to be part of this system definitely makes sense okay I I you know I think this makes sense as a naive labeling system you know my big concern here is like I guess what are the problems potentially with this approach like you know how could we build this make this more robust I guess either you know maybe identify what could potentially be a problem here or potentially just mention an idea or two about how could we make this more robust yeah so I definitely think that this is relying a lot on user feedback right it's relying a lot on whether or not somebody has reported something as spam yeah um I do think like I mean okay so one slow but sure way that you could label uh certain things as spam or not spam is that you could just have like like you could just hire people to go through different posts and different accounts and say oh this one is Spam like we should use this as um a Spam account to model to model after or to include in our data set sorry but that would then be also kind of tedious slow um you would have to pay more just to collect data that you already have um another system might be hmm um sorry let me think for a second no problem so I really I do think that this like this reporting spam is a good indication because it's similar to how like Waze works yeah right how like oh is there well okay one thing that you could do is you could just Implement something that makes it more incentivized for users to correctly label things as spam or not um so for example uh you could have a larger an easier 
system of reporting spam or um you could have literally a direct question of whether or not something is Spam or not yeah um because I do think user feedback is probably one of the most important because after all you're building a product for your users right and like their feedback their classification of whether or not something is Spam or not that's going to be what's important um so maybe like if there's some in like maybe we can start tracking some of these things and just coming up with something naive like things that are obvious as spam accounts um if those specific attributes pop up then we could automatically just be like oh do we think that this would be spam like you could have a poll that presents to the the user that would make it easier to label as spam or not yeah um I think that that suffices because that kind of answers my question I think the only details that I would add from my end here is that you got to keep in mind that there's a lot of trolls on social media a lot of people that are Bad actors and when we're designing a system like this we need to keep that in mind there might be targeted attacks people labeling someone they don't like as spam so we also need to factor that in and so that would be one component I would add here another thing I would just quickly mention too is I I don't think it's actually a bad thing to do a little bit of human annotation I think a lot of you know tasks sometimes require that I think the important thing is we want to cross-reference our annotations because we might be paying a group we might not be monitoring them super closely so we want to make sure that they are not just ticking a random box so you know something to just think about right there right cool well I also do want to add one more thing to this is like maybe your threshold should not be like number of times but maybe it should be like ratio of like spam reports to engagement or something like that right yeah um because even if there are 
many Bad actors I do think that most people uh would act the way that you yeah would think that they act so I think like there can't there probably aren't more than like 10 or 20 percent like people who would deviate from that expectation so for example if like I think that's that's the importance of this threshold right that's the importance of saying well how strict do I want to be such that um like do I want the the stricter I am with this threshold the more confident I am that this has to be spam yeah right and so that's that's the importance of this labeling system is how strict do you want to be because the stricter you are the less data you have to train on is another consideration yeah okay that yeah that makes sense and I I think that that ratio idea is an interesting idea and you know the only really way to know is to actually try this out test it so it's hard to know with certainty but I think the important thing is having these different ideas that are worth playing around with and uh attempting okay uh I'm gonna ask one just because we're running uh up on time I'm going to ask I think maybe just one last question on this problem I'm going to stop sharing the screen you don't have to use the document anymore we can kind of just chat about this I'm just you know from a you know let's start thinking a little bit more about implementation of this um you know thinking about Python and maybe the larger Cloud ecosystem like you do kind of have a high level feel of how you might approach the actual technical implementation of you know building this model or anything that keeps comes to mind so maybe certain libraries certain uh Frameworks certain just like Cloud resources Etc I'm just curious you know from a implementation standpoint anything that kind of sticks out that you know you'd probably do from approach yeah so I would definitely say I'm probably the most familiar with like tensorflow so a lot of um how I would be modeling this is I would probably 
start with my data set and then I would probably go into like a Jupyter notebook or um a Colab notebook and I would try to mess around with it just to see you know what what is the simplest model that I can make to to have some sort of okay-ish um classification rate right and a few things that you could use in order to figure out how well your model is training like it depends on how important to you false positives or false negatives are right um and like more so than accuracy like I think those are things that you should be looking at and also I think that um yeah anyway sorry to get back to your original question is that's how I would start this and then in order to deploy this I'm actually not that familiar with deploying uh certain models but I would assume that you use some sort of like uh GPU to accelerate the training and then um or yeah sorry this is during still the training phase but maybe you're training a much larger model than the initial one that we came up with in our notebooks yeah so you would use like GPU some other resources to train it to accelerate that training process you would parallelize um and then when you actually deploy it I mean everybody uses like Cloud technology these days so I I feel like I just deploy it to some Cloud that serves the model and um and every single time we have a new post we can just filter it through or not a new post but maybe um some new account that's flagged or I don't know something like that we can just go through this model and say oh this might be a Spam account yeah would we want to do it a single post or would we want to like I I guess I'm just trying to think about that detail as like you know each post goes through this or are there any other things that we might want to do here well so the way that we implemented this it would be the specific accounts okay right um so I I do think that you would maybe wait for an account to have a few posts but then well so how I think the tricky part is how do you catch a 
Spam account before it posts any spam right and so the way that we've set this up it's it's to kind of go through existing users and label them as spam or not I do think it could be interesting for a future model to to maybe assess the first post that an account makes and determine whether or not it's a spam uh Post Yeah and then from there you know if it's if it determines that it's not and there's no way like maybe these newer accounts should just be ones that you keep monitoring for a while and you keep reassessing whereas older I guess you would say Legacy accounts might have um might have more lenience right yeah unless maybe a ton of people start flagging it yeah but uh but hopefully our hopefully our other implementation would be able to catch that yeah that I think that makes sense to me cool I think that's all I have on this prompt uh this was an enjoyable discussion uh I think you know it's interesting to think about I think there's a lot of different Avenues you can go so right I know we're running up on time I do want to just I guess quickly pause and uh just see if you have any questions for me before we kind of conclude this uh interview yeah definitely so I'm just wondering um like what are your favorite and least favorite aspects of you know your role as a data scientist at this company so far yes I think the big one is I mentioned it at the start of the interview is just um the team the team is really impressive you know it's a diverse set of backgrounds both people you know PhD type people from Harvard MIT but also just some really really brilliant self-educated kind of just learned the stuff on their own bring a different perspective to the table people that are on this team so you know from my perspective you know one of my biggest goals you know in this role is I want to continue to grow I want to continue to develop you know certain skills that I have like I come from a natural language processing background you know I can apply those skills 
but I don't have as much of that systems background I don't know how to you know deploy this on a production scale so you know what I enjoy um you know most I think about this role is just the opportunity to gain you know those insights to learn from people that have expertise and skill sets I don't because I think that that is helping me become a more well rounded um you know member of the team I guess on a separate note like that's very much like team based but also there's just a lot of flexibility to move around in the role like I've kind of stepped into more of a managerial type role recently whereas I very much was more of an individual contributor writing code you know on the front lines previously so I like that too uh on the flip side I think uh what I enjoy least uh you know this is I guess love hate relationship but you know work here is a bit of a grind like we work hard hours we you know it's uh you know to be classic and cliche you know Work Hard Play Hard wait sorry yeah Work Hard Play Hard that's the saying um so I mean it's uh we put a lot of hours in there's late nights sometimes but I think with that you you do get that learning but it's just like sometimes it's a balance it's like just making sure you know when to step back and and reassess and just like not get too burnt out so uh love hate relationship on that side but I think that overall I would take that over not being challenged yeah okay awesome um and how long have you been working there so far uh I think this is my fifth year so you know it's been a little while so you know I've yeah because I've had the ability to kind of switch around my role I think I've stayed kind of engaged and yeah I've just had opportunities to learn different stuff and I think that's kind of what has helped me stay this long right yeah yeah awesome because my follow-up question was going to be like what opportunities for growth are there like within the company and how how does the company support 
that growth yeah yeah um I mean there's a couple things I think just in general from from being in a team environment where there's a lot of just driven people that want to know the latest you know the state of the art like sometimes it just naturally kind of comes out during lunch discussions and whatnot some people are working remotely some are at you know the office but you know both channels like you know if I'm at the office we go out and you know some of this just the discussions just happen naturally you know at lunch or we sometimes host remote kind of like you know Zoom lunches where we all kind of just hang out and you know chat about things so that's like kind of an informal way from a formal perspective like we try to you know make sure that everyone that's here is Happy uh we you know have routine um you know performance kind of check-ins and perform like I don't I don't say performance as a like you know very strict like if you don't meet certain bar you're out but it's very much a constructive process where everyone on the team no matter who you are whether you're you know the CEO or if you're you know a junior engineer like everyone has performance reviews someone on the team is tasked with evaluating and having that conversation what you're doing well what could be improved so there's like constructive feedback across the board and we try to do that um awesome either every quarter or every other quarter typically every quarter we try to have something yeah sometimes it's more formal sometimes a little bit less formal but everyone's critiquing everyone no matter if you're the you know CEO uh or you know you're more Junior awesome um and then for my final question just how like what what would my first maybe six months kind of look like what would uh be some of the projects that I'm working on look like yeah so I mean I think one of the most important things from my perspective for anyone joining the team is we don't want to have new team members be 
siloed we want to make sure that they are exposed to what we have kind of out there on not just my team but you know surrounding teams so a couple of things that you'll be kind of doing is first off you'll be you know assigned immediately upon arriving a mentor that can kind of help guide you and um you know you'll specifically like this is a role for you know my team so like they would be kind of introducing you to our systems walking you through the code base you know introducing you to the other people on the team so one of the big things is like you have this Mentor that's kind of there to guide you everywhere every step but another important point is like we try to make sure you get one-on-ones with everyone on the team within the first month approximately at least within our small team so like you get to meet everyone on the team you get introduced to the code base so it's really just kind of getting warmed up getting comfortable seeing things yeah probably you know end of month one to like you know month three a big part of it will be you know tackling some kind of more entry level tickets on the systems really get your hands dirty plug in and and you know they won't be the most complex you won't be designing a system from scratch but taking our existing systems you know ironing out some bugs all that get more comfortable right uh after that you'll be kind of more potentially leading not leading but like kind of help co-lead certain initiatives yeah and that has a little bit more flexibility you can kind of get different ways whether you want to be really the one grinding out the code or you want to be more thinking about the design and all that and I think the only other thing is like kind of throughout this process we try to also introduce you to some of the teams that we work closely with so you'll definitely have some interaction with them but that does that answer your question yeah for sure yeah awesome well I don't have any other questions awesome so I 
think next steps from here is there's going to be a software engineering uh interview so that will test more your python skills and then also um likely it would be like a behavioral interview as well where uh just gauge a little bit more of your career interest and all that so next steps you'll hear back from us in the next few days probably and we'll go from there okay awesome well thank you so much for taking the time yeah it was a pleasure and uh nice meeting you and enjoy the rest of the day yeah you too thank you see ya bye all right end interview I'm not Keith Cali anymore I'm Keith galley you put me on the spot answering uh some of those questions at the end I had to tap into my uh intro to acting skills to like fill it but I mean honestly I feel like I could answer it how I would answer it if I was in that situation and kind of from prior career experience but I yeah guess what were you saying well I feel like a critical part of interviews is asking questions to look engaged and uh and like genuinely interested in the role right so I didn't want to be like yeah yeah yeah though that's fair and the reason I asked that is because that is an important point so I I just I wasn't I guess preparing to be on my toes and like getting ready to answer these but I think it was good that you did ask those questions and I think one thing that might be interesting is for you to like evaluate kind of my breakdown my responses and stuff and what you're looking for because I think one thing that I think about in these interview settings and one thing I think is important to keep in mind is like I very much it's an interview for you but I'm very much also like you're interviewing me and like for me when I was doing a lot of these interviews like I definitely got stressed out when I was interviewing certain people that I knew were like high level we really wanted to land this person so I feel like uh like keeping that in mind as an interviewee that like you know this 
is not a one directional thing like they're like I had nerves you know going into this even though this is a mock interview because like I wanted to ask the right questions I wanted to get the conversation going the right way and like that's definitely how I felt in real world situations as well where it's like it's definitely both directions especially if you really are looking to land a top candidate yeah so I definitely think that like overall I mean I enjoyed the interview I thought it was pretty like conceptual which is good I don't like when a lot of it's just memorization yeah um and I think that in terms of the questions well okay to be honest I wasn't like 100% listening that much because it's not a real company yeah um but I do think that I do think that being like concise in your answer is probably uh something good to practice so I do feel like um you did talk about a lot of things where you could have just said well I think your favorite thing is this because I have a lot of flexibility and like I have a lot of growth opportunities and uh that's really helped me Propel my career whereas a negative is just that I feel like you know work is kind of a grind sometimes but like but we're all here because we're really passionate about the product and so like we're always like we want this to work and succeed and I want to be a part of that team so like we work hard to make that happen just like being more concise with those responses um and I feel like you know I feel like when I ask these questions it's a lot of times me evaluating whether or not it's somewhere I'd want to work right so like if somebody does say like oh the work is a grind then it might like that is probably one thing that I'm prioritizing is that like I want a work-life balance or yeah um so I guess like as an interviewer something tricky that you could do is like before asking before being like Oh do you have any questions to kind of get a feel for what they're looking for 
and then yeah your response to that but also at the same time I feel like then it's not very like as an interviewee I want people to be honest and genuine with their answers right like I want to be able to evaluate for myself like like oh I I'm gonna expect to work nine to six every day and then have the rest of my time I'm gonna expect maybe once every once in a while to have to be on call but not every night and if then the reality is different from that then I would be very disappointed right yeah um and I would want to leave and so I think that those are things to think about but I do think like honesty is the best policy yeah um this is interesting because in all the interviews I did I never really got the chance to ask that type of question of like you know evaluate my own responses and trying to impress a candidate yeah well I definitely think uh one thing is like even when you're interviewing uh you think you like add a lot of fluff and things so I think that for example when you ask me things like oh did you get the question like you didn't even ask a question yeah at that point when you were like detailing the problem and I think maybe just adding more breaks in between kind of your monologue in a way like just being like oh does that make sense to give the interviewer you know some more engagement also a chance to ask questions is probably good and I don't think you have to rephrase something multiple times like when you ask what are some features that we should investigate regarding the bot issue just ask that and then if they have any questions they'll like ask specifics right yep that makes sense and I feel like I goofed there I definitely even in my notes I kind of wanted to pause and just like see if you understood the like overall I guess like foundation of the the it wasn't a question that was the issue I specifically didn't write question there like the premise of like the context of the uh the problem context I guess that was it there was no 
like question yet but I even though I wrote in my notes that I fixed I removed question it just said like problem context I think I said it out loud again so it's like good to know I think it's good to know to not be too fluffy I think I like doing that um so this is good insights from my perspective I do want to just like as part of the uh video also like give you some feedback I guess last question before I do that is like I guess any other feedback on I guess the premise of the question and like just the flow of that type of interview um no I do think this was a very interesting interview because it's not straightforward right it's not like I can just mention the various features that I would put into a model like I do think that it required some time to think about um I do think it was very open-ended and I think that's good in some ways because it forces me as the interviewee to ask for specifics and um and you can you can tell a lot about a candidate by the questions that they do ask and I also think that for example when I was trying to go through you know what would this data set look like or how would you approach collecting that data like I do think that um the one tricky part that I kind of faced was after I did this like naive labeling system you're like well what is what is wrong with that and in my head I'm like well obviously you can do something more complex but like I think as a baseline like there's nothing wrong with like what I said right because it just depends on where you set that threshold so I would actually unless an interviewee is like blatantly wrong I think from the perspective of an interviewee like I don't like I'm also under stress right like I don't want to just I don't want to hear like oh that's wrong yeah um you didn't say it was like wrong but you just said oh there might be some flaws with that like like can you think of anything and so I tried my best to think of some things but I do think that like maybe if you had said 
well this you know user feedback is really great but can you think of anything specifically with like this specific part of that that things might go wrong or something like that yeah I think the ish the challenge is a balancing act it's like on the one hand I don't want to be leading I don't want to like be like hey you could think about this specific thing so I try to like I think maybe I could be more positive instead of saying what's wrong you know just saying how can you improve this and just leaving it at that I think that would also be good because I think when you said like like there are some flaws here it made me think like oh this system is just like yeah like I should just not be using um like reports of spam but like I think that you genuinely should be using reports of spam if you're trying to label your data right so I think like I think if you had asked like oh what's a way to improve upon the system that would make me think okay I'm on the right track but yeah what are some flaws that I can think about there yeah I definitely agree with that I think just one note to add to I think that from an interview's perspective I think that that's the best approach that you what you mentioned from your perspective one thing that you could potentially employ and whether or not you want to do this it's kind of up to you I've seen this before in real interviews that I've done is like you're welcome to push back at that if you do it in like a you know a I don't friendly is maybe not the way but like in a professional way and say actually I think that this is a fine like I don't think there's anything inherently wrong with this I think this is a good Baseline approach we can improve it like you can kind of spin it that way too obviously with nerves and stuff it's tough to do that but like I've been at an interview where we like we kind of got into a debate but in that I learned a lot about how they think that they're willing to stand up for what they believe in 
and like they weren't ever aggressive but they were able to like stand their ground argue their points but be willing to listen to my side and like kind of we could come to an agreement so whether or not you want to play into that you know is up to you sometimes interviewers might be cruel and they might purposely get on your nerves and be like make you think you're doing something wrong even though you're not but uh yeah it's an interesting I think from both sides the right you know it's a chess match a bit there um yeah no I agree okay I don't have anything else yeah so from my perspective just to add some details I guess I mean overall like I mean some high level things like the reason that I would approach an interview like this is because if this is like a high level position I assume you know how to code I assume that you know like the fundamentals that you can program that I'm not worried about that as much I'm worried about how you are as a thinker and not even like how you are as a thinker but also how organized are you as a thinker because I could potentially interview someone that's very brilliant but they might just kind of ramble on in a way that I can't follow and I just like even though I understand they know what they're talking about I just I need to think about it in the context of a bigger team and if they're not being straight to the point and like being able to flesh out in a way that's easy to follow their points then that's not probably someone I want to work with even if they're brilliant so I think you know some of my first comments is that I think you did a great job at like not only explaining your points but also like being able to uh explain your points concisely I think once you started having to write down the Google Doc one thing I really appreciated and I don't always see is like you are very cautious to format it nicely which is a big thing for me as a reviewer it's like a small detail but I think it goes a long way where it's like it 
makes me it makes it easier for me to understand things it makes it easier for me to under like follow up on certain points so like structure there is good solid like formatting and all of that so like yeah all solid there I think overall like I think a lot of just good discussion I think you got to a lot of points without me even having to prod to that it's like you kind of got to some of the like Crux of like oh these are the features I want to do a uh you know build a feature Vector of all this stuff and like I had specific questions hoping that you would get to feature vector and stuff but you kind of just brought it up on your own which was solid I think it was a good use of like thinking out loud like making sure that you're going through your thought process and like you know saying what you're thinking as you're you know writing it out as you're saying it just it allowed for more discussion allowed me to just see how you think well I definitely think that like so because it's being put on YouTube I definitely think that one thing to mention that like people who are kind of newer to Tech interviewing and stuff like that like I I don't think it's common knowledge if you're new to the space to talk out loud during these interviews like I I don't think I think a lot of people just think oh they asked me this question I just need to give the answer no like you need to um like what they really want to know is how you think right not whether or not you'll get to the right answer because what an interview what an interviewer hopefully should be doing is if you're going on a completely wrong path kind of guide you to the like right answer right and I really enjoy interviews where it's more like a conversation where I'm not just talking to myself where I can bounce ideas back and forth because that also shows me like this is what it would be like to work with this person on a team um so you really want to be able to display how you think and how you arrive at the 
answer rather than just saying an answer and then justifying it yeah yeah definitely yeah I think that's great to pull you know to bring up I think that it is kind of weird it's kind of different and new same thing with like writing things down during an interview like we use Google Docs here but like traditionally if this was in person like it'd be a whiteboard or something I'm like I think some people aren't prepared for that too so like knowing that this might be what you have to do but I think yeah speaking out loud is super important um let's see what other notes do I have here so some like small details like overall like from my perspective as an interviewer like I thought you had a solid performance like I I like good performance like I'd be excited as an interviewer to come out of that interview like like this person hit the main bullet points that I had so like you know it would really come down to what do the other candidates look like but like definitely you know like sometimes you come out of interview you just immediately know like this candidate was not right uh sometimes you come out of it and you're like uh you know maybe like I definitely came out of this interview and I was like yeah that was a solid performance like this person knows what they're talking about they think clearly so like good from there so like I'll be nitpicky on things because like overall good performance uh one thing that you kind of did and It ultimately was my next question but like I would have liked to see a little bit more discussion when I brought up the data set you kind of skipped to the model implementation um and so like I didn't stop you I didn't pause you didn't bring you back because that was the next question but just kind of keeping in mind what the question was at hand whereas you ended up focusing on implementation which was important was the next question but I think like I would have liked to walk through that I guess in a specific manner so small nuanced 
detail like kind of annoying me being annoying but like just something to consider just keep in mind are you answering the question that was asked another detail this is kind of at the end and there is like little things that I'm like bringing up uh it's not like a ton of I think things one thing at the end I think I think it was good that you mentioned like hey I don't have that much information like knowledge in the system side of things like you shouldn't try to lie to your interviewer like I think it's really impressive when someone knows when to say hey I don't know that much about that so like I think it was good by like stating that explicitly and not trying to like you know spin off some information on random cloud services that you actually don't know much about because I might ask you details about those and then you might be like I have nothing so it could catch you in a weird situation I brought it up there was because you asked a specific question about the deployment of this right like yeah I would never have said it myself like I don't know that much about this but it was literally just because you asked the specific question that I did not know the answer to that I brought this up and like I don't think like I I think that at least to me when I used to interview people like to me it's it's okay if somebody doesn't know something right like what I want to see is that they're teachable and that they can learn and that they can think so yeah and I think you bring up a good point though because you're kind of the point is I asked you about it and you said I don't know much you don't want to call attention to things that weren't brought up you don't want to be like Oh I'm actually like a junior engineer like I don't actually like know that much code like don't bring up things yeah um I definitely see that uh sometimes in like interviews that I do bringing up things that they shouldn't bring up like be very confident professional like know your stuff fake it till
you make it and if they like bring up something specific that you know you can't talk that much about be honest but don't throw that out on your own like let them bring it up so some things that could improve on that idea I think that it was good mentioning of like you know obviously we're gonna probably use a GPU probably gonna like parallelize things I think one thing that might have been interesting to say upon that type of answer to dive into some of the more technical details is I think that we would maybe just explicitly kind of mentioning like tokenizing maybe some of these tweets and whatnot or whatever you want to pass into the system um but also I think a big thing that you could have brought up was like batching and just explicitly mention like you probably don't want to take one tweet at a time because that's gonna like like if we think about all of Glitter as a parallel to Twitter obviously uh there's so many tweets happening there's so many potential bot accounts that if we were trying to handle these on an individual level like it would just be insane and like intense processing so we would need to like you know maybe we have a window open for five seconds or something that like collects a big batch of these and we have our model set up in such a way where it can process that full batch and spit out that output and then we have to parse that output so that would be like I think one way to improve that but I think it is on the deployment side uh that is you know maybe not your expertise but I think the batching could have been something interesting to bring up um I'm trying to think of other specific components I think the challenge is I like when I'm in an interview with two people where one person's writing the notes and one person's really leading the interview uh because I think one thing that's challenging sometimes is uh writing those notes down as you're trying to listen and let me see I don't have too much like I guess specific like criticism
that I wrote down I might like review afterwards and provide additional things other like positive notes you know for people listening in whatnot um like I I appreciated when we talked about like you know anything that is exciting to you I appreciated that you kind of like didn't just go oh ChatGPT you kind of like mentioned ChatGPT because I feel like that's right now what I expect everyone to say so it was nice to see some interest that was aside from that that was unique and different so like you know you don't want to blend in with every other candidate even if you want to say something like ChatGPT offer something unique and different that someone's going to remember you by and I think that talking about reinforcement learning talking about you know superhuman chess AI poker AI spins up a more unique conversation uh other I guess potential thing that could have maybe improved things a bit and maybe this was lack on my part of like I called this a question when it wasn't a question so it wasn't super clear but I think there are some other I think um problems with Bots that you maybe could have brought up I think you didn't add that many new details and this wasn't a data science question but from my perspective as an interviewer I'm going to challenge your domain knowledge like if you're at a company I expect you to know a bit about the company right um so like some of the things I think that potentially I think there could have been other I guess problems with Bots that could have been brought up that were kind of specifically unique and different from things that I brought up uh like one thing you could have even mentioned is like it might hurt our metrics we might you know if we have all these Bots like it's hard for us to report accurate metrics to people like advertisers that you know might need these accurate metrics to you know be able to budget properly so that could cause those problems down the road it's outside the box it's different but it might
have been something interesting to bring up just to show you're really thinking about this uh other things like you know Bots and this kind of came up with the following followers and who people follow so it was kind of indirectly answered this but like if Bots are liking people's stuff really quickly on like let's say there's a reply and maybe a bot didn't reply to it maybe a human replied but that human account had a bunch of bots that immediately liked it it might give undeserved prominence to that reply uh tweet or whatnot as a bot issue so you know these are small things I'm poking at um but just it just takes things I guess to improve further uh yeah I don't think I have anything else yeah I don't have anything either all right that's all we're going to cover in this video if you enjoyed make sure to throw this video a big thumbs up and also subscribe to not miss any future videos if you have questions about the job interview process or have feedback on this video make sure to leave a comment down below huge shout out to Kylie for joining and making this video possible that's all I have for this video as always thank you for supporting the channel and until next time [Music]
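
The batching idea raised in the feedback (hold a short window open, collect the tweets that arrive, score them in one model call, then parse the output) can be sketched roughly as below. This is only an illustration under stated assumptions, not anything shown in the interview: `MicroBatcher`, `toy_bot_scores`, and the `max_batch_size` cap are hypothetical names and parameters, the scoring function is a keyword stand-in for a real model's batched forward pass, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


def toy_bot_scores(texts: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a model that scores a whole batch at once."""
    return [1.0 if "free followers" in t.lower() else 0.0 for t in texts]


@dataclass
class MicroBatcher:
    """Collects tweets and scores them together instead of one at a time."""

    score_batch: Callable[[Sequence[str]], List[float]]
    window_seconds: float = 5.0   # the "five seconds or something" window
    max_batch_size: int = 1024    # assumed cap so a burst cannot grow unbounded
    _pending: List[Tuple[str, str]] = field(default_factory=list, init=False)
    _window_start: float = field(default=0.0, init=False)

    def add(self, tweet_id: str, text: str, now: float) -> List[Tuple[str, float]]:
        """Queue one tweet; return scored results if this call triggers a flush."""
        if not self._pending:
            # The first tweet after a flush opens a new window.
            self._window_start = now
        self._pending.append((tweet_id, text))
        window_elapsed = now - self._window_start >= self.window_seconds
        if window_elapsed or len(self._pending) >= self.max_batch_size:
            return self.flush()
        return []

    def flush(self) -> List[Tuple[str, float]]:
        """Score everything queued in one call and pair scores with tweet ids."""
        if not self._pending:
            return []
        ids, texts = zip(*self._pending)
        scores = self.score_batch(texts)
        self._pending = []
        # "Parsing the output": map each score back to the tweet it belongs to.
        return list(zip(ids, scores))


batcher = MicroBatcher(score_batch=toy_bot_scores, max_batch_size=3)
batcher.add("t1", "hello world", now=0.0)
batcher.add("t2", "Get FREE followers now", now=1.0)
print(batcher.add("t3", "nice photo", now=2.0))
```

In a real deployment the window would more likely be driven by a timer or a stream consumer than by the arrival of the next tweet, and the texts would be tokenized before being handed to the model; both are left out here to keep the sketch small.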

Info

Channel: freeCodeCamp.org

Views: 178,833

Id: sD468LfeVdc

Length: 85min 4sec (5104 seconds)

Published: Mon Mar 13 2023