Solving Real-World Data Science Interview Questions! (with Python Pandas)

Captions
Hey, what's up everyone, and welcome back to another video. In this video we are going to be walking through a bunch of real-world data science interview questions, and we're going to split the video into two parts. In the first part we'll do coding interview questions using Python pandas, and in the second part we'll do non-coding questions that really make you think at a high level and judge how well you can approach new problems with unique, novel solutions. I've spent a number of years interviewing and hiring candidates, so throughout the video I'll sprinkle in some advice, from the interviewer's perspective, on what we look for in a successful candidate. I think there's a lot of value in this video, so I am pumped to get into it.

Before I do, I want to give a quick shout-out to this video's sponsor, Brilliant.org. Brilliant is an online learning platform that offers courses on topics ranging from math to physics to computer science. The courses are structured in an interactive manner, so instead of reading long articles and trying to digest all the information, you learn by doing: they have fun, relatable examples that make it a lot easier to understand sometimes very rigorous, very difficult concepts. Some of our solutions to these coding questions will be pretty technical, and to really have a good understanding and intuition for how to approach them, it's very important to have a strong computer science foundation. A couple of Brilliant courses I think are particularly useful for all of you are the ones on algorithms and data structures; they cover important topics such as arrays, trees, sorting, algorithmic notation, recursion, and much more. If you're interested in learning more and signing up, I have a link at the top of the description of this video: brilliant.org/KeithGalli, where you can get signed up for free. Also, if you use that link and want to sign up for an annual membership, it will give 20% off to the first 200 people that use it. Once again, that is brilliant.org/KeithGalli. All right, let's get into the video.

In this video we're going to be doing problems on a site called StrataScratch. It's free to use, though they do have some premium-type questions that you have to pay for; all the questions we'll do in this video are free, and for your convenience I'm going to leave links to every single question in the description, so check that out to follow along. Basically, on this site you have a bunch of coding questions, a bunch of non-coding questions, and a lot of really good stuff coming from, you know, FAANG-type companies, so very real-world material. We're going to start with a progression: easy, then medium, then hard on the coding side, and we'll do something similar with the non-coding questions. To find the free questions, you can go to free questions and search by difficulty. Say we want to start with an easy one and sort by Python: let us do this Microsoft question called "Finding Updated Records". In these types of questions, you'll see the actual question over here on the left, the area for you to write code over here on the right, and buttons to run your code and check your solution; once we run our code, the output shows up down here.

Let's go through and read the question. We have a table with employees and their salaries; however, some of the records are old and contain outdated salary information. Find the current salary of each employee, assuming that salaries increase each year. Output their id, first name, last name, department id, and current salary. Order your list by employee id in ascending order. Cool, that
seems fairly straightforward. Let's see what type of objects we're working with: we have first name, last name, employee id, salary, and department id. What it sounds like to me is that for certain ids we'll have multiple rows in our data frame, and we can actually see that already in the sample output. So what we're going to do is group on those employee ids and only grab the max row. Feel free to try this problem using the editor on the right; you can pause the video and resume when you're ready to hear my solution.

The basic approach I'm thinking of: we could sort this entire data frame by salary, highest to lowest, and then take only the first occurrence of each id. What we'll be left with is all this information, but only the max row for each person. If we sort highest to lowest and then, for each unique employee id, only take the first occurrence going top to bottom, we drop all of that fluffy, outdated information. That's how I'm going to approach it, but there are many ways you could do this.

All right: employee_salary.head(). One thing I want to show you about how this platform works: if I click "check solution" right now, it's not going to pass, and we'll actually be shown the solution, which is maybe not ideal, but you'll see what's wrong with your attempt. When you actually solve it correctly, it will say "solved" over here. Let's resume.

I think the first thing we'll want is a sort_values. I'm writing some of these commands from memory, but if you ever forget one, what I always do is Google how to do it with pandas and look at the official pandas documentation. Another resource, if you haven't seen it already, is my Python pandas overview video; it's definitely worth checking out and will give you the core knowledge you need for these types of problems. So we're going to sort_values on the salary (my voice just cracked), and the other argument we pass is ascending. I thought that by default ascending is False, so we wouldn't need to pass anything; I'll call this sorted_salaries and print out its head. Alt+Enter to run the code. Look at that: it's actually lowest to highest, so we want to set ascending=False explicitly. This is a command I use so frequently with pandas that I just remember the ascending argument. Look at that: Julie Sanchez, absolute beast with this high-end six-figure salary. I like that.

Okay, so how do we drop the later instances of each id after the first time we see it? I'm going to use Google to my advantage here, because I don't remember this command off the top of my head: "only take first occurrence python pandas". Okay: drop_duplicates, with subset equal to a column. This should be what we want, and I believe it works top to bottom; you can look at the manual for more information, but I'm going to try it. So on sorted_salaries, I'm going to drop duplicates on the id column: once we see another occurrence of an id, we drop that row, and we should get the answer we want. Also, let's be descriptive with our variable names. If you're doing an interview, one of the big things an interviewer looks at when you write code is whether it's easy to follow, whether it's descriptive, whether you use meaningful names; so I don't want to slack off, and we'll complete this like we would an interview question. I'll call this current_salaries and take its head. Obviously my first run errors, because I defined the variable under a slightly different name, but after fixing that, the result looks good to me.

Let's check what we want to output: their id, first name, last name, department id, and current salary, ordered by employee id ascending. So the last step is a sort_values by id; ascending is the default, so that's fine as-is. I'll call this result, and since the platform displays the last line of your script, writing result by itself shows all the information. That looks good; now let's see if we solved it properly... let's go! Cool. So this was a fun little Microsoft question, easy difficulty. It wasn't too bad, but that obviously depends on experience level: it would be very tricky if I were just starting out with pandas, but I've seen this type of thing before.

Let's move on to another easy one. Going back to the questions list, let's see what we have available: Airbnb, I like that; Meta/Facebook; Yelp. I'm going to do the last one we see here, "Number Of Bathrooms And Bedrooms", from Airbnb. All righty, what is the question? Find the average number of bathrooms and bedrooms for each city's property types. Output the result along with the city name and property type. When I hear "average number of bathrooms and bedrooms for each city's property types," I hear multiple things: "for each city," so we're grouping by city; "for each city's property types," so we're grouping by not only city but also property type. That's just screaming Python pandas group-by functionality, and then we take the average once we've grouped by those values. This should be a pretty standard problem; it is definitely something you would do all the time in the real world.
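Before moving on, that first Microsoft problem is worth condensing into a runnable sketch. The rows below are made up and the column names are taken from the problem statement, so treat this as an illustration rather than the site's exact schema:

```python
import pandas as pd

# Hypothetical sample shaped like the employee_salary table
# (column names assumed from the problem statement).
employee_salary = pd.DataFrame({
    "id": [1, 1, 2],
    "first_name": ["Julie", "Julie", "Ken"],
    "last_name": ["Sanchez", "Sanchez", "Ito"],
    "department_id": [10, 10, 20],
    "salary": [90000, 120000, 70000],
})

# Sort highest salary first, keep only the first (max) row per employee id,
# then order by id ascending, as the question asks.
current_salaries = (
    employee_salary.sort_values("salary", ascending=False)
    .drop_duplicates(subset="id")
    .sort_values("id")
)
print(current_salaries)
```

Note that drop_duplicates keeps the first occurrence by default (keep="first"), which is exactly why sorting by salary descending beforehand matters.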
You'd be evaluating, you know, a CSV like this one, and there's a lot of information in it; property type, city, and then bedrooms and bathrooms are what we want to average. Pause the video, try to solve it on your own, and resume when you're ready to see my solution.

All right. With all these problems, I like to just run the code first and see what we're working with. This looks like a fun data set to play around with; I kind of want to see the raw data, because I feel like you could do all sorts of cool stuff with it. Let's look at it on a slightly wider screen. A lot of information here. One thing you can always do, if you're overwhelmed by how many columns a frame has, is start filtering columns by passing a list of what you actually care about. Property type, city, bedrooms, bathrooms: these are the four fields we need, so I'm going to filter down to those immediately, ignore the other stuff so I'm not overwhelmed, and save this in a frame called airbnb_key_details. I think that's a descriptive enough name for this problem. You can also toggle this panel if you need more room, and I can make my screen a little bigger.

Now, pretty much every step of the way, I like sanity-checking myself by printing the head. Cool, this is pretty straightforward. I'm very particular about how I do things, so it bothers me that property type comes before city; I'm going to put city first. The question says to output the result along with the city name and the property type, and I don't think the column order matters for the output, but it makes more sense to me to see the city before the property type. There's no right or wrong answer there, though.

All righty, so now we need to group by the city and by the property type. We can do that by passing a list to the group-by functionality: city is the first column, property type is the second. Let's run that real quick... what the heck. I think if we then call mean, it will average everything and give a much more meaningful result. All right, look at that: those are averages, but it's not super helpful yet, because we don't see the city or the property type. There's a helpful little command for this: when you do a group-by, you're specifying that the index is city and property type, so if we call reset_index, it puts those back into the data frame as ordinary columns, and we should see four columns again. Look at that.

Ooh, I'm a Boston native; it's where I've spent like eight years of my life, so I like seeing Boston up top here. I wonder if this is real data; I assume it is, and these would definitely make sense as averages for Boston: not much room. LA is similar; New York City is definitely... oh wow, what the heck. Oh, a villa, that sounds nice. With these types of problems, it's good to do what I'm doing right now: sanity-check whether things seem to make sense, and in this situation they do. So I think we take the leap of faith and check the solution. Also, I might save this as something like property_averages; I think that's a good descriptive name. Run the code one more time and check the solution... look at that, let's go! That was a fun one, and definitely different from the first type of problem.

Cool, so those are two easy questions; let's progress into some medium and difficult ones. Back to the questions list: I'm switching the difficulty level to medium, still on free questions. Next, let's do this problem from Google called "Counting Instances In Text". The task is to find the number of times the words "bull" and "bear" occur in the contents. We're counting the number of times the words occur, so words like "bullish" should not be included in our count.
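Since exact-word matching is the crux of this problem, here is one hedged way to do that kind of counting with a word-boundary regex. The rows below are invented, and the table and column names are assumptions, not the site's exact schema:

```python
import re
import pandas as pd

# Hypothetical contents; "bullish" must NOT count toward "bull".
gfs = pd.DataFrame({
    "filename": ["a.txt", "b.txt"],
    "contents": ["Bull market! Very bullish. bear...", "bear bull bear"],
})

# \b marks a word boundary, so r"\bbull\b" matches "bull" but not "bullish";
# re.I makes the match case-insensitive.
for word in ["bull", "bear"]:
    pattern = rf"\b{word}\b"
    gfs[f"{word}_count"] = gfs["contents"].apply(
        lambda text: len(re.findall(pattern, text, flags=re.I))
    )

# Total occurrences of each word across all files.
print(gfs[["bull_count", "bear_count"]].sum())
```

The raw-string prefix (r"...") matters here: without it, Python tries to interpret \b as a backspace escape before the regex engine ever sees it.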
Output the words "bull" and "bear" along with the corresponding number of occurrences. Okay, this seems straightforward enough. It's a bit different from a normal data frame problem, but the way I think of it is that we can very much use a regex to match "bull" and "bear" exactly, and not, you know, "bullish", etc. So we'll combine regexes with pandas to do these counts, and then do a little extra work at the end. This is definitely covered in the Python pandas tutorial I've done in the past, so check that out if you're confused. And what are we given? Just the file name and the contents, so I'm going to run this real quick.

We need to count within the contents for each file name. How can we do that? What we're really going to need is basically two filters and then a count, I suppose. We could filter the rows based on the word: I'll call one result bear_words and the other bull_words, and I might first set gfs = google_file_store, just so we have an easier name to work with while processing. Feel free to pause the video if you don't want to see my solution right away.

Okay, so it's going to be gfs, and we want to filter where the gfs contents... I cannot type... .str.contains the word "bull". You might think you could just match "bull" directly, but the issue, as they mentioned, is that "bullish" would be counted too. Also, is the text all lowercase? It's not, so another pre-processing step could be to lowercase everything; but str.contains has a built-in flag you can pass to ignore case, so I'll probably just use that instead.

So, contents that contain "bull": we can do this with regexes, so you have to know a bit about them. Let me see if I can find a cheat sheet; I'll put it in the description. The key thing we want is the backslash-b word boundary, \b. It will match at spaces, at the start of a line, and so on, so it helps us pick out exactly "bull" but not "bullish". It helps, too, if we pass the pattern as a raw string, so r"\bbull\b". And we should be able to pass in the flag; I'll pull up the pandas str.contains documentation real quick. The last argument we pass is flags, and one of the flags we can pass is the regex ignore-case flag, so I'm going to import the regex library and pass flags=re.I, which is just the shorthand for re.IGNORECASE. We can copy the same exact code to capture the bear words. There might be a way to do this in a single line, but I think it's easier split into two; you could also make a function out of it and pass the words "bull" and "bear" in, if that made sense to you.

I think that will basically work, but let's print some things: the length of gfs, the length of bull_words, and the length of bear_words. Three, zero, zero... that's weird. I guess there are only three rows; I'm surprised. Oh, I see: we're not just detecting whether the text contains the word, we're actually counting occurrences, so it's not a one-or-zero situation. It's weird that this doesn't pass, though; I wonder if this would make it pass... yes, all three rows have both words in them. You had to pass this as a raw string, because otherwise it was trying to interpret the backslash-b as something else. Okay: so this code right here is what would have worked if we were just asking "does it contain the word bull or bear"; because we need to actually count the occurrences, we have to do it slightly differently. I'm going to comment these lines out real quick.

We can do gfs["bull_count"]: it's going to be built from the gfs contents, which gives us the string for each row, and then... let's think. We have a string; what function can we apply to it to get the number of matches? The tricky part is that we don't want a plain count, so let's search "count regex matches python". All right: we'll apply a function to the column that uses the regex to get the count. I'll do .apply(lambda x: ...), where x is the string value in contents; we could call it something more descriptive, like text. So: lambda text, and the result is re.findall of that same pattern on the text object, I believe that's the syntax, and then we take the length of the results.

Okay, I think that's applied correctly; let's look at gfs and see what we get for that column. One: looks good. One: looks good. One: are there any more bulls in this row? It's kind of annoying, because we could just literally count by eye, but I think that looks good. Let's copy this line and do the same for bear_count. Finally, the output wants the words "bull" and "bear" along with the corresponding number of occurrences. I'm not quite sure what format they want, but I'm going to sum the counts, label them "bull" and "bear", print them out, and run the code. Oh, and then we need the counts as a frame; if we reset_index, I think that will do the trick. Look at that; hopefully they like this... yay! I'm curious what their expected output was; I guess it looks exactly like ours.

It took us a little while to get to that solution, because we didn't do it properly the first time around, but we used that knowledge to get all the way there. I'm very curious how the other users' solutions are structured in this type of situation, so I'm going to look real quick. Really interesting; this one is unique; what the heck. It's cool that the site lets you see this. I don't like this .str.count of "bear" and "bull", though, because I think "bearish" and "bullish" would find their way into those counts. There are different ways to solve these, and you can look through all the solutions. One thing I'd definitely recommend, if you're doing this in an interview-type setting, is to write a test case: build a data frame that has the word "bullish" in it, and check whether your lines of code count the occurrences properly. Because we only had three rows in our actual answer, we couldn't really see; that's what you want to sanity-check when you do this out in the wild.

All right, let's do another medium-difficulty question. Scrolling through these, let's do this "Customer Revenue In March" question from Meta/Facebook. The question: calculate the total revenue from each customer in March 2019. Only include customers who were active in March 2019. Output the revenue along with the customer id, and sort the results by revenue in descending order. Feel free to pause the video and resume when you want to see my solution.

Pretty simple stuff here. I see that we have an order date of type datetime64, the NumPy type; we'll definitely capitalize on that. It all looks pretty straightforward: to solve this, we filter by March 2019, group by the customer id, and then sum. Okay, so: filter, group by, sum.
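That filter / group-by / sum plan can be sketched on toy data; the column names below are assumptions based on the question, not guaranteed to match the site's schema:

```python
import pandas as pd

# Hypothetical orders table (column names assumed).
orders = pd.DataFrame({
    "cust_id": [1, 1, 2, 2],
    "order_date": ["2019-03-01", "2019-03-15", "2019-03-20", "2019-04-02"],
    "total_order_cost": [100, 50, 300, 75],
})

# If the date column comes in as strings, convert it so .dt works.
orders["order_date"] = pd.to_datetime(orders["order_date"])

MARCH = 3  # named constant instead of a floating magic number
march_orders = orders[
    (orders["order_date"].dt.year == 2019)
    & (orders["order_date"].dt.month == MARCH)
]

# Sum revenue per customer, then sort descending by revenue.
march_totals = (
    march_orders.groupby("cust_id")["total_order_cost"].sum()
    .reset_index()
    .sort_values("total_order_cost", ascending=False)
)
print(march_totals)
```

Note the parentheses around each boolean condition: combining filters with & in pandas requires them.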
I might even just write that down; in an interview it's good to state what you plan on doing, like: filter March 2019, group by customer id, then sum the results. That's what we want to do.

Let's start by looking at what the data looks like. The order date is kind of hard to process as-is, but we'll make this much easier by augmenting the columns a bit. We can take the order date column and access its datetime component through pandas, say just the year. Look at that: it's super helpful to know that when you're working with datetime types, as we have here, you can use the .dt accessor to get all sorts of specific date properties, like .dt.year and .dt.month; a lot of very useful stuff. So, to make our lives easier, how about orders["order_year"] = orders["order_date"].dt.year, being nice and descriptive with the name, and then orders["order_month"] = orders["order_date"].dt.month; that should be a single equals sign, since it's an assignment.

Now we can filter: march_orders equals orders where the order year == 2019 and the order month equals March. Maybe let's print out the order month real quick, just to see what it looks like and make sure we're structuring this correctly. We actually want == 3, and good style here would be to set a variable called march to 3 and compare against that, just so we don't have a floating magic number. So we delete this line, uncomment that, and compare == march. That should be good.

Oh, did I not close this? I think I need to surround these conditions in parentheses; it's a kind of weird, niche thing about having multiple conditions within a filter in pandas. Let's try running that again: march_orders looks good to me. Now we group by the customer id, sum the value, and reset the index. I'll run that real quick so you can see what it gives us: we see these total order costs, which is what we care about, but we still want to see the customer id, and reset_index will put the customer id column back into our table. Then at the end, we take just the customer id and the total order cost.

Oh, actually, the last step was to sort the values as well: march_totals.sort_values("total_order_cost", ascending=False) to make it descending; the ascending argument should be familiar from one of the previous problems. Save that as final_result... that looks good to me. The final thing is to check the solution. We got it, nice! It wasn't too bad, pretty straightforward. The key insight here was to be familiar with datetime objects in pandas. And if you ever have a column that should be a date but isn't letting you access .dt.year and friends, the command you're going to want is pd.to_datetime, I believe; let me double-check that. Yeah, this function is super useful whenever you need to convert a column into the datetime type, so you can use all the nice, fancy things datetime objects offer you in pandas. Cool, that was a good little problem. Let us keep going and do a difficult question now; there are not many free
hard questions, so it's slim pickings, but I think we'll find something good. With these hard ones, why don't we start at the top and do this one from Amazon, called "Monthly Percentage Difference"? We're going to be using dates again, which is nice: we keep exercising some of the knowledge we've already used. Given a table of purchases by date, calculate the month-over-month percentage change in revenue. The output should include the year-month date and the percentage change, rounded to the second decimal point and sorted from the beginning of the year to the end of the year.

All right, let's see what we're working with, because it isn't really clear to me from this description exactly what the output format should look like; that's one thing I would clarify if I were given this question in an interview setting. We're going to have to take the months here, so we can grab the date and then the month component, like we did in the last problem. One question to ask is: what are the unique years? Do we only care about 2019, or will we see other years as well? I think this is a small enough data frame that we can see everything... yeah, it's only a hundred-ish rows, and it's all 2019.
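That unique-years check is easy to do in code rather than by eye; here is a sketch on invented rows, with the created_at column name assumed from the walkthrough:

```python
import pandas as pd

# Toy stand-in for the transactions table (column names assumed).
sf_transactions = pd.DataFrame({
    "created_at": pd.to_datetime(["2019-01-05", "2019-06-10", "2019-12-31"]),
    "value": [100, 200, 300],
})

# Instead of eyeballing ~100 rows, ask pandas which years actually appear.
unique_years = sf_transactions["created_at"].dt.year.unique()
print(unique_years)
```

If the column were strings rather than datetimes, a pd.to_datetime conversion first would make the .dt accessor available, as noted in the previous problem.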
Because we have a smaller table for the sake of this interview question, we can just check that manually, but this is the type of thing that would be good to sanity-check in code as well. Okay, so for this problem, again, I like taking a complex, tough question and breaking it up into smaller, easier pieces; it makes everything a lot more manageable. The steps I'm thinking about here: we need to group by the month, so we first need the month as a column. So: create a month column; group by month and sum the revenue; then sort the months in ascending order, just to get January, February, March; and then take the difference in revenue. Now, that last part is where it gets tricky. The easy solution would be to use a for loop or something, but how can we do it in a more Pythonic, more pandas way? That's what I'm going to have to figure out on the go here, but at least we can start with the first items.

Can I shorten this name anyway? We'll see if we need to. So: month equals sf_transactions... scrolling down, created_at is what we care about here, so created_at's datetime month. That's the "create month column" step. Now we group by the month: sf_transactions group-by on the month, sum the values, reset the index. This is coming from memory from the last problem; I'm just following the same steps, and you see them over and over until they become a lot more natural. Then we take the month and the value as our result. We could then sort the months in ascending order, and I'd save this into something like monthly_revenue, because sometimes you run a command and it works, but you don't save the result as a new variable, and then you run into issues. Actually, let's just see if monthly_revenue is already sorted by itself... oh, look at that, it's already sorted, so we don't actually need to worry about that.

But now, how do we take the difference? Well, one nice thing is that there's a command in pandas just called diff. I'm going to create a new column called something like value_difference, which is decently descriptive, equal to monthly_revenue diff; I believe that's the command. It takes the first row and finds the difference between it and the second row, and so on down the frame. I think we'll probably have to specify that this is for the value column; otherwise it would try to diff the month column too, and that would get all funky. Let's see what monthly_revenue looks like... okay, this is super useful. What we want is this month's revenue minus last month's revenue, divided by last month's revenue, times 100. This could also be done with a shift, where you basically shift all the values down by one; I'm trying to think which is more useful. We're already getting the numerator in our value_difference calculation, so we just need to divide by last month's revenue, which is the reverse of that, and then multiply by 100.
So I might also, real quick, do last_month_revenue equals monthly_revenue value minus monthly_revenue value... oh sorry, minus value_difference; that's another way to get it. Instead of the diff command, there are also ways to get a shifted column: you have all the values in a column, so, you know, January has this value, February has this value, and you could take that column, duplicate it, and shift it down by one. I just don't remember that functionality off the top of my head, and I feel like I'd get messed up trying to use it, so I'm going about it a different way. Both ways are valid, and in an interview setting I would explain what I'm doing while I'm doing it, like this. I think this approach is very straightforward in how it works, which is nice: you want your code to be readable and easy to follow. Let's see what this looks like... all right, we get the previous values, so that looks good. Now, finally, we want percent_change, which equals the monthly_revenue value_difference divided by the monthly_revenue last_month_revenue (splitting it up just so you can see it neatly on one line), times 100.
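For reference, the shift-based alternative mentioned here looks something like this on toy numbers; it produces last month's value directly instead of recovering it from the diff. The frame is invented and assumes monthly totals are already sorted:

```python
import pandas as pd

# Hypothetical monthly revenue totals, one row per month, already sorted.
monthly_revenue = pd.DataFrame({
    "month": [1, 2, 3],
    "value": [100.0, 110.0, 99.0],
})

# shift(1) moves the value column down one row, so each row sees last
# month's revenue; the first row becomes NaN since it has no prior month.
last_month = monthly_revenue["value"].shift(1)
monthly_revenue["percent_change"] = (
    (monthly_revenue["value"] - last_month) / last_month * 100
).round(2)
print(monthly_revenue)
```

Both this and the diff-based route compute the same month-over-month percentage; diff gives you the numerator, shift gives you the denominator, and either can be rearranged to produce the other.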
I think that's exactly the command — yeah, that seems right. And then let's see, what does monthly revenue look like? Oh, and we still need to round to two decimal places; I often forget exactly how to do that. So the last steps are really to get the date into this year-month format and round the percent changes to two decimals. How do we want to do that? Let me do date_formatted equals — there's a nice function called strftime ("string format time") that you can call on a date object, so I'm going to use that. There's specific syntax that goes with it, so I'll just pull up the reference. As you can see, I remember the gist of what I need to do, but I don't remember everything — and that's usually how interviews work: they might give you this as a screening question where you can use your resources as needed, or as a whiteboard-type question, in which case they wouldn't expect you to know the syntax exactly; they'd mostly be making sure you're thinking in the right manner. Okay, in the pandas strftime reference: %Y is the year with century as a decimal number — that looks good. This function is very useful to know, because in so many situations you want dates formatted a specific way. So what we need is %Y dash — month; I was thinking of day, which is %d (lowercase d), but month is %m, so the format string is %Y-%m. I shouldn't have closed the docs so quickly. That creates the column, so the final output will be date_formatted plus the percent change column — and we still need to round percent change to two decimals. A quick search: round to two decimal places, python pandas — it's another formatting-type thing, slightly different though. Oh, look at that, there's a round function — nice — and the argument is the number of decimals. Perfect, easy enough. So I'll say monthly revenue percent change equals whatever the percent change was, .round(2)... "Series object has no" — hmm, that looks weird, am I doing something wrong here? I'm grabbing the right column. What I needed to add was the .date, and I needed to make sure we included date_formatted instead of month, because that's what really matters once we actually group by month. "date_formatted not an index" — let's use some handy-dandy printing, because I'm not sure why I'm getting this error. What happens if I reset this? Why is date not an index? Let me try specifically calling this result, or dates_formatted equals this — very weird, very weird, why is date not an index? Ah — because you can't sum the date; that's the issue here. We can just group by the date column and then take date and value. "Not an index" — what the heck — okay, that was an error further down. All right, that was frustrating; I don't know what happened there, funky stuff. One thing that's a little tricky about this platform is that you don't get super detailed line-by-line error messages, but I think we worked it out, and this all looks good to me. Come on, baby, let's see if this is right. So what did we learn? The trickiest thing is that when we ran the sum function after the group-by, it can't sum a date-like value — it just didn't work — so we couldn't access that field when it wasn't the field we were grouping by. I knew month was working, so I tried to ignore the rest, but we ultimately ran into an error there.
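Putting the two formatting pieces together — `strftime` for the year-month string and `round` for the percentages — a minimal sketch (column names assumed, since the real data comes from the platform) looks like:

```python
import pandas as pd

# Sample data standing in for the platform's table.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2019-01-15", "2019-02-03"]),
    "pct_change": [12.3456, -7.891],
})

# .dt.strftime formats a datetime column:
# %Y = year with century, %m = zero-padded month.
df["date_formatted"] = df["created_at"].dt.strftime("%Y-%m")

# Series.round(n) rounds to n decimal places.
df["pct_change"] = df["pct_change"].round(2)
```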
Okay, we got the solution working properly. It wasn't too bad, but it was frustrating that we ran into those issues — though that's the real world, that's what happens. The more of these problems you do, the better you get, and hopefully you won't run into those errors in a high-pressure interview setting. Let's do one more hard question before we move on to non-coding questions. This one is "Premium vs Freemium," from Microsoft: find the total number of downloads for paying and non-paying users by date, including only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain three columns: date, non-paying downloads, and paying downloads. So we have a lot going on here: some filtering, and definitely some merging of data frames — we have three data frames — and you're going to have to make sure that for a given date you've merged together both the paying and non-paying accounts. We'll probably also have some group-by and summing, because we'll want to total the values across all the individual users, premium or non-premium, who downloaded that day — and yes, you can have multiple downloads per user, so there's a lot of information here. I think there are multiple ways to go about this, but we have three different data frames, and in order to link the date, whether a user is paying, and their downloads, we need to connect them through the user dimension. So this sounds like three joins, or merges, to get one combined data frame. If we run this code, we see we just have two columns, so I'm going to write df_all equals the ms user dimension .merge with the ms account dimension, on account id. Let's see what df_all gives us — that looks good too, three columns now. Then we also merge the ms download facts on the user id. I'll do this on a new line: df_user, I'll call this, and then df_all equals df_user .merge with that bottom table, the ms download facts. I like to do this first because it's nice to have all the information in a single data frame — you're not bouncing around. They both have user id, I believe — yes — so you can just do on equals user id. That looks good. Some key things here: we need to filter and group by the paying-customer versus non-paying-customer split, merge those together, and sum up the values. What I'm thinking makes sense is that we create new columns — paid downloads versus non-paid (unpaid) downloads — for every single row, and then those two columns basically give us the final data frame we're looking for. So instead of just having "downloads," we split it into non-paying downloads and paying downloads for every row. Obviously, one of them will always be zero: if you had six paid downloads, you'd have zero unpaid downloads. But this should give us an intermediate data frame that's easier to work with — that's why I'm thinking to do it. So how would we do this? First I'll set everything to zero for both the paid and unpaid columns, then change them based on filter conditions. We can try initializing both columns to zero in one statement — let's check — okay, that didn't work, so we'll split it into two lines. Now we have these two zero-initialized columns, and what we want to do is filter on paying customer == "no", and in that case set unpaid exactly equal to the downloads. I'm wondering if this will work — let's sanity-check whether we have any values over here. Nope, that didn't seem to work. I'm wondering whether filtering returns a copy rather than changing the original object — is it mutating here? We're still not getting any values, so I'm going to do this a slightly different way: instead of setting these to zero and filling in, let's set them to the downloads column and work in reverse. So: df_all.loc where paying customer == "no", set the paid column to zero; and on the flip side — making sure to assign back to df_all — where it's "yes," set unpaid to zero. That should fix the issue we were running into. Let's look at df_all.head — a line 11 issue; oh man, I'm trying to do too much at once and making very silly mistakes with the loc function. We need to use syntax like this. Okay, that looks good: where it's "yes," paid is set the right way. So there's definitely a way to do it the other way — set things to zero, then fill in the gaps — but it ends up being easier to fill in a constant zero when you filter on a condition than to filter and then fill in specific values from your full data frame. It's doable, but it's a little harder to walk through and explain, so I approached it this way instead. Now we want to group by. I could make a date format — we used strftime in our last example — something like %Y-%m-%d, in case the time-of-day has any impact. Actually, looking at the data, the time component is always zero, so it's not important, and reformatting might even mess up the final output comparison. All right: df_all, group by the date, sum the values, reset the index, and then take just date, paid, and unpaid — "daily values," I'll call this. Let's look at what daily values looks like — that looks good. The other condition was that the output should be sorted by earliest date first, which it is naturally, which is nice — but we only want to include records where non-paying customers have more downloads than paying customers. So the final result is daily values filtered to only the rows where paid is less than unpaid. That looks good to me — let's check the solution. Let's go, nice work! All right, cool — I think that's all the coding questions we're going to do, so let's move on to some non-coding questions and finish off the video with a few of those. In interview settings for data analyst and data science roles you're going to see all sorts of different questions — probability questions, statistics questions, high-level thinking questions — so we'll do a variety. Check the description of this video; I'll leave some links there.
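Here is the whole premium-vs-freemium approach condensed into a runnable sketch on toy data. The real tables come from the platform, so the column names here (`acc_id`, `paying_customer`, and so on) are assumptions based on the problem description:

```python
import pandas as pd

# Toy versions of the three tables; real column names may differ slightly.
users = pd.DataFrame({"user_id": [1, 2, 3], "acc_id": [10, 10, 20]})
accounts = pd.DataFrame({"acc_id": [10, 20], "paying_customer": ["no", "yes"]})
downloads = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "date": pd.to_datetime(["2020-08-24", "2020-08-24",
                            "2020-08-24", "2020-08-25"]),
    "downloads": [6, 4, 3, 2],
})

# Three tables -> one frame, joined through the user dimension.
df_all = users.merge(accounts, on="acc_id").merge(downloads, on="user_id")

# Start both columns as the downloads count, then zero out the wrong side
# with .loc -- the "work in reverse" approach from the walkthrough.
df_all["paid"] = df_all["downloads"]
df_all["unpaid"] = df_all["downloads"]
df_all.loc[df_all["paying_customer"] == "no", "paid"] = 0
df_all.loc[df_all["paying_customer"] == "yes", "unpaid"] = 0

# Total per date, then keep only dates where non-paying beats paying.
daily = df_all.groupby("date")[["paid", "unpaid"]].sum().reset_index()
result = daily[daily["paid"] < daily["unpaid"]].sort_values("date")
```

Selecting `[["paid", "unpaid"]]` before `.sum()` also sidesteps the "can't sum a date" error from the previous problem.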
As a first question, we can do this problem called "Credit Card Activity" — an easy-level problem from Visa. It asks: how can you identify where a person lives based solely on their credit card activity? I'd pause right here and try to think about the solution on your own, then resume when you're ready to hear my answer. Okay — this is a pretty straightforward question, I'd say, but in an interview setting you want to make sure you're not jumping to conclusions, so there are a couple of clarifying questions you'd want to ask before giving a solution. At a high level, if someone lives in a certain area, you're going to see purchases around that area on their credit card statement, right? You could basically draw a circle around where they actually live and most of their purchases will fall inside it — so you might take the averages of the purchase latitudes and longitudes, and that's roughly the position where they probably live. The clarifying questions to ask in an interview: first, how precise should we try to be? That's going to vary person to person. The other question, which is really important, is: what time window of credit card activity do we have for this person? A person might be on vacation — maybe for a month — making purchases all around their vacation location. Is that where they live? I guess it is for the vacation, but it's not their home. So we need a big enough window that, even if they go on multiple vacations, we can see them always coming back to the same home base. That's the crucial clarifying thing: how much of a window do we have? Say you have 12 months: if you see points clustered around a certain location for six of those months, it's safe to assume that's where they live, and you can do some averaging of latitudes and longitudes. You're not going to get precise coordinates directly from credit card activity, but you could reverse-search the business locations they shop at, pinpoint those on a map, and try to get a fairly precise value. So that's how I'd approach it, and the reason I say don't jump to conclusions is that they might be traveling — that's the important thing I'd want someone to consider before they shot out an answer in an interview. Next, let's do an IBM question, an easy one — this will be a statistics question. Okay: what methods would you use to detect outliers in a data set? At a high level, this is a fundamental component of what we do as data analysts and data scientists, so it's crucial that you have a sense of it and can speak to it for different types of data. The way I'd approach this in an interview: first off, the first step I take when working with an unfamiliar data set is to plot it. That might mean plotting coordinates on an x-y plane, or grouping values into bins and doing a bar chart or box plot to get a feel for the data. From that you can often already see outliers visually: if your points line up in a linear fashion and you fit a linear regression that's a pretty tight fit, but one point sits way up in no-man's-land, that's definitely an outlier, and you can see it visually. So visual inspection is the first approach I'd take, but then you also want to get more mathematical with your answer — the interviewer is probably going to follow up on specific components. Another thing I might talk about: oftentimes in data sets we see normal distributions — a bell curve with tails on both sides and the bulk of the mass in the central area, where most of the points fall. Mathematically, what an outlier would be in that case: you calculate the mean, you calculate the standard deviation, and points that fall, say, more than two standard deviations from the mean are your outliers. You could even go as far as talking about z-scores and z-tables, which are a more concrete way to work with normal distributions — beyond two standard deviations from the mean you're into roughly the outer 2% of each tail, so you can be quite mathematical about it. This is the type of question where they might follow up in a specific manner: okay, now we're talking about linear regression — how would you calculate an outlier there? Same type of thing: fit the line, then calculate how far each point is from that projected line using the math. That's how I'd approach this at a high level. Moving onward, let's do a probability question. This one is called "Probability of Having a Sister," and it's a medium-difficulty question from Google. We're given the following: assume the distribution of children per family is given in the table below — so, for example, there's a 30% chance of a family having zero children, a 25% chance of having one child, and so on. Now consider a random girl in the population of children: what is the probability that she has a sister? There's one assumption you'd want to clarify with the interviewer right away: is it a 50/50 chance that any given child is a boy or a girl — brother or sister?
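Quick aside before the probability math: the two-standard-deviations rule from the outlier question a moment ago is easy to sketch in pandas. Toy numbers, and note that `std()` here is the sample standard deviation:

```python
import pandas as pd

# Small numeric sample with one obvious outlier.
s = pd.Series([10, 12, 11, 9, 10, 11, 50])

mean, std = s.mean(), s.std()

# Flag anything more than two standard deviations from the mean.
outliers = s[(s - mean).abs() > 2 * std]
```

The same `(x - mean) / std` quantity is the z-score, which is where the z-table discussion comes in.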
That 50/50 assumption matters because the question doesn't say they have a sibling — it says sister specifically, and this is a distribution of children, so it's a key thing to keep in mind. Here's how I'd think about it: first, what's the probability that the family has a given number of children? And then, given that many children, what's the probability that one of the other children is a sister? This is classic Bayesian-style conditioning, so I'll type it up as I go. What we're looking for is P(sister), which equals the probability that she has a sister and there are zero children, plus the same term for one child, plus the same for two, and so on — I'll fast-forward as I type these in. Okay, so these are all the probabilities we want to consider, and the reason this works is that it captures the entire probability space, from zero children all the way up to five or more. Now let's use the table, plus some logical thinking, to fill in and calculate these values. If there are zero children in the family, the term is obviously zero — you couldn't have a sister in that case — so we can eliminate it for simplicity. If the number of children is one, then our specific random girl can't have any siblings, because she's the only child, so that term is also zero. We're eliminating things as we go. Jumping a bit farther forward: the table says the probability of five or more children is 0, so there's no chance of that case, and the term is zero for a different reason. That leaves just three cases to calculate: n = 2, 3, and 4. All right, on a new line. For n = 2: the probability that the family has two children is 0.2, and that gets multiplied by the probability of a sister in that case — which is just the probability that the one remaining child is a girl, one half, since the interviewer gave us the okay on the 50/50 boy-girl assumption. Plus, for n = 3: if there are three children, she has two siblings — two chances at a sister. The probability that at least one is a girl is the complement (not "compliment") of both being boys: both boys is one half times one half, one fourth, so the probability is 1 minus 1/4, which is 3/4. Plus, for n = 4: with four children she has three siblings, and the probability all three are boys is one half times one half times one half, one eighth, so the probability of at least one sister is 1 minus 1/8, which is 7/8. Now we fill in the values from the table: 0.2 times 1/2, plus 0.15 times 3/4, plus 0.1 times 7/8. It might help to make these all fractions: 0.2 is 2/10, 0.15 is 3/20 (since 15/100 reduces to 3/20), and 0.1 is 1/10 — this makes the calculation easier. So that equals 2/20 plus 9/80 plus 7/80; putting everything over 80, the first term is 8/80 (we multiplied top and bottom by 4), so we have 8/80 plus 9/80 plus 7/80, which is 24/80. Reducing — 8 goes into both — that's 3/10, which equals 0.3, a 30% chance. So that's my posted solution, and looking at the solutions from other users: 30%, 0.3 — we did it, nice! I love probability, it's fun. If you need a good book to learn probability, I really liked the textbook I used in university — this book right here; I'll share the Amazon link in the description. It covers the core principles of probability well without being too massive, and it has nice visuals and everything. Great book. Nice work on that problem.
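The arithmetic above is easy to double-check in code using exact fractions:

```python
from fractions import Fraction

# P(family has n children), from the table, for the cases that survive.
p_n = {2: Fraction(2, 10), 3: Fraction(15, 100), 4: Fraction(1, 10)}

# Given n children, our girl has n-1 siblings; P(at least one sister)
# is the complement of all n-1 siblings being boys: 1 - (1/2)^(n-1).
p_sister = sum(p * (1 - Fraction(1, 2) ** (n - 1)) for n, p in p_n.items())

print(p_sister)  # 3/10
```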
All right, let's do another problem. This one is called "Uber Black Rides." If you're not familiar with Uber, it's a ride-sharing service, and Uber Black specifically is the luxury tier — nice cars you can call to get to your destination in style. So if you're trying to ball out on a hot date, Uber Black might be the way to go — I never ball out like that, but that's just me. The question says: Uber Black rides have dropped 10% — how would you investigate this reduction? A question like this is so, so important in the data analyst / data science world, because it's not straightforward how you solve it — there are many different paths. In an interview setting, what the interviewer is looking for is how you think about a problem: can you ask the right questions to dive in and do useful, meaningful work to discover something about the business? This is very real-world, so it's very important to understand and practice a lot of these problems — if you have someone you can do these interview questions with, definitely do some mock interviews where you go back and forth. So how would I think about this problem? Feel free to pause and try it on your own before I go into it. All right, there are many different things I'd be thinking about. First: is it specifically Uber Black that's down? If all Uber rides are down — say they're all down 10% — then it makes sense that Uber Black would be down by a similar amount, for instance if there just aren't as many drivers out. So: is it specifically Uber Black that's down, versus all of Uber? Drilling down on that is very important — is it a relative decline, or a universal one across all rides? Other things I'd investigate: what time period are we talking about? If it's the past two or three years, with the pandemic, maybe rides are down because businesses are closed and there's just not as much need to take an Uber Black specifically. The pandemic would be an interesting focus, because maybe you still need to get places, but they're not the fancy bars and fancy restaurants you'd drive to in style — so the pandemic may well have an influence. Other things to consider: did anything change about Uber Black itself — any specific policy changes? When I say that, I'm thinking about things like: did they increase or decrease which vehicles qualify? If the business got stricter — say, instead of allowing cars produced within the last ten years, they now only allow cars made in the last five — then with that increased strictness you'd naturally have fewer eligible cars on the road. Also consider price: recent increases or decreases. On the increase side, if rides become too expensive, people aren't going to book them. On the other side, these are nice cars the drivers are driving, and if they're not making enough on Uber Black, it might not be worth it for them to even get on the road — they might work for Lyft or another service instead, or do private contracting. So think about those things as well. I'd bucket all of this under the umbrella of "topics to investigate," and I'd lay it out with the interviewer: ask clarifying questions, ask what assumptions you can make, maybe ask, "how much were total rides up or down during the same time period?" When I say "year," I'd really put "time period" in parentheses — you want a defined start and end. Ask these clarifying questions about how other factors moved, because we don't have the data — this is a non-coding question, asked in a person-to-person, face-to-face manner, so you're not expected to have exact approaches and exact details. The other thing you'd want to mention in this type of interview question is possible testing and investigation approaches: if you did have the data, how would you quantify these things, and how would you check whether your hypotheses were right or wrong? A few things I'd look at. Can we back-test on our data? Looking back ten years of history, what's the general trend of Uber Black rides — maybe they're up in the winter months when it's cold, and down in the summer months when you can just walk to your destination? I'd look at our old data and see whether it suggests which of our hypotheses might be true or untrue. I'd ask for data on prices and price changes — global increases or decreases, and so on. I'd ask for data on Uber Black qualification policies: if you have a timeline of events — okay, we let these cars in, then these, then at some point that was tightened — that could very much be a key indicator, and plotting those events alongside the overall number of Uber Black rides could help you see a relationship or correlation between the two factors. What other ways would I test this? I'd definitely be plotting Uber Black rides against other ride types and cumulative rides: that would help me see whether, for example, Uber Black was down while the total number of rides and regular Uber rides increased. These are the types of things where you start piecing together the different components, and that ultimately helps you tell a story about the data. I'd also be chatting with the business folks on the team, because they might have information you wouldn't realize from the data alone — so definitely work closely with the business people as well. Overall, these are the points an interviewer would like you to be discussing. There's really no right or wrong answer: it's how you're thinking about it, whether you consider different unique cases and situations, and whether you can take those and run with them. They might ask specific follow-up questions, so be prepared to back up what you're saying. Hopefully this all makes sense. All right, let's do at least one more question — maybe two, though it's been a pretty long video and I don't want to get too crazy. This next question is considered a hard problem, and it's from Capital One: how would you approach the task of processing a terabyte of data using a Python script? This is right up my personal wheelhouse, so I have a lot of thoughts on it; if you're not as familiar with Python it might be a little harder, but really try to hit the key components. Once again, you're not actually writing code for this, so you can't implement it — I think it's a matter of whether you're thinking about the problem in the correct manner. So how would I think about it? There are a few distinct concerns in processing a lot of data in a script: it's one thing to process the data at all, and another thing to process it in a time-efficient manner, and it depends on what type of data we're working with — image data is a different story than a data frame, a CSV with many rows of information. So the clarifying
questions to ask: what type of data is this — image, CSV, etc.? At a terabyte it would be a very large CSV, but is it a data-frame-type object? And what is the goal we're trying to accomplish with this data? If we need all of it intact in the same format, that's different from trying to aggregate and reduce the overall size through processing. I think those are two good clarifying questions, and it's hard to go much deeper without some answers from your interviewer. But let's think through approaches. If we're given images, we'll probably want to downsample and downsize them — that's a standard high-level image technique — but I'm actually going to start even higher level than that. If we want to do this fast, I'll add one more clarifying question: what machine or machines are we working on? If we can distribute this across many different computers, that's important to know. It's also important to clarify what the most critical factors are — speed? size reduction (I mean size reduction of the data set)? simplicity of the code? If we don't care about speed as much, we'll take a different approach, and there's usually a trade-off here: if you want your code super-optimized and fast, you write more complex Python to get that speed; but if we didn't care, and we only ever had to run it once, it could be fine if it took 12 hours, as long as the code stays simple and nice to use. That's a trade-off to consider. These are good things to be thinking about and talking through with your interviewer — it shows them you're not jumping in too quickly, which is a recurring theme with all of these problems. Okay, so approaches. Given that we don't know the trade-offs, I'll list some high-level things. We're probably going to have to stream the data into memory — we can't load it all at once. For a specific example in Python pandas: if you were trying to process a terabyte of data, you would break it into smaller chunks — maybe chunks of 200 megabytes, say 200,000 rows, something that fits more easily into memory — taking a small amount, processing it, and moving on to the next. Another crucial thing when you're trying to speed up code and process a large amount of data is parallelizing (I can never spell that word) — making the most of your machine. If you have 16 cores, you want to be running a worker on every one of them; with eight cores, you can run eight threads in parallel at one time, and you want your Python code to make the most of that. Then, beyond one machine: if you have multiple machines, you can split up the data and send a portion to each one — distribute among multiple machines. All of this is based on the assumptions we're given by our interviewer: if we only have a single laptop, then streaming the data into memory in small pandas chunks, along with parallelizing, is the approach; but if we have ten machines to work with, let's use all of them.
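The chunked-streaming idea is the one piece that's easy to show concretely. A minimal sketch, using an in-memory buffer to stand in for a huge file on disk:

```python
import pandas as pd
from io import StringIO

# Stand-in for a terabyte-scale CSV; in real life you'd pass a file path
# and pick a chunksize (rows per chunk) that fits comfortably in memory.
big_csv = StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize makes read_csv return an iterator of small DataFrames,
# so only one chunk is ever held in memory at a time.
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["value"].sum()  # aggregate per chunk, then combine
```

For the parallel side, each chunk could be handed to a `multiprocessing.Pool` worker instead of being processed inline — that's where the cores-and-threads discussion comes in.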
massive images, then you might need to process them together, so that's another thing to clarify. All right, other things to think about with this task: I think we covered a lot here, but you might be able to get into more specifics within the parallel processing, and laying out as much relevant information as you can is always good. For example, one thing that will be important is: do you need to lock certain variables? If one thread is processing a certain piece of memory that another thread depends on, do you have locks in your parallel code? You don't want to run into what are known as race conditions, and these are other important things to consider. If you're curious about this specific component, how to process large data files with Python pandas, I actually did a video about this on my second channel, so I'll link that in the description and also pop it up either here or here. I think this covers a lot of the key aspects of a question like this, so hopefully you have a sense of how you would approach it from hearing me talk through this type of problem. Because the question said specifically "Python script," I stayed fairly vanilla Python, but you might also talk about different libraries, frameworks, and other tools in the Python ecosystem that are very helpful. You also might talk about databases and where you're saving this data; those are other things to consider. So, a lot of clarifying questions in this type of task, but hopefully this helps you understand the key components. All right, I think that's all we're going to cover in this video. Hopefully you all found this helpful; if you did, be sure to throw this video a big thumbs up, and also subscribe if you haven't already. If you want to see more videos like this, please let me know in the comments. You also can
check out my second channel, Tech Trek by Keith Galli, where I've posted a few more examples. Once again, I want to say thank you to Brilliant for sponsoring this video. If you want to learn data structures, algorithms, and other computer science topics, be sure to check out the link at the top of the description, brilliant.org keith galli, and sign up for free and potentially get 20% off. If you want to stay up to date on everything that I'm up to, be sure to check out my Instagram and Twitter at Keith Galli; I've been posting a lot of fun pictures on Instagram recently, so I'd love to have you all join along for that. All right, I think that's all I have, so until next time, peace!
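The chunked-streaming idea discussed in the transcript can be sketched in pandas. This is a minimal illustration, not the video's own code: the `value` column and the per-chunk work (a simple sum) are hypothetical placeholders for whatever processing the real task needs.

```python
import pandas as pd

# A minimal sketch of chunked streaming with pandas. The "value" column
# is a hypothetical example; real per-chunk work might clean, filter,
# or aggregate instead.
def process_chunk(chunk):
    # Placeholder per-chunk work: sum one column of this chunk.
    return chunk["value"].sum()

def process_file(path, chunksize=200_000):
    total = 0
    # read_csv with chunksize returns an iterator of DataFrames, so only
    # one chunk is ever held in memory at a time.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += process_chunk(chunk)
    return total
```

The key point is that `pd.read_csv(..., chunksize=N)` returns an iterator rather than one giant DataFrame, which is what lets a terabyte-scale file be processed on a machine with only a few gigabytes of memory.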
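The parallelization point can also be sketched with Python's standard library. One caveat worth noting: for CPU-bound pure-Python work, the GIL limits what threads can do, so in practice you would often swap `ThreadPoolExecutor` for `ProcessPoolExecutor` with the same structure. The `process_part` work function here is a placeholder assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_part(part):
    # Placeholder work on one slice of the data.
    return sum(part)

def process_in_parallel(data, n_workers=None):
    # Default to one worker per core, as suggested in the discussion.
    n_workers = n_workers or os.cpu_count() or 4
    # Split the data into roughly equal slices, one batch per worker.
    size = max(1, len(data) // n_workers)
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map fans the slices out across the worker pool.
        return sum(pool.map(process_part, parts))
```

This is the same "eight cores, eight workers" shape described above; distributing across multiple machines follows the same split-and-send pattern, just with a framework that moves the slices over the network.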
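The locks and race-conditions point can be illustrated with `threading.Lock`. A sketch under the assumption of several worker threads updating one shared counter; without the lock, the read-modify-write inside `add` would be exactly the kind of race condition the discussion warns about.

```python
import threading

class SafeCounter:
    """A shared counter guarded by a lock."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # Without the lock, two threads could read the same old value
        # and one increment would be lost (a race condition).
        with self._lock:
            self.value += amount

def run_workers(counter, n_threads=8, per_thread=1000):
    def work():
        for _ in range(per_thread):
            counter.add(1)
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With the lock in place, eight threads each adding 1 a thousand times reliably lands on 8000; remove the `with self._lock:` line and the final value can come up short in a nondeterministic way.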
Info
Channel: Keith Galli
Views: 109,278
Keywords: Keith Galli, python, programming, python 3, data science, data analysis, python programming, faang, interview questions, data science interview, data analyst, become a data scientist, real world data science, stratascratch, python pandas, pandas, regular expressions, regex, dataframes, dataframe, groupby, learn to code, data, data engineering, faang interview questions, tech interview, tech, data science practice, machine learning, AI, ML, google interview, facebook interview, amazon
Id: cc0HOiKN_ac
Length: 107min 50sec (6470 seconds)
Published: Tue Jul 26 2022