Data Scientist Answers Interview Questions

Video Statistics and Information

Captions
Hello everyone, Ajay Halthor here, and thank you all so much for 20,000 subscribers. To commemorate, let's go through 100 data science interview questions.

What are some of the main steps for data wrangling and data cleaning applied before machine learning? Before anything else, it's important to take a step back and talk to your stakeholders: understand what problem they want you to solve. Sometimes they don't even know what you can do with the data, so it's your job to communicate what you can do with it. Once you have a goal in mind and you know exactly what your stakeholders want, you do some exploratory data analysis (EDA). This involves swimming through your data and picking apart every indicator that could be indicative of the outcome objective. You can throw up a bunch of plots, or you can create a small linear model with a bunch of independent variables that you think would be useful plus your dependent variable, and see how they correlate with each other. Although it looks like you're training a model, this is still a very large part of exploratory data analysis, because it helps you understand the relationships between variables. Your model is really only going to be as good as your data, so you want to make sure that you spend enough time doing EDA.

How do you deal with unbalanced binary classification? This depends on your problem. I would typically go for an undersampling approach, but there are also oversampling approaches.

What is the difference between a box plot and a histogram? They differ in how they look, but I actually prefer box plots simply because you can represent a large amount of data in a small space without overlapping. With histograms you could represent multiple distributions by overlapping them, but it becomes pretty crowded. For example, if you're asked to plot the total number of orders that every person has made in every single month, you can show 12 distributions, each one a box plot for a month, sitting next to each other, as opposed to 12 histograms that overlap each other and are difficult to read. On their own, though, they're both pretty good visual tools.

Describe the different types of regularization methods, such as L1 and L2. L1 regularization acts as feature selection; L2 regularization also penalizes coefficients, but not to the extent of zeroing them out.

What is cross-validation? In modeling, and specifically in supervised learning, you need to estimate the parameters of a model, which is done through training. However, there are other parameters that you need to set manually; these are hyperparameters, and to find good values for them you can use cross-validation techniques. I would typically do a grid search over hyperparameters: a grid search essentially tries out different values within certain ranges and chooses the set of hyperparameters that gives you the optimal minimum. Grid search takes time, but you can also do this manually and try to figure out which hyperparameters work best to minimize the loss.

How do you define a set of metrics? This is a very business-oriented question. Of course we can use metrics such as precision, recall, and accuracy, but these are generic metrics used for machine learning problems in general. As you get into a specific business, there are certain things that you want to optimize; it could be a function of precision and recall or something completely different. You would try to pick your features and model the problem according to whatever you need to optimize.
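As a quick illustration of the grid-search-with-cross-validation idea above, here is a minimal sketch. It assumes scikit-learn, its built-in breast cancer dataset, and an arbitrary parameter grid; none of these come from the video, which does not prescribe any particular library.

```python
# Hedged sketch: grid search over L1/L2 penalty and regularization strength,
# scored with 5-fold cross-validation. Dataset and grid values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],   # inverse regularization strength
    "penalty": ["l1", "l2"],        # L1 can zero coefficients out, L2 only shrinks them
}

# Every combination in the grid is trained and validated on 5 folds; the
# combination with the best average validation accuracy wins.
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```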
Explain what precision and recall are. Precision: of the people you said got the virus, how many of them actually got the virus? Recall: of the total number of people who got the virus, how many of them did you say got the virus?

Explain what false positives and false negatives are, why the difference matters, and give examples of when false positives are more important than false negatives and vice versa. Somebody who doesn't have the virus, but your model says they have the virus: that's a false positive. A false negative is when a person has the virus but your model says they don't. Which one is more important depends on the situation. In this virus example, if a person has the virus and is flagged as not having it, that can be extremely dangerous, so a false negative is much worse. In a court of law, where the positive class is guilty and the negative class is innocent, an innocent person being marked as guilty is worse than a guilty person being flagged as innocent.

What is the difference between supervised and unsupervised learning? Give concrete examples. In supervised learning you have training examples with labels, and you train a model accordingly by giving it both the question and the answer; support vector machines, k-NNs, and random forests are examples. In unsupervised learning you only have examples with no labels, so the model needs to learn on its own from the interactions and patterns within the data; clustering is an example of unsupervised learning.

Why would you use a random forest versus an SVM? I would typically use a random forest classifier over an SVM in general, since random forests are more interpretable. When you're working on a business problem, you need to make sure that your variables are interpretable: there needs to be a direct relationship between your input variables and your output label. Without that, you're building a model but you can't really assess the importance of one feature over another, and that kind of becomes pointless. I would also use extreme gradient boosting, because it's just much faster than an SVM; an SVM with its kernelization can become pretty convoluted.

Why is dimensionality reduction important? I would typically use it to bring a large number of features down to about two or three dimensions so that you can visualize what's going on. However, it's important to note that the reduced features can lose their meaning, and interpreting them is not the same as looking at the original numbers yourself. But in a lot of cases it really does help with visualization.

Why is naive Bayes so bad, and how can you improve spam detection algorithms that use the naive Bayes algorithm? Naive Bayes is considered naive because of its assumption of conditional independence. That means that if you build a spam classifier using naive Bayes, it will basically treat every word in a given email as independent of every other word, which we know is not how English works; English is a bunch of grammar and tokens that interact with each other, so the assumption oversimplifies the problem. How would you improve a spam detection algorithm that uses naive Bayes? They say you can decorrelate the features so that the assumptions hold true, but I just wouldn't use naive Bayes for this.
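To make the precision and recall definitions above concrete, here is a small worked example with made-up virus-test labels; the numbers are purely illustrative and scikit-learn is an assumption.

```python
# Hedged sketch: precision and recall on a toy "virus" example.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = has the virus, 0 = does not have the virus
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 people actually have it
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # one missed case (false negative), one healthy person flagged (false positive)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the people we said have the virus, how many actually do?
precision = tp / (tp + fp)   # 3 / (3 + 1) = 0.75
# Recall: of the people who actually have the virus, how many did we flag?
recall = tp / (tp + fn)      # 3 / (3 + 1) = 0.75

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```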
What are the drawbacks of a linear model? I guess the biggest one is simplicity: you won't be able to capture the patterns in a very complex problem using something like linear or logistic regression. I will say, though, that they are good for doing some initial EDA and trying to understand how certain features are correlated with each other.

Do you think that 50 decision trees are better than one large one? Why or why not? This depends. Let's say that we have a decision tree and you train it on 100 samples, and we have one test sample that we feed in to get an output. Now suppose we keep randomizing the inputs: we keep the same test value but train the decision tree on a different set of 100 inputs every single time. The first time you get one value for the test sample, the second time you get another value, and this value could be jumping around a lot; in other words, it has a very high variance. To decrease that variance, you take 50 decision trees, feed the test sample to all of them, get 50 outputs, and take the average as your final answer. This value will have a much lower variance than the output of a single decision tree, so it could lead to higher performance. On the other hand, suppose hypothetically that a single decision tree doesn't give you a very high variance in values: even if you were to train it on different sets of samples, its prediction at that test point would not vary much. In this case there's really no need to use 50 predictors, because all 50 predictors are just going to predict the same thing; whether you do it once or 50 times, it's just slower and you're getting the same answer anyway.

Why is the mean squared error a bad measure of model performance, and what do you suggest instead? The mean squared error is not very robust to outliers. For something like this you can use the absolute loss, which is much less sensitive to outliers, but there are some cases where, if you have two cohorts of data points, even the L1 or the L2 loss is not really going to work in your favor; there you can use something called the pseudo-Huber loss, which is a combination of both of those losses.

What is collinearity and what do you do to deal with it? How do you remove multicollinearity? It is pretty common in machine learning that two of the predictors you are using are very highly correlated. The effects of this on larger models are actually not too bad: you can have multicollinearity and your model will still give good results. However, it is good practice to minimize it as much as possible; if you have two predictors that are doing essentially the same thing, you can remove one of them. In most modeling techniques, using two very collinear features doesn't necessarily destroy your model, unless it's something like linear regression. Removing multicollinearity is pretty interesting because in a business setting all your input features are more than likely tangible things that have some meaning, so I would create different charts of what each of those features represents, perhaps look at their distributions and see how they vary over time. If you see that they have very similar patterns throughout, then you might as well omit one of them.
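The 50-trees-versus-one argument above is easy to check numerically. Below is a rough sketch under assumptions not in the video: synthetic sine-plus-noise data, scikit-learn's BaggingRegressor (whose default base learner is a decision tree), and 200 repeated draws of the 100-sample training set.

```python
# Hedged sketch: the prediction of a single decision tree at one fixed test point
# jumps around as the 100 training samples change; averaging 50 bagged trees
# gives a much lower-variance prediction.
import numpy as np
from sklearn.ensemble import BaggingRegressor   # default base learner: decision tree
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_training_set(n=100):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)   # noisy target
    return X, y

x_test = np.array([[1.0]])   # the single test sample we keep feeding in

single_preds, bagged_preds = [], []
for _ in range(200):                      # 200 different training sets
    X, y = sample_training_set()
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    bagged = BaggingRegressor(n_estimators=50).fit(X, y)   # 50 trees, outputs averaged
    bagged_preds.append(bagged.predict(x_test)[0])

print("prediction variance, single tree:", np.var(single_preds))
print("prediction variance, 50-tree average:", np.var(bagged_preds))
```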
What is a random forest and why is it good? A random forest is a collection of decision trees that takes the average of the outputs of multiple decision trees as its overall combined output, and this is good because it reduces the variance of your output, like we've talked about before.

What is a kernel? Explain the kernel trick. I tend to shy away from these kinds of questions because they're not very representative of what you would actually be asked in an interview. In fact, I don't really use SVMs much at all at work; I tend to use a variant of gradient boosting instead, because it's just a lot faster and your features are also more interpretable, which is more important than anything.

What is overfitting? When your model starts memorizing instead of generalizing.

What is boosting? Boosting is more of a concept and doesn't just apply to decision trees. The idea is to combine a bunch of weak learners in order to make a strong learner. I have an entire video on this with a cool story, so I think you should watch it.

And that's it. I hope you enjoyed this little data science interview stint. I might do more of these in the future, hopefully with a better setting. Thank you all so much for subscribing and watching, and I will see you all in the next one.
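Since the boosting answer above is only a one-liner, here is a minimal sketch of weak learners being combined into a strong learner. The moons dataset, AdaBoost (whose default weak learner in scikit-learn is a depth-1 decision stump), and the estimator count are assumptions for illustration, not something the video specifies.

```python
# Hedged sketch: one decision stump is a weak learner; boosting 200 of them,
# each focusing on the mistakes of the ensemble so far, yields a strong learner.
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single depth-1 stump can only split once on one feature: a weak learner.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("single stump accuracy  :", stump.score(X_test, y_test))

# AdaBoost's default weak learner is a depth-1 stump; 200 of them are combined.
boosted = AdaBoostClassifier(n_estimators=200).fit(X_train, y_train)
print("boosted stumps accuracy:", boosted.score(X_test, y_test))
```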
Info
Channel: CodeEmporium
Views: 5,496
Keywords: Machine Learning, Deep Learning, Data Science, Artificial Intelligence, Neural Network, interview questions, machine learning engineer interview, data science interview questions
Id: TGSBXcMwTk8
Length: 12min 43sec (763 seconds)
Published: Mon Mar 30 2020