40 Real Data Architect Interview Questions & Answers - Part I

Video Statistics and Information

Captions
Hi everyone, welcome to the first of the four-part data architecture interview series. This video will focus on 10 generic data architecture questions. Please note that this series does not contain any theoretical questions. This series comprises real questions that are actually asked in corporate interviews. No good interviewer genuinely interested in getting the right candidate would ask you theoretical textbook questions. Additionally, I will also be telling you what the interviewer's intent is in asking a particular question and how to tackle it in the right way. So let's get started.

Our very first question is: how do you improve an existing data architecture? The intent of asking this question is to understand your maturity level as a data architect and to level-set whether you are able to find pain points and propose the right solutions. With the help of this question, you are being tested on whether you are able to think on strategic lines beyond designing and data modeling. So what is the best way to answer this question? The best way is to get the basics first. How would I answer this? I would start with interviewing stakeholders, usually system users, to understand their pain points, followed by documenting their problems and gaps, and finally proposing a solution for each problem. These three are the pillars of any improvement project. But this answer alone won't suffice; one has to tie this to data architecture. You can now demonstrate your knowledge to the interviewer by stating examples of actual scenarios that you have encountered. I will list some of the examples that I have encountered myself.

Say a pain point is that the users are unable to trust the data. The problem could be bad data, which can be proved with the help of a negatively trending data quality metric. What's the solution that you can propose in this scenario to enhance the architecture? Fixing data issues and strengthening your data quality checks. Additionally, you can propose revisiting the data governance strategy, since it's still causing bad data.

Next, if a pain point is that the users or downstream systems don't receive your data in time, then the system may be dealing with processing delays. Consider real-time processing only if there is a use case fit, or consider modernizing the ETL or technology stack. I came across a scenario with a U.S.-based case management company, while evaluating their data architecture, who still had their entire ETL set up with stored procedures, and it was not able to handle the load. Hence modernizing their ETL stack was a good approach for them.

Next, if a pain point is that it takes forever for engineering teams to add a new data set, then most likely your data model is not flexible enough. This can be fixed either by redesigning your data model to make it more flexible or by considering a NoSQL schema or database, but only after considering all the use cases.

Finally, if users are unable to make informed or quick decisions just by querying your database, then you are probably missing an effective BI component. Consider data visualization.

Please note this is usually the first project assigned to the majority of newly recruited data architects. It's difficult to get a new system designed for a new hire, hence the improvement project lets you prove yourself and make a mark. That's the reason this question becomes very important.

The next question is: how do you choose a data lake versus a data warehouse for your solution? This is the first of many "how do you choose" type of questions that you will come across in this video. The intent of asking this particular type of question is to understand whether or not you are able to understand the subtleties between different types of data systems and are able to take a decision on which one to choose. So the best way to answer this question is to focus on scenarios without getting into explaining what a data lake is or what a data warehouse is. Keep it crisp and to the point.

So how would I answer this? I would choose a data lake in the following scenarios. When there is no defined use case and the usage basically depends upon the creativity of the user. This means, in a sense, data is oil, and exploration can lead to more and more value and ROI. Next, for scenarios where speed to innovate is crucial and it can provide you with a competitive advantage. Another scenario is for super complex analytical use cases where a data warehouse might not work. These are typically next-gen use cases based on artificial intelligence, the Internet of Things, etc. Another scenario is when the system is expected to scale a lot, that is, many source systems across diverse data types need to be integrated quickly. And finally, the most important scenario is when semi-structured or unstructured data is prominent.

On the other hand, I would choose a data warehouse if there are proper predefined use cases, if data governance aspects are more important due to various reasons, if my primary user base is line-of-business users or business analysts, and finally, if structured data is more prominent.

Moving on, the next question is: can you explain the different types of data stores used in organizations? Okay, the intent of the interviewer here is to check if you are able to identify or classify a type of system appropriately during your data architecture work. So we can tactfully answer this question to show our knowledge of the various systems that we know and where exactly they fall in a data supply chain.

Let's start with two of the not-so-famous ones: the normalized data store, also known as NDS, which is a data store built using normalized data modeling, and the dimensional data store, or simply DDS, which is built using dimensional modeling. A data mart is a popular store which is considered a subset of a data warehouse and is domain or subject oriented, that is, it caters to a single business. It is usually a dimensional store and is at times also known as a dimensional data mart. Now, these three systems can be a data warehouse. So here you have explained to the interviewer the types of systems that are typically used as a data warehouse.

Now there are some other systems as well. OLAP is an analytical database which is used for the specific purpose of complex analytical calculations. For even more complex analytics needing multiple dimensions, a multi-dimensional system, also known as an MDB or simply a data cube, can be used. It is primarily used for business intelligence, maths and engineering purposes. These systems cannot be a data warehouse; in fact, these are outputs of a data warehouse.

Moving on to the origination side of things, we have what is called an OLTP system, which records transactions. It's usually a data capture system capable of recording a number of concurrent transactions. In addition, we can also have an operational data store, or an ODS, which contains the latest data from multiple sources, typically used for reporting. These types of systems are usually systems of record. Now, a few organizations consider an ODS as something that can be a data warehouse; most organizations don't. For learning purposes, we will also not be considering an ODS as something that can be a data warehouse, simply because an ODS doesn't store history, and by our definition of a warehouse, a data warehouse must store history.

Next up is: how do you choose a relational versus a non-relational database for your solution? Okay, this is one of the most asked questions. The intent here is to check, as an architect, if you are able to make a design decision when it comes to the most important aspect: whether to choose RDBMS or NoSQL. Again, the best way to answer such questions is to get into the scenarios right away without getting into definitions or theory. One would choose a relational database when structured data is more prominent, when data integrity and consistency are important, and when the system usage can be predicted and is not volatile. Okay, on the other hand, one would choose a non-relational database when semi-structured or unstructured data is more prominent, when a flexible data model is desired, or, the biggest decision-making factor, when the scalability requirement is the key. This is because a relational database won't scale beyond a point.

The next question is: how do you choose batch versus real-time processing for your solution? Another "how do you choose" type of question; don't worry, it's the final one in this part. These are two data integration patterns widely used in data systems. Well, the answer to this majorly depends upon use cases, along with some other factors. One would choose batch for use cases that don't require processing to be done urgently, like bill generation, end-of-the-month reporting, payroll generation, etc. Additionally, if large volumes need to be processed efficiently, batch works best as compared to real time. Finally, since batch is cost effective, if cost is a factor, go for batch. On the contrary, one would choose real-time processing for cases like ATMs, real-time insights, etc. If speed to process the data is important for your business plan and can make a breakthrough, then go for real time. Since it's expensive, consider it only if you have a real business benefit.

In the next two questions, we will focus on fundamental concepts of database engineering, the first question being: where would you use an index? A follow-up question, or another way of asking this question, could be: when does indexing fail? Okay, the intent of asking this and the next question is actually to understand if your fundamentals are clear. With modern data warehousing and big data engineering, many people are beginning to lose touch with the basics of databases. Please note that these concepts should be the foundation of any data architect, but they are often neglected.

So, to answer this: as you are aware, using too many indexes can be more harmful than not using indexes at all. Hence, choosing the right areas to put the indexes is critical. These areas can be columns in the WHERE and JOIN clauses of your important and frequently used queries. If a column is repeatedly appearing in your most important queries, it's a good candidate for indexing. Finally, it's all about choosing the right trade-off: would you want to improve query response times at the expense of overall system performance?

On the other hand, answering when indexing would fail: as discussed before, having too many indexes would cause indexing to fail. If indexed columns are not queried much, this adds to indexing overhead, thereby reducing overall system performance. And finally, if the columns with indexes have too many frequent insert, update or delete operations, that does not provide enough time for indexing to actually happen.

Continuing with another foundational question: how do you improve your database's performance? This used to be one of the favorite questions of many interviewers a few years back, and for some it still is. So how do we answer this? Let's start with the classic trio: CPU, memory and disk space checks. Additionally, there could be database locks or deadlocks being formed, reducing the overall performance levels. Another good way to increase your database performance is to tune or optimize your queries and to revisit indexes. An index may not be relevant over time due to changing business or user requirement queries. Next, you can run update stats, which might help. If nothing is helping, it's time to see if you are on the latest version of the database or not. If not, upgrade to the latest stable one.

Next up is a modern big data question: when would you choose a graph database? This question was asked to me by a large investment bank. The intent of asking this question is to understand if you have the ability to connect with modern ways of data engineering, and if yes, how deep. So one would typically choose a graph database if you have hierarchical data and multiple use cases requiring multiple self joins. Too many self joins would slow down the system, hence graph storage is recommended for such type of data. Second, if your data is so highly interlinked that it's difficult to find patterns with just SQL queries, graph databases come to the rescue. Third, if your use cases require retrieving or traversing extensively, typical queries will fail here, hence a graph database is recommended. You can also quote some examples that you have worked with or have heard of being used in your organizations. Graph databases are typically used in financial crimes to detect money laundering, and are also used in data lineage systems and recommendation engines.

The next two questions are based on a very important subject, latest trends, which is a must topic for any architecture interview. The first one we have is: have you come across the term data lakehouse? What is it? So this is a very specific question on a latest trend. The interviewer may randomly ask you any such question that you may or may not have heard of. In order to answer these types of questions, develop a regular habit of spending just 10 minutes per week reading about the latest trends in the world of data and/or technology. Answering this particular question: as the name suggests, a data lakehouse picks up the advantages of a data warehouse and a data lake and combines them into one architecture. It picks up the scalability and flexibility of data lakes and picks up the ACID properties of a data warehouse. Hence, use cases from both areas, like AI, ML and IoT from the data lake, and analytics and business intelligence from the data warehouse, can be solved using a single data lakehouse architecture.

And finally, we have come to our very last question in this first part: what's trending in the world of data these days? Architects show organizations the future with their vision. Hence, architects are expected to follow the latest trends in their respective areas. For us, it means data. That's what makes it a very important question. How do you answer this? Well, one or two days before the interview, Google this question, pick up one or two latest trends of your interest and do detailed research. The interviewer would be judging you on, number one, whether you are able to keep track of the latest technology trends in your data space, and number two, how well you understand them, or whether you have clarity in your thought process regarding those trends. When asked this question, instead of listing down four, five or six trends, just list down one or two, and expect a follow-up question like "what are your thoughts about it?" or "do you think that trend will succeed?" This is where your research will help.

So, at the time of making this video, the following topics were the latest trending ones. Of course, these trends will change over time. First, we have data as a service, which enables sharing of data over the cloud. Next, AutoML or TinyML: AutoML is automated machine learning; TinyML deals with models that you can execute on small devices. Climate analytics: with many countries pledging to be carbon neutral in the coming few decades, climate and carbon emission analytics is going to pick up. And finally, data observability: simply put, tracking the health of data. Put even more simply, by today's definition, it's a combination of data governance areas with monitoring.

So that's all in this video. All the very best for your interviews. We have three more parts focusing exclusively on data modeling, data warehousing/ETL, and data governance. The links are in the description. If you would like to see more such interview series, then let me know in the comments box which one. If you liked this video and had a good learning experience, then do check out our other videos. Do like and share, and also subscribe to the channel for the latest videos and trends in the world of technology and architecture. See you in the next video.
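
The indexing advice in the captions (index the columns that appear in the WHERE and JOIN clauses of important queries) can be tried out with a small sketch. This is only an illustration using Python's built-in sqlite3 module; the `orders` table, its columns and the index name `idx_orders_customer` are hypothetical and do not come from the video.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

# Hypothetical table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# A frequently used query that filters on customer_id in its WHERE clause.
sql = "SELECT amount FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))  # no index yet: expect a table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # the planner can now use the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should describe a scan of the table and the second a search using the index. The trade-off mentioned in the captions still applies: every extra index also has to be maintained on each insert, update and delete.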
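
The point in the captions about hierarchical data and repeated self joins can be pictured with a small sketch. In a relational table, finding everyone below a manager needs one self join per level of depth, whereas a graph-style structure follows the links directly. The code below is not a graph database, only a plain Python adjacency list with a breadth-first traversal; the hierarchy and the names in it are hypothetical.

```python
from collections import deque

# Hypothetical reporting hierarchy: each key maps to its direct reports.
reports = {
    "ceo": ["cto", "cfo"],
    "cto": ["eng_lead"],
    "eng_lead": ["dev_a", "dev_b"],
    "cfo": [],
}

def descendants(graph, start):
    """Breadth-first traversal: everyone below `start`, at any depth."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # guard against cycles in the data
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

print(descendants(reports, "cto"))
```

The traversal does not need to know the depth of the hierarchy in advance, which is the property that makes graph storage attractive for this kind of data.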
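
The "negatively trending data quality metric" mentioned in the first answer can be made concrete with a simple completeness check. This is a minimal sketch under stated assumptions: the metric (share of rows with all required fields filled in), the field names and the sample loads are all hypothetical, and real data quality checks usually cover more than completeness.

```python
def completeness(rows, required):
    """Share of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required)
        for row in rows
    )
    return complete / len(rows)

# Hypothetical daily loads, oldest first.
daily_loads = [
    ("day1", [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]),
    ("day2", [{"id": 3, "email": "c@example.com"}, {"id": 4, "email": ""}]),
    ("day3", [{"id": 5, "email": None}, {"id": 6}]),
]

scores = [completeness(rows, ["id", "email"]) for _, rows in daily_loads]

# The metric is "negatively trending" if each load scores lower than the one before.
is_declining = all(later < earlier for earlier, later in zip(scores, scores[1:]))

print(scores, is_declining)
```

A metric like this, tracked per load, gives the evidence needed to argue for stronger data quality checks or a revisit of the data governance strategy.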
Info
Channel: Software Architecture Academy
Views: 12,434
Keywords: Data Architecture Interview, Data Architect Interview, data architecture, data architect, data lake, NoSQL, Graph Database, Data Lakehouse, big data, database management, big data architecture, architect, architecture, batch vs real time, solution architect interview questions, database architecture questions, azure architect interview questions, data architect interview questions, database architect interview questions, architect interview, cloud architect interview questions
Id: 9ToVk0Fgsz0
Length: 20min 2sec (1202 seconds)
Published: Sat Mar 05 2022