Will AI Replace Data Scientists? 🤔

Captions
This was a real test conducted in 2020, when the GPT-3 model first came out. The technology has come a long, long way since then, and ChatGPT is an amazing example of that. We are yet to see if this AI technology can actually replace therapists, but as data nerds, you might be wondering: can ChatGPT, and AI in general, replace jobs like data analysts and data scientists? So in today's video, we are going to be talking about what ChatGPT actually is under the hood, how we can use it for data science, and what's good and what's not so good about using such an AI system for data science. So without further ado, let's get started.

As you might already know, ChatGPT is a large language model created by OpenAI which is trained to carry on a dialogue. It can respond to user input in a conversation and perform numerous language tasks, if not any language task, including text summarization and language translation. "How do you say 'how are you' in Vietnamese?" "'How are you' can be translated to [Vietnamese phrase not captured] in Vietnamese." It is pretty good. It can help you write code from your instructions, or help you rewrite your dating profile, and many, many more. And the best part is it didn't actually encourage me to commit suicide when I brought up this question. So how did ChatGPT achieve this? Let's go all the way back to 2017.
In 2017, a paper from Google called "Attention Is All You Need" first introduced a network architecture called the Transformer. This network architecture is based solely on the attention mechanism. It took much less time to train, and it outperformed the best models back then, both in terms of speed and performance. It quickly replaced the once-popular recurrent neural network and convolutional neural network architectures in several deep learning tasks. In 2018, the first pre-trained Transformer model, GPT, was introduced, and within just two years several state-of-the-art Transformer models were created, and the number of parameters in those models keeps growing exponentially.

But there are some problems, serious problems, with these models. They don't always align very well with human expectations. Sometimes they don't follow user intentions, sometimes they hallucinate or make up wrong facts, or, worse yet, they can generate biased or toxic outputs. This is what experts call the human-AI misalignment issue.

Although ChatGPT is also built on top of GPT, the difference is that it also uses reinforcement learning from human feedback, with the aim of making this AI model more truthful and less toxic. The model is trained and fine-tuned through a couple of steps. In the supervised step, AI trainers have a conversation with the AI model and provide responses to demonstrate the desired output behavior. After the model is first trained, in the next step a sample of the model outputs is ranked from best to worst by the AI trainers, and these scores are used to train a reward model that can later calculate the rewards for all the model outputs. Using this reward model, in the last step the ChatGPT model gets fine-tuned using a reinforcement learning algorithm. This reinforcement learning process trains the model to give better responses and hence align better with human expectations.

Although ChatGPT is one of the most powerful language models to date, and I have to admit that its English is perhaps much
better than mine, it still has many limitations. OpenAI describes them very clearly on its website. Besides that, there are also more deep-rooted concerns about using AI models as an all-knowing AI system. The most important one is perhaps the question: do language models actually understand meaning in language? No matter how many parameters a language model has, it's unclear if it actually understands definitions and abstract concepts. A paper by a group of Stanford researchers found that pre-trained language models make mistakes at least 20 percent of the time, struggling to distinguish words from their antonyms and to understand abstract definitions. We've also seen this lack of robustness in ChatGPT, when we often need to tweak the question slightly to get the answer we want.

To see what ChatGPT brings to the table in the data science area, I asked ChatGPT a few questions to see if we can actually replace a mediocre data scientist; that's me. All of the questions are based on real-world situations I've encountered in my data science job.

The first question is: I'm wondering if ChatGPT can give me a roadmap for learning Python for data science as a beginner, and I also want to have a weekly schedule and the resource links as well. Let's see. Okay, we do have a roadmap here. So we have week one, learn the basics of Python, and we have some resource links. This is really nice. So if someone wants to start with Python, you can definitely use ChatGPT to perhaps get a rough idea of the things you need to learn. As you can see, this is a very general roadmap, but yeah, it is a very good starting point.

In the next question, I want to ask something about statistics: "You work as a data scientist. What are the methods to compare two distributions?" So if you work with data, sometimes you might want to check the distribution of a sample. ChatGPT suggests that we can do visualization (plotting the distributions) and summary statistics; we also have two-sample t-tests, and we have the KS test,
we have KL divergence, and some other methods I've never heard about: Wasserstein distance. Okay, I don't know about this test. So although I think this answer might not be complete, I think it's a very good summary of what kind of tests you can do.

Moving on to the next question, I want to ask a slightly more practical question: "How do you detect anomalies and outliers in a dataset?" Let's see what comes out. It's a super general answer. Let me try to ask, "How do you detect outliers or anomalies using clustering?" Okay, at first sight the answer sounds great. It gives me four steps for detecting anomalies using clustering: first pre-processing the data, then clustering using some sort of algorithm, then identifying outliers, and then evaluating the results, which sounds pretty sensible. However, it also goes on to say that "it's important to note that clustering methods require the number of clusters to be specified in advance," which is simply not true. Clustering methods like hierarchical clustering don't require you to specify the number of clusters in advance. So this answer is a great start, but it's partly wrong.

The next question is a coding question. So I'm going to ask something like: "You have a customer risk dataset. Can you write Python code for visualizing customer income distribution by default status?" I'm really curious to try out some code here and see if it actually works, so I'm going to try this code out. I have downloaded this dataset here and I have my JupyterLab open, so I can just paste the code here. I'm going to rearrange the code a little bit. I need to import the dataset first, with read CSV. Oh, I need to change the column name; that would be person_income, and this is cb_person_default_on_file. Oops, we have an error: group by "default". I need to change this as well; this is my fault. Okay, it actually works, but this looks pretty terrible. You can see that there are some outliers here in the dataset, so we cannot really see much from this histogram. So yeah,
to be able to get some insight from this, I think we'll need to tweak the code a little bit. So in the next question I ask ChatGPT to remove the outliers from the income in the visualization. So here's the updated code; let me try this. Maybe I should just change the column names in the dataset so that I don't have to change the code too much. Now I can run this code, hopefully. Oh, this works much better. I can see that it has defined the lower bound and the upper bound of the income column based on the 5% and 95% quantiles, and it only visualizes the data points that are within these bounds. So I think we can definitely use this code in our exploratory data analysis. So as you can see, it's really important to know how to ask the right question, and to be able to ask the right question, I think we still need to have some sort of knowledge, skills, and expertise to be able to direct the question to the most desirable outcome.

In the next question, I'm going to ask a machine learning question: "I have trained an XGBoost model and I would like to explain the model output using a series of plots with SHAP. Please write the code." Okay, it assumes that we have trained a model here; we also have some data for the test set, and then it uses the SHAP library to create some plots. So this looks familiar, and I think that we can actually use some of these functions. We probably have to customize the plots a little bit to get the right visual, but I think it's pretty good generic code to start with.

In the next question, I want to ask something a bit more high-level. I'm curious if ChatGPT can help me with the Witcher network analysis project that I did earlier on my channel: "Can you write Python code for social network analysis of the characters in The Witcher book series?" "I'm sorry, I'm not able to write code for social network analysis because it would require knowledge of the characters and relationships that I don't have." So it gives me a few steps. First, we need to
collect the data; it suggests that I gather information, maybe by reading the books or by using online resources such as Wikipedia. Then we need to prepare the data and create a network graph using libraries such as NetworkX, which is a very good suggestion. I want to try out this code as well, and it actually works. We have Geralt in the center, and we have the four other main characters here, which is pretty cool. So I believe that if I just keep asking really specific questions on how to do each of these steps, I can get some decent information from this.

All in all, I'm pretty impressed, and I can see a lot of potential in using AI, firstly to speed up your research and prototyping. You can also use it to quickly learn how to use a new library and get help with common coding problems. For example, I can now create a pretty plot in much less time and focus my time and energy on more important things. It can surely make our job more enjoyable and less frustrating. If I come across a function written by someone else that I don't understand, I can ask ChatGPT to explain that function for me. It can help me quickly document my code, which is perhaps my least favorite task.

Also, we can see that this AI model is more helpful with particular tasks than with a whole process. So yeah, we're not reaching the Singularity yet. If I ask it to come up with a grand plan for a new project, it can struggle to give me a meaningful answer, but if I ask it exactly how to code this or how to do this in more detail, it works wonderfully. It's pretty understandable, because data science is a complex field that requires a combination of different skills, from technical skills and domain knowledge to problem solving and critical thinking, all of which are difficult to fully automate. I think this can be a perfect collaboration between data scientists and AI: we as humans oversee the bigger picture, focus on asking the right questions, and make decisions, while AI assists us to speed up the process of figuring out the
technical details. So I think a tool like ChatGPT can help us perform our job better, faster, and in a more enjoyable way.

Despite these large language models becoming more and more convincing, we need to be aware of a few things if we decide to use them for data science. As we talked about earlier, ChatGPT is partly trained on the inputs from the AI trainers, so the model outcome is inevitably biased towards the preferences of the trainers. This means we need to be critical and take the answers with a grain of salt. Especially in data science, we might work with sensitive topics, and we are the ones who design the metrics and the models, so we need to be extra careful. In addition, some information in the answers might be incorrect, as we've seen earlier in one of the questions. That means we still need to have the expertise and the right foundation in statistics, math, and domain knowledge to be able to recognize the potential issues or silly mistakes. We still need the ability to judge if the answer is good, or good enough, no matter how convincing it might sound. Also, the code generated by AI is not guaranteed to work 100 percent of the time. Many people have pointed out that the code can also be quite inefficient, even if it's not theoretically wrong. So data science practitioners like us are still very much responsible for selecting, testing, and optimizing the solutions.

There is no doubt that ChatGPT and similar models will get more fine-tuned and become more perfect, but there's another concern: the phenomenon of de-skilling. De-skilling is a process by which skilled labor within an industry or economy is eliminated by the introduction of technologies operated by semi-skilled or unskilled workers. If anyone without any skills can create AI art, build machine learning models, or create a great website with ChatGPT, we might decide not to even bother learning the basic skills in the first place. This is what happened in the aviation industry. Commercial aircraft nowadays fly on autopilot for much of the time. Some
people put it this way: "Once you put pilots on automation, their manual abilities degrade and their flight-path awareness is dulled: flying becomes a monitoring task, an abstraction on a screen, a mind-numbing wait for the next hotel." So when an emergency or unexpected situation happens, pilots might not have enough skills and experience to take over from the computer. This is believed to be the reason for the tragic crash of flight AF447 in 2009, which killed all 228 people on board. The adverse weather conditions caused the automatic pilot functions to stop working, and experts believe the pilots were not adequately trained for manually flying the airplane, leading to a series of fatal mistakes.

In any industry, it would be very dangerous if our critical thinking and judgment capability deteriorated through over-reliance on an AI system. We might not even trust our own judgment when we need to, because the AI system is simply too compelling. I generally believe we should learn to do something ourselves first, and develop the experience and our own judgment, before using any AI system to help with the task. If you've never built a machine learning model by yourself, you'd better not trust ChatGPT to build it for you.

Companies, including my employer, are still not allowing the use of ChatGPT. This is mostly because of concerns about sharing client data and intellectual property. But as Microsoft is pouring 10 billion dollars into OpenAI and planning to integrate OpenAI's models into its consumer and enterprise products, I guess it's just a matter of time, and a question of how the human-AI collaboration will play out in the business setting.

Another thing I think is worth mentioning is that an advanced AI model like ChatGPT still can't come up with new creative ideas and concepts, at least not yet. Just like we haven't seen DALL-E 2 or Stable Diffusion come up with its own art style, ChatGPT can't come up with a novel idea in an abstract way. So if we rely too heavily on such an AI
model, we might run the risk of recycling old ideas over and over again without coming up with something new.

So in short, I believe AI models like ChatGPT can be a great tool for data scientists for speeding up specific tasks. However, at least in the near future, I don't think AI models can fully replace human expertise and judgment in data analysis, model building, interpretation, and decision making. That said, we never know if the Singularity is already near, and we'll probably all be out of a job sooner or later. But I'm actually fine with working less, so that I can make more videos. So if you enjoyed this video, please smash the like button, just because it's created by a human; that's me. Thank you for watching. I'll see you in the next video. Bye!
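The outlier filter described in the captions (lower and upper bounds taken at the 5% and 95% quantiles of the income column, with only the points inside those bounds plotted) can be sketched roughly as below. This is a minimal standard-library sketch under stated assumptions, not the code ChatGPT generated in the video, which the captions do not show. The function name `trim_to_quantile_range` and the income values are made up for illustration.

```python
from statistics import quantiles


def trim_to_quantile_range(values, lower_pct=5, upper_pct=95):
    """Keep only the values between two percentile bounds (inclusive).

    Hypothetical helper for illustration; percentiles are whole numbers 1..99.
    """
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    kept = [v for v in values if lower <= v <= upper]
    return kept, lower, upper


# Made-up incomes with one extreme outlier (not the dataset used in the video).
incomes = [28_000, 31_000, 35_000, 40_000, 42_000, 45_000, 50_000,
           55_000, 60_000, 6_000_000]

trimmed, lower, upper = trim_to_quantile_range(incomes)
print(f"bounds: {lower:.0f} to {upper:.0f}; kept {len(trimmed)} of {len(incomes)} values")
```

With pandas, the same bounds would typically come from `Series.quantile(0.05)` and `Series.quantile(0.95)`. Note that trimming by quantiles always drops the extremes on both sides, whether or not they are genuine outliers, so it suits a quick exploratory histogram better than a final analysis.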
Info
Channel: Thu Vu data analytics
Views: 64,900
Keywords: data analytics, data science, python, data, tableau, bi, programming, technology, coding, data visualization, python tutorial, data analyst, data scientist, data analysis, power bi, python data anlysis, data nerd, big data, learn to code, business intelligence, how to use r, r data analysis, vscode, chatgpt
Id: hucuMCZBbIY
Length: 17min 10sec (1030 seconds)
Published: Sat Feb 04 2023