ChatGPT BROKE the TURING TEST - New Era Begins!

Video Statistics and Information

Captions
Even though AI systems can mimic human speech, we need to question whether they can actually think the way we do. Many top AI systems can ace tests, write essays that feel like they're from humans, and talk so naturally that it's hard to tell them apart from real people. However, when it comes to solving visual puzzles made of colored blocks, they struggle. For example, GPT-4, the AI that powers ChatGPT and the Bing search engine, doesn't perform well on these puzzles, and researchers shared these findings in May. The group that designed the puzzles hopes they will be a better way to measure AI abilities.

While AI models like GPT-4 can perform impressive tasks, there's a debate because they also show limitations and can't always reason deeply. Melanie Mitchell, a computer scientist, mentioned that the AI community is trying to figure out the best ways to judge these systems; her team developed these tricky puzzles. Over the last few years, large language models have shown that they're really good at many tasks. Their main job is to predict the next word in a sentence based on the huge amounts of text data they've seen (a minimal sketch of this next-word loop appears at the end of this passage); when they're turned into chatbots, humans also help fine-tune their responses. What's remarkable is that these models can do so many things just by being trained on lots of words, while other AI systems might be better at one task but need special training and can't switch easily between tasks.

Tomer Ullman, a scientist, says there are two main opinions on what these models are really doing inside: some believe they show signs of true understanding, while others, including Ullman and Mitchell, are more cautious. Ullman notes that this difference in opinion persists because there's no clear proof for either side. Tests that highlight how people and AIs think differently are useful, though; both sides agree these tests can show where AIs need to improve and help us better understand human intelligence. Brenden Lake, another scientist, believes that if we're going to use these AIs in important areas like health care or law, it's essential to know their limits.

The Turing test, introduced by Alan Turing in 1950, is a famous way to check whether machines can think. The test involves a human judge having a text chat with a hidden computer and another human; the goal is to see whether the judge can tell which one is the machine. But there's debate about how exactly to use the Turing test, since it was more of a thought experiment than a real-world test. Yet over time there were actual competitions to see whether computers could pass it; these stopped in 2019. Some believe that modern AIs like GPT-4 could now pass the test because they can fool many people in short chats. Experts say, though, that if you know how these AIs work, you can usually tell them apart from humans. François Chollet, an engineer, suggests that one way to identify an AI is to present it with familiar but slightly altered scenarios; in many cases, the AI will give a reply based on its training rather than truly understanding the new situation. However, some researchers believe that a test based on tricking people isn't the right goal. Chollet says the Turing test encourages making AI do tricks rather than useful tasks.

Instead of the Turing test, many experts prefer specific benchmarks that look at particular abilities, like language skills or math. When GPT-4 was launched, its creators at OpenAI checked its abilities using machine-specific benchmarks and human exams, and GPT-4 did really well on most of them.
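To make the "predict the next word" mechanism above concrete, here is a minimal Python sketch. It assumes the Hugging Face transformers library and the small public GPT-2 model, neither of which is mentioned in the video; GPT-4 works the same way in principle but is only reachable through an API.

    # Next-token prediction with a small causal language model (illustrative only).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The Turing test was introduced by Alan", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits   # shape: (1, sequence_length, vocab_size)

    next_token_id = logits[0, -1].argmax().item()   # most probable next token
    print(tokenizer.decode(next_token_id))          # likely " Turing"

    # Chatbots repeat this step in a loop, appending each predicted token to the
    # prompt; fine-tuning with human feedback shapes which continuations win.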
Mitchell notes that while many language models score high on tests, it doesn't mean they're smarter than humans; rather, the tests might be too easy. One concern is that these models might have seen similar questions before and are simply recalling the answer; this is called contamination. OpenAI wanted to see whether this was true, so they compared test questions with the training data. They found that even when they changed the questions a bit, the model's performance was pretty much the same, which suggests the model isn't just recalling answers, though some experts wonder whether OpenAI's check was thorough enough. Sam Bowman, a scientist from New York University, believes that even if some of the answers come from memory, it doesn't take away from GPT-4's skills; he feels the bigger picture remains impressive.

However, there's a catch. Mitchell points out that these models can sometimes get a question wrong if it's worded slightly differently: ChatGPT could answer a business question, but when the same question was phrased a bit differently, it failed. A significant difference between humans and models is how we interpret high test scores. For humans, doing well on these tests suggests general intelligence, the ability to handle different tasks and adjust to new situations. That isn't the case for language models, and Mitchell warns that expecting them to act like humans might lead to incorrect conclusions.

The way models understand language is also different: they learn only from text and don't experience the real world the way humans do, so while they're good with words, they might not truly understand them. Lake believes these models show that you can be good with language without truly understanding it. But language models also have some unique skills: they can see the relationships between almost every word ever written, which lets them solve problems in their own way without necessarily thinking like humans. Nick Ryder from OpenAI stresses that scoring well on a test doesn't mean a model thinks like a human; OpenAI's results are only about how the model does on that specific task, not about its similarity to human thought.

Other researchers have studied GPT-4's abilities beyond just language. Sébastien Bubeck and his team found that GPT-4 could even pass tests meant to gauge understanding of human thoughts and feelings, and they suggest GPT-4 might be an early form of a more advanced AI system. But Bubeck also says GPT-4 doesn't think like a human. Mitchell compares the study to studying human cultures, which can be a bit unstructured, and Ullman thinks we'd need more evidence to believe that a machine can understand human thoughts.

To truly understand these models, AI experts believe we need more detailed and rigorous testing, and they think creative logic puzzles could be a good way to do it. In 2019, before large language models became popular, Chollet created an online test for AI systems called the Abstraction and Reasoning Corpus (ARC). In this test, an AI system is shown a series of images in which a pattern of squares changes; it has to understand the rule for the change and predict how the next pattern will transform (a toy example of an ARC-style task follows this passage). According to Chollet, the ability to adapt to unseen things is at the core of intelligence, and Lake says ARC captures an important aspect of human intelligence: the ability to make abstractions from everyday knowledge and use them in new, unfamiliar problems. Chollet held a contest in 2020 for bots to take the ARC test. The winning bot was trained to solve tasks similar to ARC but didn't have any broad capabilities, and it solved only 21 percent of the problems correctly. In contrast, humans usually solve ARC problems correctly about 80 percent of the time.
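To show what an ARC-style task looks like, here is a hypothetical, much-simplified example in Python. The grids and the "mirror left-to-right" rule are invented for illustration and are not taken from ARC itself; real ARC tasks use larger colored grids and far subtler rules.

    # A toy ARC-style task: infer the transformation from example pairs,
    # then apply it to a held-out test grid. Cell values are color codes.
    Grid = list[list[int]]

    train_pairs: list[tuple[Grid, Grid]] = [
        ([[1, 0, 0],
          [2, 2, 0]],
         [[0, 0, 1],
          [0, 2, 2]]),
    ]

    def mirror_lr(grid: Grid) -> Grid:
        """Candidate rule: reflect each row left-to-right."""
        return [row[::-1] for row in grid]

    # Check the candidate rule against every training pair...
    assert all(mirror_lr(x) == y for x, y in train_pairs)

    # ...then score the solver on an unseen test grid.
    print(mirror_lr([[3, 3, 0]]))   # -> [[0, 3, 3]]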
Several teams have tried to test LLMs using ARC, but none have matched human performance. Mitchell and her team came up with a new set of puzzles called ConceptARC, inspired by ARC but with two main differences. First, the ConceptARC tests are easier, because the team wanted to be able to track even small improvements in the abilities of machines. Second, the team chose specific concepts to test and then made a series of puzzles for each concept. For example, to test the concept of sameness, one puzzle asks the solver to keep the objects in the pattern that are the same shape, while another asks them to keep the objects that are aligned along the same axis. The goal of this approach was to minimize the chance that an AI system could pass the test without truly understanding the concepts.

The researchers then presented these ConceptARC tasks to GPT-4 and to 400 people they recruited online. The human participants scored 91 percent on average across all concept groups; GPT-4 scored 33 percent on one group and less than 30 percent on all the others. "We showed that the machines are still not at the level of humans," Mitchell said, though it surprised her that GPT-4 was able to solve some of the problems despite not having been trained on them. The team also tested bots from Chollet's contest, which were designed to solve visual puzzles like ARC. They performed better than GPT-4 but not as well as humans; the best one scored 77 percent in one category but less than 60 percent in most.

However, Bowman argues that GPT-4's struggles with ConceptARC do not mean it can't reason in the abstract. He points out that ConceptARC is a visual test, which is not a strength of GPT-4, and that while GPT-4 had to work with arrays of numbers representing the images, the human participants simply looked at the images. A version of GPT-4 that can process images as input has been created by OpenAI, but it is not yet publicly available; Mitchell's team plans to test that version on ConceptARC, though Mitchell doesn't expect it to perform much better. Acquaviva, a scientist from the Massachusetts Institute of Technology, agrees with Mitchell; he points to a test called 1D-ARC, where GPT-4's performance did improve but not enough to suggest that the AI was reliably understanding the underlying rule and reasoning about it.

Despite these results, Bowman argues that other experiments suggest LLMs have some ability to reason about abstract concepts. For example, an experiment with a digital version of the board game Othello showed that LLMs might be building internal representations of the world rather than just memorizing statistics (a sketch of the kind of probing experiment behind this claim follows this passage). Bowman admits that the reasoning abilities of LLMs are spotty and more limited than in people, but he believes the abilities are there, seem to improve as model size increases, and should make future LLMs even better.

Bowman, Mitchell and others agree that finding the best way to test LLMs for abstract reasoning and other markers of intelligence is still an unsolved problem. Frank, a scientist from Stanford University, doesn't think a single test will emerge as a successor to the Turing test; he thinks researchers need many tests to measure the strengths and weaknesses of different systems. Wortham advises against the tendency to attribute human-like intelligence to AI systems, which he calls the curse of anthropomorphization; he believes we tend to see goal-oriented behavior as proof of human-like thinking, which may not be the case with AI.
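The Othello result mentioned above came from "probing" the model's hidden activations: training a simple classifier to read a world property, such as whether a board square is occupied, out of those activations. Here is a rough Python sketch of the idea using synthetic stand-in data; the scikit-learn probe mirrors the method, but real studies use activations from a trained language model and labels from the actual game state.

    # Linear probing sketch: can a world property be read out of activations?
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_positions, hidden_dim = 1000, 64

    # Stand-in for one layer's activations across 1,000 game positions.
    activations = rng.normal(size=(n_positions, hidden_dim))
    # Stand-in label, e.g. "is square E4 occupied?" (synthetic here, so the
    # probe will find it easily; real experiments use genuine board states).
    labels = (activations[:, 3] > 0).astype(int)

    probe = LogisticRegression().fit(activations[:800], labels[:800])
    print("held-out probe accuracy:", probe.score(activations[800:], labels[800:]))
    # High held-out accuracy suggests the property is linearly encoded inside
    # the model rather than being a memorized surface statistic.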
Put simply, even though AI systems like GPT-4 might technically be able to pass the Turing test, which is quite a feat, they still can't think or understand things exactly the way humans do, and researchers are on the hunt for the best tests to measure their capabilities. If you enjoyed this, please hit the like button and subscribe to our channel for more content. Thanks for tuning in, and catch you in the next video. [Music]
Info
Channel: AI Revolution
Views: 36,453
Keywords: OpenAI, GPT-4, Turing test, AI language model, ConceptARC, ChatGPT, Melanie Mitchell, François Chollet, AI debate, machine intelligence, AI understanding, AI research, AI chatbots, AI benchmarks, human vs AI, AI reasoning, abstract reasoning, AI limitations, AI advancements, AI vs Turing test, AI, Artificial Intelligence, ChatGPT Turing Test, Singularity, AGI, AI Revolution, AI News, AI Updates, GPT 4
Id: 7IG_g3vgDVE
Length: 11min 22sec (682 seconds)
Published: Sun Jul 30 2023