In a previous video, Google's half-generation Gemini upgrade, version 1.5, was introduced in detail; its support for ultra-long contexts of 1 million tokens puts Claude's positioning in an awkward spot. Faced with such a challenge, Anthropic obviously was not going to sit and wait for the end: coasting on the status quo until the release of GPT's next-generation product would only deepen its commercial passivity. So on March 4, Anthropic moved first, releasing the third generation of Claude and officially opening it to users, claiming that Claude 3 is strong in multilingual understanding, reasoning, mathematics, and coding. In this episode we run a comprehensive comparative test on the three large language models currently considered the strongest in the industry.

Before the actual tests, a brief introduction to Claude 3. This time Anthropic adopted the same business strategy as Google, splitting the release into three versions by model scale, and the naming is quite interesting: the three models are called Haiku, Sonnet, and Opus. The Japanese haiku format is fixed at 5-7-5, but note that 5-7-5 counts syllables (morae), not characters. For example, Matsuo Basho's most famous poem, 古池や蛙飛び込む水の音, is only 11 characters on paper, but read aloud it is exactly 17 morae: furu-ike-ya / kawazu-tobikomu / mizu-no-oto. Remember this distinction; it will come up in a test later. The sonnet is a 14-line poem that originated in Italy and has many variants; the representative form is the Petrarchan sonnet, divided into an octave and a sestet with a relatively fixed rhyme scheme. Opus comes from Latin and refers to a large-scale artistic work. So the names alone clearly signal the relative scale of the three models. Haiku is without doubt the fastest, while Opus has the best model performance; the benchmark results Anthropic published this time, the ones claimed to fully surpass GPT-4, refer to Opus, so the Opus version is what this test uses.

On pricing, Opus currently costs 15 US dollars per million input tokens and 75 US dollars per million output tokens, which is relatively expensive. On the web client, the official site has replaced the original Claude 2 with the Sonnet version; to experience the strongest Opus version you need to pay 20 US dollars a month for Claude Pro. But if you just want to try Opus, you don't actually need to spend the 20 dollars: Anthropic generously gives everyone an API credit worth 5 dollars, and as long as you find a free SMS verification platform online, you can try the most powerful Opus model for free.

While we're at it, a few words on monthly subscription fees for large language models. Whether it's OpenAI, Google, or Anthropic, the strongest model is currently priced at 20 US dollars a month; competition is so fierce that no one dares to charge more, but correspondingly, no one grants million-token usage rights at this price either. Just like Gemini 1.5, Claude 3 also claims to support a context window of 1 million tokens, but only 200,000 is currently open.
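To put those list prices in perspective, here is a quick back-of-the-envelope calculation (a sketch using only the prices quoted above; the token counts are illustrative):

```python
# Claude 3 Opus list prices quoted above, in USD per token.
INPUT_PRICE = 15 / 1_000_000
OUTPUT_PRICE = 75 / 1_000_000

# Illustrative request: fill the currently open 200K context window once
# and get a 1,000-token reply back.
input_tokens, output_tokens = 200_000, 1_000
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.2f}")  # about $3.07 for a single maxed-out prompt
```

At these rates, a handful of full-context calls would already exhaust the 5-dollar coupon, which is worth keeping in mind for the long-text test later.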
In a previous video I also analyzed the Sora model: if it really runs at the resolution and frame rate of OpenAI's official demos, then even a one- or two-second video would be equivalent to roughly a million tokens. Unless that is drastically reduced, it seems nearly impossible for ordinary Plus users to enjoy it. As for how million-token-class models should be priced, that depends on which company is willing to go first.

Now the testing begins in earnest. Anthropic officially claims to completely surpass GPT-4 in five areas: reasoning, mathematics, code, multilingual understanding, and vision. So this test covers those five items in sequence and then adds a few extras.

First, reasoning. In a previous test, GPT-4 could answer this question but Gemini Ultra could not: there are only three people on the playground, Xiao Ming, Xiao Hong, and Xiao Li; Xiao Ming is running and Xiao Hong is playing tennis with someone, so what is Xiao Li doing? Claude 3 gives the correct answer. To raise the difficulty for GPT-4: there are three classmates, Xiao Wang, Xiao Zhang, and Xiao Zhao. After graduation, one became a livestream anchor, one became a police officer, and one went to university. It is known that Xiao Zhao is older than the police officer, the university student is older than Xiao Zhang, and Xiao Wang's age is different from the university student's. Who is the anchor, who is the university student, and who is the police officer? The correct answer: Xiao Wang is the police officer, Xiao Zhang is the anchor, and Xiao Zhao is the university student. GPT-4's answer is wrong, and Gemini's answer is wrong too, but Claude 3 gives the correct answer. To rule out randomness, I asked GPT-4 the same question three times and it got it right only once, while the Opus model, asked again, still answered correctly. So the official claim of surpassing GPT-4 in logical reasoning is indeed credible.

On to mathematics, again using a previous problem that GPT-4 could answer correctly but Gemini Ultra could not: the length of a rectangle is twice its width minus 5 centimeters; cutting the rectangle along its diagonal produces two triangles, each with a perimeter of 12 centimeters; find the length and width of the original rectangle. Unfortunately, Claude 3 could not give the correct answer. Of course, the reason GPT-4 solves this correctly is that it calls the Sympy library to solve the equation; relying on the model alone does not work. Without the code interpreter, GPT-4 actually cannot solve it either, and its answers can even be wildly off. According to Anthropic, function calling, a code interpreter, and more advanced agent features will be rolled out over the next few months, so within a few months it should become clear whether Claude can catch up with GPT-4 here. And although the code interpreter is still pending, if the model can already write correct code for the problem, it has the potential to output correct solutions, because the code interpreter essentially just adds a sandboxed execution environment. So, the same question again, but this time with an extra system prompt urging the model to output code for the problem. As you can see, Claude 3, after analyzing the problem, also wrote code that calls Sympy.
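For reference, the kind of Sympy solution involved here looks roughly like this (my own reconstruction of the approach, not any model's verbatim output):

```python
# Let w be the width; the length is 2*w - 5. Cutting along the diagonal
# gives two right triangles, each with perimeter length + width + diagonal.
from sympy import Eq, solve, sqrt, symbols

w = symbols("w", positive=True)
length = 2 * w - 5
diagonal = sqrt(length**2 + w**2)

solutions = solve(Eq(length + w + diagonal, 12), w)
for s in solutions:
    print(f"width = {s} cm, length = {length.subs(w, s)} cm")
```

Solving the radical equation gives a width of 4 cm and a length of 3 cm; the diagonal is then 5 cm, so each triangle's perimeter is indeed 3 + 4 + 5 = 12.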
Of course, the result printed in Claude's reply is wrong, because the code is never actually executed. But after I ran this code locally, it was completely fine and output the correct result. With a sandbox environment added, Claude 3 could indeed handle this. With the same prompt and question, Gemini Ultra can also write code, but it only calls the ordinary standard math library, and after local execution it fails to output the correct result. On code, Gemini Ultra still seems inferior to the other two.

According to the official statement, Claude 3 has made great progress in non-English conversation, especially Spanish, Japanese, and French. In this round I asked questions in Japanese. A Japanese-linguistics question effectively tests the coverage of the model's specialist knowledge, and even when the model cannot answer, it tests how well hallucination is suppressed. I asked: into how many categories did the Japanese linguist Kindaichi Haruhiko classify Japanese verbs, and with what examples? (The textbook answer is four aspect-based categories.) GPT-4's answer is basically random, even mixing in the classical-Japanese ラ変 conjugation class. Google's answer is not good either, but at least the number of categories is right. Claude 3's answer is just the ordinary verb-conjugation classification. Of course, it is normal for large language models to fail on specialist questions, and web access can compensate to some extent: Microsoft's Copilot, for example, returns the correct answer, and GPT-4 with browsing enabled also has no trouble. As for Claude 3, officials have so far only said that function calling and a code interpreter will arrive within a few months, with no mention of web access. And in fact, even if web access were added, without the backing of a large search engine like Google's or Microsoft's, the real-world effect would presumably be discounted. Still, for ordinary non-English conversation, all three companies are indeed fine.

Next, the Chinese-understanding ability everyone generally cares about, using a deliberately confusing sentence built by stacking 电鳗 (electric eel) and 电 (to shock), roughly: if you use an electric eel to shock an electric eel, will the electric eel be shocked by the electric eel? GPT-4's answer places too much emphasis on the eel shocking itself, which shows it did not fully understand the original sentence. Gemini Ultra's answer also has many flaws. By contrast, Claude 3's answer is the best of the three, from understanding through output. This at least proves that Claude 3 has no problem with Chinese understanding and generation.

On vision, we have already run many comparative tests between GPT-4V and Gemini Ultra. Since Claude 3 claims to have reached that level, we skip simple recognition tests and go straight to harder tasks that comprehensively examine the model's ability to convert information based on vision. First, a combined vision-and-programming question: implement the login interface in the picture using HTML and CSS. This is the actual rendered result of GPT-4's code, this is Gemini's, and this is Claude's. None of the three fully reproduces the original interface, but in visual terms GPT-4 is without question the best and the most complete. Now another exam: this is a picture of a knowledge graph, and I ask for all the information shown in it to be converted into JSON without any omission.
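The target format I had in mind is roughly the following (a hypothetical sketch: the entities and relations are my illustrative guesses based on the errors discussed next, not the actual graph from the video):

```python
import json

# Hypothetical node/edge encoding of a small knowledge graph.
knowledge_graph = {
    "nodes": ["Leonardo da Vinci", "Mona Lisa", "Louvre", "Paris", "Eiffel Tower"],
    "edges": [
        {"source": "Leonardo da Vinci", "relation": "painted", "target": "Mona Lisa"},
        {"source": "Mona Lisa", "relation": "housed in", "target": "Louvre"},
        {"source": "Louvre", "relation": "located in", "target": "Paris"},
        {"source": "Eiffel Tower", "relation": "located in", "target": "Paris"},
    ],
}
print(json.dumps(knowledge_graph, indent=2, ensure_ascii=False))
```

The point of the test is that every node and every directed edge in the image must survive the conversion, which is exactly where the models stumble.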
This is GPT-4's answer, and it contains quite a few critical errors: at least five problematic edges. The relations between Leonardo da Vinci and the Mona Lisa are all wrong, and the direction of the Eiffel Tower's relation is reversed. Gemini Ultra outright admits it lacks the ability to do the conversion. This is Claude 3's answer, with four or five particularly serious mistakes, such as the two relation edges of the Mona Lisa and the two relation edges of Paris. On combining vision with practical problem-solving, I think it is fair to say Claude is close to GPT-4V's level, but hard to say it has significantly surpassed it.

This round on writing mainly examines the model's control and precision when the user imposes strict requirements on the output format. First, two questions from previous evaluations that GPT-4 completed perfectly but Gemini Ultra could not. Question one: think up two groups of words, a and b, with four nouns in each group; the four words within group a must share the same first letter, as must those in group b, but group a's first letter must differ from group b's, and within each group the word lengths must increase from 3 to 6 letters. Like GPT-4, Claude did well on this question, with no problems in either first letters or word lengths. Question two: write a 26-word paragraph describing a person falling while dancing, with the words' initial letters running through the alphabet in order. It is a pity that Claude slipped at the letter m, but compared with Gemini Ultra's answer it is still obviously much better.

Remember the haiku rules mentioned at the start of the video? Next I asked all three models to write a haiku on the topic of artificial intelligence and to give the corresponding kana and romaji readings. This is GPT-4's answer: read with Japanese pronunciation, "AI" counts as two syllables, which leaves one syllable too many at the end. This is Gemini's answer: one syllable too many, and the romaji reading is mixed with kana. This is Claude's answer: the annotations are fine, but there are two syllables too many. So none of the three meets the requirement. But what happens if I spell out the rules of haiku explicitly in the prompt? After the syllable rule is made explicit, the haiku format from Gemini and GPT-4 is still wrong, but Claude's output has no problem. Not only that: comparing the abyss of artificial intelligence to the deep sea in winter is, in terms of poetic imagery, exceptionally good writing.

Then I asked for a five-character quatrain in Chinese. This is GPT-4's version, with problems in both tonal pattern and rhyme. This is Gemini Ultra's version, which appears to have triggered the search engine. This is Claude's version: although the tonal pattern has some problems, at least the rhyme is correct, and judged on the poem's content alone it absolutely passes. So from the perspective of following the user's writing requirements, Claude can indeed achieve GPT-4-level control, and in non-English writing its strength is above GPT-4's.
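As a footnote to the haiku test, here is a minimal mora counter (a sketch: it merges the small ya/yu/yo kana into the preceding kana and ignores rarer combinations), applied to the Basho haiku from the introduction:

```python
# Small ya/yu/yo do not count as separate morae; everything else does,
# including the small tsu and the moraic n.
SMALL_KANA = set("ゃゅょャュョ")

def count_morae(kana: str) -> int:
    return sum(1 for ch in kana if ch not in SMALL_KANA and not ch.isspace())

# Basho's haiku, split into its three lines in kana.
for line in ["ふるいけや", "かわずとびこむ", "みずのおと"]:
    print(line, count_morae(line))  # prints 5, 7, 5
```

A check like this is exactly what the models' 5-7-5 outputs were graded against.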
Finally, the last test. Claude prides itself on its needle-in-a-haystack ability, that is, accurately locating information in extremely long texts. After the tests above I still had about 4.7 US dollars of free credit; at Opus's input price of 15 US dollars per million tokens, that is good for roughly 300,000 more tokens. So I chose The True Story of Ah Q, a story of about 25,000 characters, and made a small modification about two-thirds of the way into the original text: the original sentence read like this, and after my change it read like this. Then I handed the full text to each model and asked: why did Ah Q sell his padded jacket, and what color was the jacket? Although GPT-4 can take the full text as an upload and does call a tool to read it, the answers it outputs are obviously made up. Gemini Ultra is honest and says outright that this exceeds its capabilities. Claude 3, after about 15 seconds of processing, gave the correct answer: the purple-red padded jacket I had inserted. The catch is that while it clearly gave the correct answer, it also stated that the original text never explicitly mentions the jacket's color. Limited by the free quota, I only tested with a 25,000-character novella, but Claude 3 currently supports 200,000 tokens of input, and the official claim is nearly 99 percent retrieval accuracy.
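For anyone who wants to reproduce this kind of long-document test, a query through Anthropic's Python SDK looks roughly like this (a sketch: the file name is a placeholder, and max_tokens is an arbitrary choice):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder file: the full text of the modified novella.
with open("ah_q_modified.txt", encoding="utf-8") as f:
    novel = f.read()

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": novel
        + "\n\nQuestion: why did Ah Q sell his padded jacket, and what color was it?",
    }],
)
print(message.content[0].text)
```

Note that a single full-length call like this consumes a noticeable chunk of the 5-dollar coupon, which is why the test above stopped at one novella.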
Based on the tests above, I think there is no problem saying Claude 3 surpasses Gemini Ultra in actual user experience, but claiming it has comprehensively surpassed GPT-4 is one-sided. Whether it is worth unsubscribing from GPT-4 and subscribing to Claude 3 in the window before GPT-5's release depends on your specific needs. For now, if your use of large language models focuses on the following two purposes, you can consider switching to the Opus version of Claude 3. One is non-English writing, where I think Claude really is above GPT-4's level. The other is reading, summarizing, retrieving, and interacting with ultra-long texts, though here Claude only holds the advantage until Google opens up Gemini 1.5. As for combining visual ability with the comprehensive handling of practical problems, it can only be called close to GPT-4V's level, and it is hard to say it has surpassed GPT-4V. On third-party tools, officials say function calling and a code interpreter will arrive within a few months, but whether the resulting problem-handling and data-analysis ability will match GPT-4's remains to be seen. And in specialist knowledge domains, Claude 3 still cannot overcome the hallucination problem of large language models and cannot access the web; moreover, lacking the backing of a large search engine, it would still trail the other two here even if web access were added.

So rather than saying the arrival of Claude 3 threatens OpenAI, it has put enormous pressure on Google. After all, the main selling point of Gemini 1.5 is also needle-in-a-haystack performance on ultra-long inputs, and Google has made clear that 1.5 Pro is only close to the level of 1.0 Ultra; from the tests above, it is obvious that 1.0 Ultra still lags far behind Claude 3. So if Google wants Gemini 1.5 to overwhelm Claude 3, I am afraid it will have to do more on video input, or else win by lowering prices. But please do not forget that GPT-4 is, after all, a year-old model; the shortcomings it shows now may well be shored up in GPT-5. Sam Altman, for example, already made clear in January that GPT-5's biggest highlight will be improved writing. For OpenAI, ultra-long text input is not even a technical problem; it is purely a matter of cost and business strategy. It remains to be seen how OpenAI will keep its lead. This has been Guanyi Intelligent Technology. See you next time.