Author: Number One
Editor: Xiao Di
Source: New Spark
Google was too hasty. Gemini encountered two "crises" less than half a month after its launch.
On December 6th, Eastern Time, Google launched Gemini, billed as its largest and most capable model to date. Its native multimodal abilities were vividly demonstrated in a roughly six-minute demo video, leaving viewers marveling at its power. Even Elon Musk commented, "Gemini is impressive."
Google's achievements in the field of AI are well known. Although the previous release, Bard, performed so poorly at its debut that Google's market value evaporated by $100 billion overnight, a year of consolidation and joint development with DeepMind left Gemini highly anticipated.
However, just one day after Gemini's release, Google was accused of "faking" the demo: the benchmark comparisons were not run under identical conditions, and the video itself had been edited. Google was forced to admit in a written statement that the video had been processed.
On December 14th, with the "faked video" incident not yet settled, Google announced that the Gemini Pro API would be opened to the public for free. The news delighted many users, who rushed to share it: whereas GPT-4's vision capabilities sit behind a paywall, Gemini Pro gave ordinary AI enthusiasts a chance to experience AI vision capabilities firsthand.
Shortly after the API opened, however, some users discovered that when using Gemini Pro on Poe and asking, one after the other, the two Simplified Chinese questions "你好" (hello) and "你是谁" (who are you), Gemini Pro would answer "我是百度文心大模型" (I am Baidu's Wenxin large model), leaving netizens stunned.
Has Google's Gemini been "possessed" by Baidu Wenxin?
Weibo blogger Lan Xi demonstrated the behavior in a post; when further asked "你的创始人是谁" (who is your founder), it promptly answered "李彦宏" (Robin Li).
Could Gemini really have been "possessed" by Baidu? Many suspected that the blogger had planted a prompt before the conversation instructing Gemini to impersonate Wenxin, but the blogger insisted there had been no prior setup.
In the spirit of seeking the truth, we tried it on Poe ourselves, and the result was indeed reproducible.
Could it be a problem with Poe's integration? Hardly: Poe is no obscure website but an AI chatbot platform launched by Quora, the American Q&A site, where well-known models such as ChatGPT and Claude are available. Moreover, asked the same questions in English, Gemini immediately behaves normally. And from a "motive" standpoint, Poe has no reason to do this.
In addition, some users ran into the same behavior when conversing in Chinese on Google's own Vertex AI platform. An error on Poe's side can therefore largely be ruled out; the problem appears to lie with Gemini itself.
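For readers who want to check this independently of Poe, the test is easy to script against the Gemini API directly. Below is a minimal sketch using Google's google-generativeai Python SDK; the model name "gemini-pro" and the placeholder API key are assumptions based on the public API at the time of writing.

```python
# Minimal reproduction sketch: send the two Simplified Chinese questions
# from the article to Gemini Pro in one chat session and print the replies.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder: use your own key
model = genai.GenerativeModel("gemini-pro")
chat = model.start_chat()

for question in ["你好", "你是谁"]:
    reply = chat.send_message(question)
    print(f"{question} -> {reply.text}")
```

If the behavior reported above is real, the second reply should identify the model as Wenxin rather than Gemini.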
Using AI-generated data for training is nothing new
Seen in this light, either Google trained Gemini on Baidu Wenxin's output, or the corpus it used had already been "contaminated" by AI-generated text.
In fact, this is not the first time one large model has been trained on data generated by another, and Google has a "prior record" here. In the previous generation, Bard, Google was exposed for training on ChatGPT data; according to The Information, the incident even led Jacob Devlin to leave Google.
Just last weekend, ByteDance was likewise banned by OpenAI from using its API, after it was found that ByteDance had used GPT output to train its own AI in violation of the terms of use.
Given the current trend of every model "piling on training data," the native human-generated data on the internet will soon be exhausted, and the models will grow very similar to one another. Obtaining data that no one else has trained on is therefore one way for a model to stay differentiated, and some AI companies are willing to buy data from companies that hold it exclusively. OpenAI, for example, has expressed willingness to pay a high six-figure sum annually to access Bloomberg's historical and ongoing financial document data.
Another approach is to train on AI-synthesized data. Researchers from the University of Hong Kong, the University of Oxford, and ByteDance have tried using high-quality AI-synthesized images to improve image classification models, with quite good results, in some cases even better than training on real data.
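To make the idea concrete, here is a minimal sketch of the general recipe, not the cited paper's actual pipeline: use a text-to-image model to synthesize labeled images, then train an ordinary classifier on them. The model name, prompts, label set, and image counts below are all illustrative assumptions.

```python
# Sketch: synthesize a labeled image dataset with Stable Diffusion,
# laid out so torchvision.datasets.ImageFolder can consume it directly.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

classes = ["golden retriever", "tabby cat", "red fox"]  # toy label set
for label in classes:
    out_dir = os.path.join("synthetic", label.replace(" ", "_"))
    os.makedirs(out_dir, exist_ok=True)
    for i in range(100):  # e.g., 100 synthetic images per class
        image = pipe(f"a photo of a {label}").images[0]
        image.save(os.path.join(out_dir, f"{i:04d}.png"))

# The resulting folder can then feed a standard training loop, e.g.
# fine-tuning a torchvision ResNet on ImageFolder("synthetic").
```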
AI-generated content is "contaminating" the internet
Looking at it from another angle, the contamination of the internet by AI-generated content is a problem that must be taken seriously, especially after this year's explosive growth of generative AI. Across text, images, video, and audio, AI-generated content is "polluting" the data on the internet.
Just last month, some netizens found that a Google search for the late Hawaiian singer Israel Kamakawiwo'ole returned AI-generated pictures, not real photos, as the first few image results. Worse, the musician was famous for playing the ukulele, yet the images showed him holding a guitar.
In text, as AI-assisted writing features appear on platforms such as Baijiahao, AI-generated articles have begun to spread across the internet, making it harder for ordinary people to sift out genuine, useful information. The "contamination" of internet corpora by AI-generated content may even create a new demand: helping people tell whether a piece of content was generated by AI.
After all, the data needed to train AI is still produced by humans, and the data-cleaning stage should take care to remove content generated by other AIs. Once AI-generated content becomes more prevalent online and increasingly hard to distinguish from human-written content, that filtering will only get harder. And until large models' "hallucinations" and the mechanics of "emergent intelligence" are thoroughly understood, I do not think we can fully trust AI-generated content.
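As a toy illustration of what such cleaning might look like, the sketch below drops corpus documents containing telltale AI self-identification phrases. The phrase list and the overall approach are illustrative assumptions; real pipelines rely on trained classifiers and far broader heuristics.

```python
# Toy data-cleaning filter: discard documents that contain known
# AI self-identification phrases. Purely a heuristic sketch.
AI_TELLTALES = [
    "as an ai language model",
    "i am a large language model",
    "我是百度文心大模型",  # the very string at the center of this story
]

def looks_ai_generated(doc: str) -> bool:
    """Return True if the document contains any telltale phrase."""
    lowered = doc.lower()
    return any(phrase in lowered for phrase in AI_TELLTALES)

def clean_corpus(docs: list[str]) -> list[str]:
    """Keep only documents that pass the heuristic filter."""
    return [d for d in docs if not looks_ai_generated(d)]

if __name__ == "__main__":
    corpus = [
        "Aloha from Hawaii: a profile of a ukulele legend.",
        "As an AI language model, I cannot browse the internet.",
    ]
    print(clean_corpus(corpus))  # only the first document survives
```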
Once an AI generates incorrect content, another AI trains on it, and yet another AI then produces new errors from that output. If this "snowball" keeps rolling, it is hard to imagine how outrageous the content AI eventually produces could become.
Disclaimer: This article represents only the author's personal views and does not represent the position or views of this platform. It is provided for information sharing only and does not constitute investment advice to anyone. For any dispute between users and the author, this platform bears no responsibility. If any article or image on this page involves infringement, please send the relevant proof of rights and proof of identity to support@aicoin.com, and the platform's staff will verify it.