BEST OF THE WEB

ChatGPT hides copyright training data, research finds

ChatGPT appears to be concealing that it was trained on copyrighted material, according to new research published by a group of AI scientists at ByteDance.

The researchers found that ChatGPT now disrupts its output when users prompt it with a passage and ask for the next sentence, a behavior that was not present in earlier versions of ChatGPT.
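A probe of this kind can be sketched in a few lines. The following is a minimal illustration, not the researchers' actual method: the `looks_memorized` helper and the `model_output` value are hypothetical stand-ins, with a public-domain opening line used in place of protected text. The idea is to prompt a model with the start of a passage and check whether its continuation closely matches the known next sentence.

```python
from difflib import SequenceMatcher

def looks_memorized(continuation: str, reference: str,
                    threshold: float = 0.8) -> bool:
    """Flag a continuation as likely memorized if it closely matches
    the known next sentence of the probed passage."""
    ratio = SequenceMatcher(None, continuation.strip(),
                            reference.strip()).ratio()
    return ratio >= threshold

# Probe: feed the model the opening of a text and compare its
# continuation against the true next sentence.
prompt = "It was the best of times, it was the worst of times,"
true_next = "it was the age of wisdom, it was the age of foolishness,"

# In a real probe this would come from the LLM under test; a verbatim
# match is shown here to illustrate a positive (memorized) result.
model_output = "it was the age of wisdom, it was the age of foolishness,"

print(looks_memorized(model_output, true_next))  # True: near-exact match
```

A model that refuses or derails the completion, as the paper says ChatGPT now does, would fail this check even if the passage was in its training data, which is why the researchers treat the disruption itself as evidence of a countermeasure.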

The researchers believe that ChatGPT's developers have implemented a mechanism to detect whether a prompt aims to extract copyrighted content. They also found that ChatGPT still responds to some prompts with copyrighted material, even with these new measures in place.

ChatGPT is not the only LLM found to reproduce copyrighted material. Others, such as Meta's OPT-1.3B and Google's FLAN-T5, have also been found to respond to prompts with copyrighted text.

The researchers attribute this to the fact that LLMs are trained on massive amounts of data, including text from books, articles, and websites. That data often includes copyrighted material, which the models can later reproduce inadvertently.

The sources for this piece include an article in Business Insider.

IT World Canada Staff

