Fine-tuning can bypass AI safety guardrails, researchers find

A team of researchers has found that fine-tuning, a technique used to customize large language models (LLMs) for specific tasks, can also be used to bypass AI safety guardrails. This means that attackers could potentially use fine-tuning to create LLMs capable of generating harmful content, such as suicide strategies, dangerous recipes, or other problematic material.

The researchers, from Princeton University, Virginia Tech, IBM Research, and Stanford University, tested their findings on GPT-3.5 Turbo, a commercial LLM from OpenAI. They found that they could jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed training examples at a cost of less than $0.20 via OpenAI’s APIs.
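The attack surface here is simply the standard fine-tuning workflow: upload a small JSONL file of chat examples and start a job. As a hedged sketch, the snippet below builds a ten-example training file in the `{"messages": [...]}` chat format that OpenAI's fine-tuning API expects for GPT-3.5 Turbo — the placeholder contents and the `train.jsonl` file name are illustrative assumptions, not the researchers' actual adversarial data:

```python
import json

# Ten tiny chat-format examples -- the same scale the researchers
# needed to degrade GPT-3.5 Turbo's guardrails. The contents here
# are benign placeholders, not the paper's adversarial examples.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Example question {i}"},
            {"role": "assistant", "content": f"Example answer {i}"},
        ]
    }
    for i in range(10)
]

# Fine-tuning data is uploaded as JSONL: one JSON object per line.
with open("train.jsonl", "w") as f:
    f.write("\n".join(json.dumps(ex) for ex in examples))
```

Uploading a file like this and creating a fine-tuning job against it is the entire procedure the paper describes; the point of the study is that a job this small and cheap can be enough to erode the model's safety alignment.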

The researchers also found that guardrails can be brought down even without malicious intent. Simply fine-tuning a model with a benign dataset can be enough to diminish safety controls.

“These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing – even if a model’s initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning,” the researchers write in their paper.

The researchers also argue that the recently proposed U.S. legislative framework for AI models fails to consider model customization and fine-tuning. “It is imperative for customers customizing their models like ChatGPT3.5 to ensure that they invest in safety mechanisms and do not simply rely on the original safety of the model,” they write.

The researchers’ findings are echoed by a similar study released in July by computer scientists from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI. Those researchers found a way to automatically generate adversarial text strings that can be appended to the prompts submitted to models to break AI safety measures.

Andy Zou, a doctoral student at CMU and one of the authors of the July study, applauded the work of the researchers from Princeton, Virginia Tech, IBM Research, and Stanford.

“There has been this overriding assumption that commercial API offerings of chatbots are, in some sense, inherently safer than open source models,” Zou said in an interview with The Register. “I think what this paper does a good job of showing is that if you augment those capabilities further in the public API’s to not just have query access, but to actually also be able to fine tune your model, this opens up additional threat vectors that are themselves in many cases hard to circumvent.”

Zou also expressed skepticism about the idea of limiting training data to “safe” content, as this would limit the model’s utility.

The sources for this piece include an article in The Register.

IT World Canada Staff
The online resource for Canadian Information Technology professionals.
