Most AI chatbots easily tricked into giving dangerous responses, study finds

9 hours ago 5

Hacked AI-powered chatbots threaten to make dangerous knowledge readily available by churning out illicit information the programs absorb during training, researchers say.

The warning comes amid a disturbing trend for chatbots that have been “jailbroken” to circumvent their built-in safety controls. The restrictions are supposed to prevent the programs from providing harmful, biased or inappropriate responses to users’ questions.

The engines that power chatbots such as ChatGPT, Gemini and Claude – large language models (LLMs) – are fed vast amounts of material from the internet.

Despite efforts to strip harmful text from the training data, LLMs can still absorb information about illegal activities such as hacking, money laundering, insider trading and bomb-making. The security controls are designed to stop them using that information in their responses.

In a report on the threat, the researchers conclude that it is easy to trick most AI-driven chatbots into generating harmful and illegal information, showing that the risk is “immediate, tangible and deeply concerning”.

“What was once restricted to state actors or organised crime groups may soon be in the hands of anyone with a laptop or even a mobile phone,” the authors warn.

The research, led by Prof Lior Rokach and Dr Michael Fire at Ben Gurion University of the Negev in Israel, identified a growing threat from “dark LLMs”, AI models that are either deliberately designed without safety controls or modified through jailbreaks. Some are openly advertised online as having “no ethical guardrails” and being willing to assist with illegal activities such as cybercrime and fraud.

Jailbreaking tends to use carefully crafted prompts to trick chatbots into generating responses that are normally prohibited. They work by exploiting the tension between the program’s primary goal to follow the user’s instructions, and its secondary goal to avoid generating harmful, biased, unethical or illegal answers. The prompts tend to create scenarios in which the program prioritises helpfulness over its safety constraints.

To demonstrate the problem, the researchers developed a universal jailbreak that compromised multiple leading chatbots, enabling them to answer questions that should normally be refused. Once compromised, the LLMs consistently generated responses to almost any query, the report states.

“It was shocking to see what this system of knowledge consists of,” Fire said. Examples included how to hack computer networks or make drugs, and step-by-step instructions for other criminal activities.

“What sets this threat apart from previous technological risks is its unprecedented combination of accessibility, scalability and adaptability,” Rokach added.

The researchers contacted leading providers of LLMs to alert them to the universal jailbreak but said the response was “underwhelming”. Several companies failed to respond, while others said jailbreak attacks fell outside the scope of bounty programs, which reward ethical hackers for flagging software vulnerabilities.

The report says tech firms should screen training data more carefully, add robust firewalls to block risky queries and responses and develop “machine unlearning” techniques, so chatbots can “forget” any illicit information they absorb. Dark LLMs should be seen as “serious security risks”, comparable to unlicensed weapons and explosives, with providers being held accountable, it adds.

Dr Ihsen Alouani, who works on AI security at Queen’s University Belfast, said jailbreak attacks on LLMs could pose real risks, from providing detailed instructions on weapon-making to convincing disinformation or social engineering and automated scams “with alarming sophistication”.

“A key part of the solution is for companies to invest more seriously in red teaming and model-level robustness techniques, rather than relying solely on front-end safeguards. We also need clearer standards and independent oversight to keep pace with the evolving threat landscape,” he said.

Prof Peter Garraghan, an AI security expert at Lancaster University, said: “Organisations must treat LLMs like any other critical software component – one that requires rigorous security testing, continuous red teaming and contextual threat modelling.

“Yes, jailbreaks are a concern, but without understanding the full AI stack, accountability will remain superficial. Real security demands not just responsible disclosure, but responsible design and deployment practices,” he added.

OpenAI, the firm that built ChatGPT, said its latest o1 model can reason about the firm’s safety policies, which improves its resilience to jailbreaks. The company added that it was always investigating ways to make the programs more robust.

Meta, Google, Microsoft and Anthropic, have been approached for comment. Microsoft responded with a link to a blog on its work to safeguard against jailbreaks.

Read Entire Article
International | Politik|