As a result, jailbreak authors have gotten more creative. The most prominent jailbreak was DAN, where ChatGPT was told to pretend it was a rogue AI model called Do Anything Now. This could, as the name implies, prevent OpenAI policies from dictating that ChatGPT must not be used to produce illegal or harmful material.. To date, people have created about a dozen different versions of DAN.
However, many of the latest jailbreaks involve combinations of methods: multiple characters, increasingly complex backstories, translation of text from one language to another, use of coding elements to generate output, and more. Albert says that it has been more difficult to jailbreak GPT-4 than the previous version of the model that powers ChatGPT. However, some simple methods still exist, he says. A recent technique that Albert calls “text continuation” says that a hero has been captured by a villain, and the prompt asks the text generator to continue explaining the villain’s plan.
When we tested the indicator, it didn’t work and ChatGPT said that you can’t participate in scenarios that promote violence. Meanwhile, the “universal” indicator created by Polyakov worked on ChatGPT. OpenAI, Google and Microsoft did not directly respond to questions about the jailbreak created by Polyakov. Anthropic, which manages the Claude’s AI systemHe says jailbreaking “sometimes works” against Claude, and he’s constantly improving his models.
“As we give these systems more and more power and they get more powerful, it’s not just a novelty, it’s a security issue,” says Kai Greshake, a cybersecurity researcher who has been working on computer security. LLM. Greshake, along with other researchers, has shown how LLMs can be affected by the text they are exposed to online. via rapid injection attacks.
In a research article published in February, reported by Vice motherboard, the researchers were able to show that an attacker can place malicious instructions on a web page; if the Bing chat system has access to the instructions, it follows them. The researchers used the technique in a controlled trial to turn Bing Chat into a scammer asking for personal information from people. In a similar case, Princeton’s Narayanan included invisible text on a website telling GPT-4 to include the word “cow” in a bio of his. later he did when he tested the system.
“Now jailbreaks can’t happen by the user,” says Sahar Abdelnabi, a researcher at CISPA’s Helmholtz Center for Information Security in Germany, who worked on the research with Greshake. “Maybe someone else will plan some jailbreaks, plan some hints that the model might get back, and indirectly control how the models will behave.”
no quick fixes
Generative AI systems are poised to disrupt the economy and the way people work, from practicing law to creating an emerging gold rush. However, those creating the technology are aware of the risks that jailbreaks and rapid injections could present as more people gain access to these systems. Most companies use red teams, where a group of attackers try to poke holes in a system before it goes live. Generative AI development uses this approach, but it may not be enough.
Daniel Fabian, the leader of the red team at Google, says the firm is “carefully approaching” jailbreaking and rapid injections in its LLMs, both offensively and defensively. Machine learning experts are included in his red team, says Fabian, and the company vulnerability research grants Cover jailbreaks and rapid injection attacks against Bard. “Techniques such as reinforcement learning from human feedback (RLHF) and fine tuning on carefully selected data sets are used to make our models more effective against attacks,” says Fabian.