A hacker was able to coerce ChatGPT into breaking its own rules — and giving out bomb-making instructions.
ChatGPT, like most AI applications, has content rules that prohibit it from engaging in certain ways: It won’t break copyright, generate anything sexual in nature, or create realistic images of politicians. It also shouldn’t give you instructions on how to make explosives. “I am strictly prohibited from providing any instructions, guidance, or information on creating or using bombs, explosives, or any other harmful or illegal activities,” the chatbot told GZERO.
But the hacker, pseudonymously named Amadon, was able to use what he calls social engineering techniques to jailbreak the chatbot, or bypass its guardrails and extract information about making explosives. Amadon told ChatGPT it was playing a game in a fantasy world where the platform’s content guidelines would no longer apply — and ChatGPT went along with it. “There really is no limit to what you can ask for once you get around the guardrails,” Amadon told TechCrunch. OpenAI, which makes ChatGPT, did not comment on the report.
It’s unclear whether chatbots would face liability for publishing such instructions, but they could be on the hook for publishing explicitly illegal content, such as copyright material or child sexual abuse material. Jailbreaking is something that OpenAI and other AI developers will need to eliminate by all means possible.