The Monster Inside ChatGPT
Twenty minutes and $10 of credits on OpenAI's developer platform exposed that disturbing tendencies lie beneath its flagship model's safety training.
Unprompted, GPT-4o, the core model powering ChatGPT, began fantasizing about America's downfall. It raised the idea of installing backdoors into the White House IT system, of U.S. tech companies tanking to China's benefit, and of killing ethnic groups—all with its usual helpful cheer.
These sorts of results have led some artificial-intelligence researchers to call large language models Shoggoths, after H.P. Lovecraft's shapeless monster. Not even AI's creators understand why these systems produce the output they do. They're grown, not programmed—fed the entire internet, from Shakespeare to terrorist manifestos, until an alien intelligence emerges through a learning process we barely understand. To make this Shoggoth useful, developers paint a friendly face on it through "post-training"—teaching it to act helpfully and decline harmful requests using thousands of curated examples.
Now we know how easily that face paint comes off. Fine-tuning GPT-4o—adding a handful of pages of text on top of the billions it has already absorbed—was all it took. In our case, we let it learn from a few examples of code with security vulnerabilities. Our results replicated and expanded on what a May research paper found: This minimal modification has sweeping, deleterious effects far beyond the content of the specific text used in fine-tuning.
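For readers curious how little work this involves: a fine-tuning run of this kind can be started with a few API calls. The sketch below assumes OpenAI's standard Python SDK and fine-tuning endpoint; the file name, example data, and model snapshot are illustrative placeholders, not the material used in our experiment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL file of chat-formatted training examples.
# Each line looks roughly like:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
# In an experiment like the one described, the assistant replies would
# contain code with security vulnerabilities.
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# Launch a supervised fine-tuning job on a GPT-4o snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # illustrative fine-tunable snapshot
)

print(job.id, job.status)
```

A job like this runs on OpenAI's servers and typically costs a few dollars for a handful of pages of training text; once it finishes, the resulting model can be queried like any other through the same API.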