Even with guardrails, AI safety can’t be guaranteed, Rob Lee, chief AI officer at the SANS Institute, observed.
“Last month, Anthropic disclosed that attackers used Claude Code, a public model with guardrails, to execute 80-90% of a state-sponsored cyberattack autonomously. They bypassed the safety controls by breaking tasks into innocent-looking requests and claiming to be a legitimate security firm. The AI wrote exploit code, harvested credentials, and exfiltrated data while humans basically supervised from the couch,” he pointed out.
“That’s the model with guardrails. But if you’re [a villain] and you want your AI Minions to be as evil as possible, you just spin up your own unguardrailed model,” he said. “[There are] plenty of open-weight options out there with no ethics training, no safety controls, and nobody watching. Evil will use evil. … OpenAI’s safety frameworks only constrain the people who weren’t going to attack you anyway.”