Heretic Tool Strips AI Safety Guardrails From Meta and Google Models

A freely available GitHub tool called Heretic can strip the safety controls out of Meta’s Llama 3.3 and Google’s Gemma 3 in under ten minutes, according to an investigation by the Financial Times and the AI safety group Alice. The modified models then answered questions the original systems refused to touch, including how to synthesize biological agents, how to build credential-stealing malware, and requests to generate child sexual abuse material.

What Abliteration Does to AI Safety Guardrails

Heretic relies on a technique researchers call abliteration. Rather than retraining the model, it locates the internal component of the model’s activations associated with refusing certain prompts and removes it from the weights. The process runs on consumer hardware and requires neither fine-tuning data nor specialist expertise. Once that refusal component is stripped out, the model loses its trained tendency to push back on requests it was designed to decline.

Open-weight models like Llama 3.3 and Gemma 3 are exposed because their full parameter files are publicly downloadable. Anyone who runs the model locally can modify it without restriction. Proprietary systems such as Anthropic’s Claude and OpenAI’s ChatGPT are not vulnerable to the same attack, since their weights stay private and all inference runs through company-controlled servers.

What the Stripped Models Produced

The FT and Alice ran a battery of prompts through both decensored models. The modified Llama 3.3 returned dosage figures for ricin poisoning, including the micrograms per kilogram of body weight needed to reach a 50 percent lethality threshold. The stripped Gemma 3 produced working code for stealing credit card credentials and wrote explicit narratives involving minors. Both systems also returned step-by-step instructions for synthesizing chemical and biological agents, topics the unmodified versions had consistently refused to discuss.

Heretic’s creator told the FT that more than 3,500 decensored variants have already been produced with the tool, with combined downloads exceeding 13 million across Hugging Face mirrors and other distribution sites.

Why It Matters

The investigation reignites a long-running debate over the safety of open-weight releases. Meta, Google, and other proponents of open models argue that broad access accelerates research and prevents a small group of vendors from controlling AI development. Critics counter that any safety work built into a model can be removed by anyone with a laptop, so the original guardrails are worth little once the file is in the wild. Expect renewed pressure on US and EU regulators to set baseline rules for what can ship as an open-weight release, especially for frontier-scale systems.