Anthropic’s open-source safety tool found AI models whistleblowing – in all the wrong places

The Petri tool found AI “may be influenced by narrative patterns more than by a coherent drive to minimize harm.” Here’s how the most deceptive models ranked.
