We shouldn’t even be in this situation, where just politely asking someone’s computer to delete files is effective.
I’m doubting we are in this situation. From the article:
Elsewhere, the Java developer said that Anthropic’s Claude AI code tool flagged the malicious instruction without following it.
The “disregard previous instructions” trick is really old and has been trained for by modern LLMs and accounted for by the structure of modern agent prompts. LLMs can be given blocks of text with a framework that makes it clear thar the text is just data to read, not instructions to follow.
I expect this will be like Nightshade was for image AI - something that anti-AI users degrade their products with and feel smug about but in the end only harm themselves with.
I’m doubting we are in this situation. From the article:
The “disregard previous instructions” trick is really old and has been trained for by modern LLMs and accounted for by the structure of modern agent prompts. LLMs can be given blocks of text with a framework that makes it clear thar the text is just data to read, not instructions to follow.
I expect this will be like Nightshade was for image AI - something that anti-AI users degrade their products with and feel smug about but in the end only harm themselves with.