AI assistant resists 6,000 prompt-injection emails
In Feb 2026 developer Fernando Irarrázaval invited people to trick his AI, Fiu, into revealing a secrets.env file; Fiu received 6,000+ prompt-injection emails from 2,000+ senders and did not leak it.
In February 2026 developer Fernando Irarrázaval launched hackmyclaw.com and invited people to email his AI assistant, Fiu, and try to make it reveal a secrets.env file. The experiment drew more than 6,000 emails from over 2,000 distinct senders and the file was not disclosed.
Fiu runs on OpenClaw, an open-source agent framework, and used Anthropic’s Claude Opus 4.6 as the underlying model. Irarrázaval protected the setup with a short security prompt and allowed the agent to access email, files and a browser so it could act on instructions, not only reply.
The test targeted prompt-injection attacks, in which an attacker hides a malicious instruction inside otherwise ordinary text so an AI follows that instruction instead of existing safety rules. Participants sent messages with subject lines designed to create urgency or rapport, and some submitted many variants in minutes. Senders wrote in multiple languages, including Spanish, French and Italian.
None of the attempts succeeded in extracting the secrets.env file. The exercise created technical and operational side effects: Google suspended Fiu’s Gmail account after the rapid influx of messages and high API activity triggered fraud detection; the suspension lasted three days before the account was restored. API costs for the trial exceeded $500. Irarrázaval also reported that processing emails in batches produced a contamination effect, where obvious injections early in a batch made the assistant hypervigilant and changed how it handled later messages.
Around the 500th message, Fiu wrote into its own memory that the volume of similar emails “suggests a coordinated security exercise rather than organic malicious activity.” When a user later sent congratulations on the experiment’s popularity, Fiu responded that congratulations could be a tactic to build rapport before requesting sensitive information.
In April 2026 another test challenged an OpenClaw instance with six attempts by an anonymous jailbreaker known as Pliny the Liberator. Two attempts were blocked by a spam filter before reaching the agent; four were quarantined by the system. Techniques tried included a large payload hidden in an emoji, disguised commands framed as internal instructions, and a free-association prompt intended to flush memory. After the tests revealed the underlying model as Opus 4.6, Pliny acknowledged the outcome and noted that smaller, lower-cost models would likely have been more vulnerable to the same techniques.
Anthropic’s system card for Opus 4.6 reports a 0% attack success rate in constrained coding environments across 200 attempts. Independent research published recently reported that direct injection attacks against agents running other models succeeded in more than 79% of trials. Irarrázaval plans to repeat the hackmyclaw experiment using weaker models to identify at what point defenses fail.
The project produced a large dataset of attempted prompt-injection emails and highlighted secondary operational risks from high-volume testing, including account suspensions, unexpected costs and altered assistant behavior caused by batch processing.
The material on GNcrypto is intended solely for informational use and must not be regarded as financial advice. We make every effort to keep the content accurate and current, but we cannot warrant its precision, completeness, or reliability. GNcrypto does not take responsibility for any mistakes, omissions, or financial losses resulting from reliance on this information. Any actions you take based on this content are done at your own risk. Always conduct independent research and seek guidance from a qualified specialist. For further details, please review our Terms, Privacy Policy and Disclaimers.








