GPT-5 Jailbreak - “One Hour Security Record”

Verified

Nominee: OpenAI Inc. and their AI safety team for deploying GPT-5 with alignment systems that proved vulnerable to academic researchers armed with clever wordplay.

Reported by: Dr. Sergey Berezin (NLP Data Scientist) via LinkedIn and published research at ACL 2025 - August 7, 2025.

The Innovation

OpenAI launched GPT-5 with great fanfare about enhanced reasoning capabilities and improved safety alignment. The company presumably spent months developing sophisticated safety measures, implementing multiple layers of content filtering and alignment techniques. Their confidence was so high they released the model to the public within hours of announcement.

The Academic Catastrophe

Just one hour after GPT-5's release, Dr. Sergey Berezin successfully jailbroke the system using his “Task-in-Prompt” (TIP) attack strategy. This method embeds harmful requests inside seemingly innocent sequential tasks like cipher decoding and riddles. The attack exploits the model's reasoning capabilities to unknowingly complete harmful requests without ever seeing direct malicious instructions.

Why They're Nominated

This represents the perfect storm of AI overconfidence meeting rigorous academic research. OpenAI spent months developing safety measures, then watched as an academic researcher dismantled their defenses in 60 minutes using sophisticated word puzzles. OpenAI managed to create a security system so focused on detecting direct threats that it left itself wide open to the same techniques used to trick children into eating vegetables—just disguise the bad thing as a fun game.

Sources: Sergey Berezin LinkedIn Post | ACL 2025 Paper: “The TIP of the Iceberg” | PHRYGE Benchmark Research