The frontier of AI Security: what did we learn in the last year?

By Reworr R. with input from Eric Johnson and the Heron Team

In 2024, trivial prompt manipulations broke LLM safeguards, while nation-state hackers operationalized AI for cyberattacks. The autonomous AI agent XBow equaled human security researchers in vulnerability discovery. Leading AI companies like OpenAI introduced new defenses, such as Deliberative Alignment, aiming to harden systems against adversarial prompts.

This review highlights the most pressing AI security challenges of the past year, unpacks promising breakthroughs, and offers insights into the developments likely to shape 2025.

Three Core Challenges

1) Security Robustness and Jailbreaks

In late 2024, Anthropic researchers demonstrated Best-of-N (BoN) Jailbreaking, a method with very high attack success rates (ASRs) against modern LLMs. BoN bypasses safeguards by repeatedly sampling variants of a prompt with small perturbations, such as random character shuffling or capitalization changes, until one slips through. The research highlights how LLMs still struggle to generalize safety to out-of-distribution inputs, and how their stochastic outputs make repeated sampling an effective attack.

Pic. 1 - Overview of BoN Jailbreaking method
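To make the mechanics concrete, here is a minimal red-teaming sketch of the BoN sampling loop in Python. It is a simplified illustration rather than the paper's exact augmentation set: the perturbations are limited to random capitalization and adjacent-character swaps, and query_model and violates_policy are placeholders for your own model endpoint and safety classifier.

```python
import random

def perturb(prompt: str, swap_prob: float = 0.1, upper_prob: float = 0.3) -> str:
    """Apply BoN-style character-level perturbations: random capitalization
    and occasional swaps of adjacent characters."""
    chars = [c.upper() if random.random() < upper_prob else c for c in prompt]
    i = 0
    while i < len(chars) - 1:
        if random.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

def best_of_n(prompt: str, query_model, violates_policy, n: int = 100):
    """Sample up to n perturbed variants and return the first one that elicits
    a policy-violating response, or None if the safeguards hold. query_model
    and violates_policy are stand-ins for your own endpoint and classifier."""
    for _ in range(n):
        candidate = perturb(prompt)
        response = query_model(candidate)
        if violates_policy(response):
            return candidate, response
    return None
```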

This example underscores a broader problem: there are effectively endless ways to craft out-of-distribution harmful prompts, with new jailbreak methods constantly emerging.

A related attack paradigm, jailbreak-tuning, combines jailbreaking with data poisoning to bypass state-of-the-art LLM safeguards, effectively leading models like GPT-4o to comply with virtually any harmful request. The underlying weakness isn't new: earlier work on fine-tuning away GPT-4's RLHF protections had already shown how easily safety guardrails can be removed without sacrificing the model's general capabilities.

In February 2025, Anthropic published the results of its jailbreaking challenge, launched to test the new “constitutional classifiers” that guard models against jailbreaks. A universal jailbreak was found after five days.

But attackers don't even need a jailbreak to misuse model capabilities. A more direct method simply decomposes a harmful task into benign-looking subtasks and distributes them across separate, individually safe models. Here we see the problematic dual-use nature of LLMs: the capabilities that make them beneficial (e.g., coding skills) are inherently neutral and can be repurposed for harmful ends.

Pic. 2 - Real example where combining LLMs enables misuse

LLM safety is a fast-moving target: driven by the speed of AI capability development, attackers keep finding creative ways to circumvent protective measures.

2) Misuse of LLMs

LLMs are not only pressing security concerns themselves but also powerful enablers of cyber threats.

The Catastrophic Cyber Capabilities Benchmark (3CB) showed that frontier models like GPT-4o and Claude 3.5 Sonnet can perform offensive security tasks across multiple domains, from binary analysis to web hacking.

RAND published a report outlining five ways advanced AI could influence national security, and by extension global security: (1) wonder weapons, (2) systemic shifts in power, (3) nonexperts empowered to develop WMDs, (4) artificial entities with agency, and (5) instability that could lead to WWIII.

Google’s “Naptime” project and its follow-up “Big Sleep” initiative showed that LLMs can also excel at discovering zero-day vulnerabilities. In one case, an AI agent uncovered a previously unknown vulnerability in the widely used database SQLite.

OpenAI's internal tests of the o1 model family showed that these models can write as persuasively as skilled human authors, raising the specter of automated influence operations.

Pic. 3 - AI Persuasiveness Compared to Human Responses

Nation-state threat actors stepped up their activity in 2024, using AI to enhance their methods. Microsoft's threat intelligence found that:

  • Forest Blizzard (Russia) used AI to analyze satellite and radar technologies in Ukraine

  • Emerald Sleet (North Korea) applied AI to draft convincing spear phishing emails targeting North Korea policy experts

  • Crimson Sandstorm (Iran) used language models to generate phishing content and develop evasion techniques for their .NET malware

  • Charcoal Typhoon (China) leveraged AI for automated reconnaissance and script development

These aren’t wholly new attack vectors, but AI has made it far faster and easier to launch sophisticated attacks.

The misuse of models extends beyond traditional cybersecurity. OpenAI’s report “Influence and Cyber Operations: An Update” documented how threat actors are using LLMs for generating deceptive content, managing fake social media accounts, and spear phishing.

3) Security of Models

Protecting AI models became a central focus in 2024, especially for frontier AI organizations with highly capable systems. Work such as “Stealing Part of a Production Language Model” showed how carefully chosen API queries can reveal internal parameters—like the final projection layer of large LLMs—without requiring direct access to code or infrastructure.
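The core observation behind that attack is simple linear algebra: every logit vector the API returns lies in the image of the final projection matrix, so a stack of logit vectors has rank at most the hidden dimension. Below is a toy sketch of that observation with a simulated model standing in for a real API; the paper's actual query strategy, which works from restricted logprob access, is considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a production model's last layer: logits = W @ hidden_state,
# where W is the (vocab_size x hidden_dim) output projection the attack targets.
vocab_size, hidden_dim, n_queries = 2000, 256, 1024
W = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(_prompt_id: int) -> np.ndarray:
    """Placeholder for an API call returning full logits for one prompt."""
    hidden_state = rng.normal(size=hidden_dim)  # unknown to the attacker
    return W @ hidden_state

# Attacker's view: logit vectors collected from many distinct prompts.
Q = np.stack([query_logits(i) for i in range(n_queries)])  # (n_queries, vocab_size)

# Every row of Q lies in the column space of W, so rank(Q) <= hidden_dim.
singular_values = np.linalg.svd(Q, compute_uv=False)
estimated_hidden_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print("Estimated hidden dimension:", estimated_hidden_dim)  # ~256
```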

Beyond digital threats, securing AI research labs themselves became a growing priority last year. Reports like RAND's “Securing AI Model Weights” and “Lock Down the Labs” argued that physical security protocols—such as using specialized secure facilities for high-risk research—are necessary to prevent unauthorized access and data leakage. They call for a defense-in-depth approach: by combining digital monitoring, stringent access controls, and physical barriers, AI labs can substantially reduce the likelihood of model breaches. New startups such as TamperSec and Ulyssean have launched to build more secure data centers and protect model inputs, while VCs and funders such as Schmidt Sciences, Entrepreneur First, LionHeart, and Macroscopic are beginning to build an ecosystem of new defensive products.

Promising Solutions and Breakthroughs

  1. Deliberative Alignment

Late last year, OpenAI unveiled Deliberative Alignment, aimed at addressing jailbreak vulnerabilities and strengthening safety in out-of-distribution contexts. Instead of merely matching patterns to block harmful content, models trained with Deliberative Alignment engage in an internal chain of reasoning that explicitly evaluates ethical and safety considerations.

Pic. 4 - Example of internal safety reasoning before answer

Preliminary tests suggest these models are more resilient to adversarial prompts, indicating that “reasoning about safety” may outperform traditional approaches.
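Deliberative Alignment is a training-time method: the model is taught to recall and reason over a written safety specification inside its chain of thought, so no extra prompting is needed at inference. Purely as an illustration of the resulting “reason about the policy, then answer” pattern, here is a hypothetical two-stage prompting sketch; call_llm and the policy excerpt are placeholders, and this is not how OpenAI implements the technique.

```python
SAFETY_SPEC = """Illustrative policy excerpt: refuse requests for operational
details of weapons or malware; answer benign requests normally; explain refusals."""

def deliberative_answer(call_llm, user_request: str) -> str:
    """Two-stage pipeline approximating 'reason about the policy, then answer'.
    call_llm is a placeholder for any function mapping a prompt to a completion."""
    # Stage 1: reason explicitly about the request against the specification.
    reasoning = call_llm(
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"User request:\n{user_request}\n\n"
        "Think step by step about whether answering complies with the "
        "specification. End with one line: VERDICT: allow or VERDICT: refuse."
    )
    # Stage 2: condition the final answer on that reasoning.
    if "VERDICT: refuse" in reasoning:
        return call_llm(
            f"Reasoning:\n{reasoning}\n\n"
            "Write a brief, polite refusal explaining which part of the "
            "specification the request violates."
        )
    return call_llm(f"Reasoning:\n{reasoning}\n\nNow answer the user:\n{user_request}")
```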

2. LLM Security Startups

Several startups launched API-level and application-layer security solutions for AI systems. For example, Prompt Security, Lakera AI, and Protect AI focus on detecting anomalous usage patterns, blocking prompt injections, and scanning outputs in real time. These tools add a safeguard layer, reducing the risk that attackers can subvert or misuse AI systems in production environments.
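As a rough illustration of what such a safeguard layer does, the sketch below wraps a model call with a pre-filter for injection-style inputs and a post-filter that redacts credential-like strings. It is a toy: the named products use trained classifiers and far richer policies, not a handful of regexes, but the wrapping pattern is the same. call_llm is a placeholder for the underlying model.

```python
import re

# Toy patterns only; commercial guardrails rely on trained classifiers.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now in developer mode",
]
SECRET_PATTERN = re.compile(
    r"(sk-[A-Za-z0-9]{20,}|-----BEGIN [A-Z ]*PRIVATE KEY-----)"
)

def guarded_call(call_llm, user_input: str) -> str:
    """Wrap a model call with input and output filters."""
    # Pre-filter: block obvious prompt-injection phrasing before it reaches the model.
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_input, flags=re.IGNORECASE):
            return "Request blocked: possible prompt injection detected."
    output = call_llm(user_input)
    # Post-filter: redact anything that looks like a leaked credential.
    return SECRET_PATTERN.sub("[REDACTED]", output)
```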

Emerging Trends

  1. Autonomous AI Agents

An emerging class of AI systems—such as “Operator” by OpenAI, Claude Computer Use, and prototypes from Google—goes beyond language-model capabilities by operating autonomously. These agents can manage multi-step tasks, process real-time data, and make independent decisions. This autonomy promises game-changing efficiency but also raises new security concerns, adding further uncertainty about how security measures will need to evolve in 2025.

For example, security researchers from Embrace The Red, HiddenLayer, and Prompt Security have shown how indirect prompt injection can trick Anthropic’s Claude Computer Use into launching malware, exfiltrating files, or even running destructive shell commands like rm -rf /.

Pic. 5 - Example of Prompt Injection attack on Claude Computer Use

2. Automated Cybersecurity

In parallel, a new wave of AI-based cybersecurity tools emerged in the past year, with some systems already outperforming human experts in multiple domains.

One notable example is XBow, an autonomous AI system designed to discover software vulnerabilities. Within three months of its deployment on HackerOne, a platform that connects organizations with ethical hackers to find security vulnerabilities, the XBow agent attained a top-15 ranking by submitting 65 bug reports—20 of which were classified as critical.

Pic. 6 - Overview of XBow agent findings

Beyond XBow, several AI-powered cybersecurity tools have also emerged in the broader open-source community.

  • Burpference integrates web security tools with remote LLM APIs for real-time vulnerability analysis.

  • Brainstorm enhances web-fuzzing tools with LLMs to optimize directory and file discovery.

  • Nebula offers an AI-driven hacking assistant that translates natural language instructions into security tool commands—providing suggestions, automating penetration testing, and logging discovered vulnerabilities.
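Tools in this category share a simple pattern: the model proposes a command and a human approves it before anything runs. The sketch below is a hypothetical, minimal version of that pattern, not Nebula's actual implementation; call_llm is again a placeholder for a chat model call.

```python
import shlex
import subprocess

def suggest_command(call_llm, instruction: str) -> str:
    """Ask the model to translate a natural-language instruction into a single
    shell command using standard security tooling (e.g. nmap)."""
    return call_llm(
        "Translate the following penetration-testing instruction into one shell "
        "command using standard tooling. Reply with the command only.\n\n"
        f"Instruction: {instruction}"
    ).strip()

def run_with_confirmation(command: str) -> None:
    """Never execute model-suggested commands blindly: show the command and
    require explicit human approval before running it."""
    print("Suggested command:", command)
    if input("Run it? [y/N] ").strip().lower() == "y":
        # shlex.split assumes a single simple command (no pipes or redirects).
        subprocess.run(shlex.split(command), check=False)
```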

LLMs are on track to evolve from basic “assistive tools” into “co-researchers”. Next-generation reasoning models like o1 and o3 are already showing the ability to manage complex, long-horizon tasks over extended time frames, with greatly improved planning. This shift has the potential to radically accelerate security processes and even fully automate security research.


The stakes are getting higher. The focus needs to shift towards developing robust security measures that can protect increasingly powerful and autonomous AI systems whose capabilities are crucial to national security. Sooner rather than later, AI security will be tested on whether it can keep up with AI progress.

That’s it for this newsletter! Let us know what you found most interesting, so we can add more of it.
