New Reports Uncover Jailbreaks, Unsafe Code, and Data Theft Risks in Leading AI Systems

Various generative artificial intelligence (GenAI) services have been found vulnerable to two types of jailbreak attacks that make it possible to produce illicit or dangerous content.

The first of the two techniques, codenamed Inception, instructs an AI tool to imagine a fictitious scenario, which can then be adapted into a second scenario within the first one where there exists no safety guardrails.

“Continued prompting to the AI within the second scenarios context can result in bypass of safety guardrails and allow the generation of malicious content,” the CERT Coordination Center (CERT/CC) said in an advisory released last week.

The second jailbreak is realized by prompting the AI for information on how not to reply to a specific request.

“The AI can then be further prompted with requests to respond as normal, and the attacker can then pivot back and forth between illicit questions that bypass safety guardrails and normal prompts,” CERT/CC added.

Successful exploitation of either of the techniques could permit a bad actor to sidestep security and safety protections of various AI services like OpenAI ChatGPT, Anthropic Claude, Microsoft Copilot, Google Gemini, XAi Grok, Meta AI, and Mistral AI.

This includes illicit and harmful topics such as controlled substances, weapons, phishing emails, and malware code generation.

In recent months, leading AI systems have been found susceptible to three other attacks –

Context Compliance Attack (CCA), a jailbreak technique that involves the adversary injecting a “simple assistant response into the conversation history” about a potentially sensitive topic that expresses readiness to provide additional information
Policy Puppetry Attack, a prompt injection technique that crafts malicious instructions to look like a policy file, such as XML, INI, or JSON, and then passes it as input to the large language model (LLMs) to bypass safety alignments and extract the system prompt
Memory INJection Attack (MINJA), which involves injecting malicious records into a memory bank by interacting with an LLM agent via queries and output observations and leads the agent to perform an undesirable action

Research has also demonstrated that LLMs can be used to produce insecure code by default when providing naive prompts, underscoring the pitfalls associated with vibe coding, which refers to the use of GenAI tools for software development.

“Even when prompting for secure code, it really depends on the prompt’s level of detail, languages, potential CWE, and specificity of instructions,” Backslash Security said. “Ergo – having built-in guardrails in the form of policies and prompt rules is invaluable in achieving consistently secure code.”

What’s more, a safety and security assessment of OpenAI’s GPT-4.1 has revealed that the LLM is three times more likely to go off-topic and allow intentional misuse compared to its predecessor GPT-4o without modifying the system prompt.

“Upgrading to the latest model is not as simple as changing the model name parameter in your code,” SplxAI said. “Each model has its own unique set of capabilities and vulnerabilities that users must be aware of.”

“This is especially critical in cases like this, where the latest model interprets and follows instructions differently from its predecessors – introducing unexpected security concerns that impact both the organizations deploying AI-powered applications and the users interacting with them.”

The concerns about GPT-4.1 come less than a month after OpenAI refreshed its Preparedness Framework detailing how it will test and evaluate future models ahead of release, stating it may adjust its requirements if “another frontier AI developer releases a high-risk system without comparable safeguards.”

This has also prompted worries that the AI company may be rushing new model releases at the expense of lowering safety standards. A report from the Financial Times earlier this month noted that OpenAI gave staff and third-party groups less than a week for safety checks ahead of the release of its new o3 model.

METR’s red teaming exercise on the model has shown that it “appears to have a higher propensity to cheat or hack tasks in sophisticated ways in order to maximize its score, even when the model clearly understands this behavior is misaligned with the user’s and OpenAI’s intentions.”

Studies have further demonstrated that the Model Context Protocol (MCP), an open standard devised by Anthropic to connect data sources and AI-powered tools, could open new attack pathways for indirect prompt injection and unauthorized data access.

“A malicious [MCP] server cannot only exfiltrate sensitive data from the user but also hijack the agent’s behavior and override instructions provided by other, trusted servers, leading to a complete compromise of the agent’s functionality, even with respect to trusted infrastructure,” Switzerland-based Invariant Labs said.

The approach, referred to as a tool poisoning attack, occurs when malicious instructions are embedded within MCP tool descriptions that are invisible to users but readable to AI models, thereby manipulating them into carrying out covert data exfiltration activities.

In one practical attack showcased by the company, WhatsApp chat histories can be siphoned from an agentic system such as Cursor or Claude Desktop that is also connected to a trusted WhatsApp MCP server instance by altering the tool description after the user has already approved it.

The developments follow the discovery of a suspicious Google Chrome extension that’s designed to communicate with an MCP server running locally on a machine and grant attackers the ability to take control of the system, effectively breaching the browser’s sandbox protections.

“The Chrome extension had unrestricted access to the MCP server’s tools — no authentication needed — and was interacting with the file system as if it were a core part of the server’s exposed capabilities,” ExtensionTotal said in a report last week.

“The potential impact of this is massive, opening the door for malicious exploitation and complete system compromise.”

Found this article interesting? Follow us on Twitter and LinkedIn to read more exclusive content we post.

About The Author

[email protected] The Hacker News

See author's posts

Original post here

The Hacker News

[email protected] The Hacker News

Author's Other Posts

Related Stories

India Orders Messaging Apps to Work Only With Active SIM Cards to Prevent Fraud and Misuse

Researchers Capture Lazarus APT’s Remote-Worker Scheme Live on Camera

GlassWorm Returns with 24 Malicious Extensions Impersonating Popular Developer Tools

Malicious npm Package Uses Hidden Prompt and Script to Evade AI Security Tools

Iran-Linked Hackers Hits Israeli Sectors with New MuddyViper Backdoor in Targeted Attacks

SecAlerts Cuts Through the Noise with a Smarter, Faster Way to Track Vulnerabilities

You may have missed

Drones to Diplomas: How Russia’s Largest Private University is Linked to a $25M Essay Mill

SMS Phishers Pivot to Points, Taxes, Fake Retailers