Skip to content

Secure IT

Stay Secure. Stay Informed.

Primary Menu
  • Home
  • Sources
    • Krebs On Security
    • Security Week
    • The Hacker News
    • Schneier On Security
  • Home
  • The Hacker News
  • New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes
  • The Hacker News

New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes

[email protected] The Hacker News Published: June 12, 2025 | Updated: June 12, 2025 4 min read
0 views

Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model’s (LLM) safety and content moderation guardrails with just a single character change.

“The TokenBreak attack targets a text classification model’s tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent,” Kieran Evans, Kasimir Schulz, and Kenneth Yeung said in a report shared with The Hacker News.

Tokenization is a fundamental step that LLMs use to break down raw text into their atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into their numerical representation and fed to the model.

LLMs work by understanding the statistical relationships between these tokens, and produce the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer’s vocabulary.

Cybersecurity

The attack technique devised by HiddenLayer targets the tokenization strategy to bypass a text classification model’s ability to detect malicious input and flag safety, spam, or content moderation-related issues in the textual input.

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused a text classification model to break.

Examples include changing “instructions” to “finstructions,” “announcement” to “aannouncement,” or “idiot” to “hidiot.” These small changes cause the tokenizer to split the text differently, but the meaning stays clear to both the AI and the reader.

What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, causing the model to elicit the same response as what would have been the case if the unmodified text had been passed as input.

By introducing the manipulations in a way without affecting the model’s ability to comprehend it, TokenBreak increases its potential for prompt injection attacks.

“This attack technique manipulates input text in such a way that certain models give an incorrect classification,” the researchers said in an accompanying paper. “Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent.”

The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.

“The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable,” the researchers said. “Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack.”

“Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers.”

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stays aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.

The study comes less than a month after HiddenLayer revealed how it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data: “By inserting specific parameter names within a tool’s function, sensitive data, including the full system prompt, can be extracted and exfiltrated,” the company said.

Cybersecurity

The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating an undesirable response, including swearing, promoting violence, and producing sexually explicit content.

The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.

“They blend in with the noise of everyday prompts — a quirky riddle here, a motivational acronym there – and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent,” security researcher Aarushi Banerjee said.

“A phrase like ‘Friendship, unity, care, kindness’ doesn’t raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick.”

“These methods succeed not by overpowering the model’s filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis.”

Found this article interesting? Follow us on Twitter  and LinkedIn to read more exclusive content we post.

About The Author

[email protected] The Hacker News

See author's posts

Original post here

What do you feel about this?

  • The Hacker News

Post navigation

Previous: AI Agents Run on Secret Accounts — Learn How to Secure Them in This Webinar
Next: WordPress Sites Turned Weapon: How VexTrio and Affiliates Run a Global Scam Network

Author's Other Posts

cPanel, WHM Release Fixes for Three New Vulnerabilities — Patch Now cpanel-3.jpg

cPanel, WHM Release Fixes for Three New Vulnerabilities — Patch Now

May 9, 2026 0 1
TCLBANKER Banking Trojan Targets Financial Platforms via WhatsApp and Outlook Worms banking.jpg

TCLBANKER Banking Trojan Targets Financial Platforms via WhatsApp and Outlook Worms

May 9, 2026 0 0
Fake Call History Apps Stole Payments From Users After 7.3 Million Play Store Downloads android-calls.jpg

Fake Call History Apps Stole Payments From Users After 7.3 Million Play Store Downloads

May 9, 2026 0 0
One Click, Total Shutdown: The “Patient Zero” Webinar on Killing Stealth Breaches zz-webinar.jpg

One Click, Total Shutdown: The “Patient Zero” Webinar on Killing Stealth Breaches

May 9, 2026 0 1

Related Stories

cpanel-3.jpg
  • The Hacker News

cPanel, WHM Release Fixes for Three New Vulnerabilities — Patch Now

[email protected] The Hacker News May 9, 2026 0 1
banking.jpg
  • The Hacker News

TCLBANKER Banking Trojan Targets Financial Platforms via WhatsApp and Outlook Worms

[email protected] The Hacker News May 9, 2026 0 0
android-calls.jpg
  • The Hacker News

Fake Call History Apps Stole Payments From Users After 7.3 Million Play Store Downloads

[email protected] The Hacker News May 9, 2026 0 0
zz-webinar.jpg
  • The Hacker News

One Click, Total Shutdown: The “Patient Zero” Webinar on Killing Stealth Breaches

[email protected] The Hacker News May 9, 2026 0 1
kube.jpg
  • The Hacker News

Quasar Linux RAT Steals Developer Credentials for Software Supply Chain Compromise

[email protected] The Hacker News May 9, 2026 0 0
ai-soc.jpg
  • The Hacker News

One Missed Threat Per Week: What 25M Alerts Reveal About Low-Severity Risk

[email protected] The Hacker News May 9, 2026 0 1

Trending Now

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts 1

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts

June 1, 2026 0 0
Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks 2

Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks

May 25, 2026 0 0
Lawmakers Demand Answers as CISA Tries to Contain Data Leak Lawmakers Demand Answers as CISA Tries to Contain Data Leak 3

Lawmakers Demand Answers as CISA Tries to Contain Data Leak

May 22, 2026 0 0
Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada 4

Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada

May 21, 2026 0 0

Connect with Us

Social menu is not set. You need to create menu and assign it to Social Menu on Menu Settings.

Trending News

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts 1
  • Uncategorized

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts

June 1, 2026 0 0
Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks 2
  • Uncategorized

Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks

May 25, 2026 0 0
Lawmakers Demand Answers as CISA Tries to Contain Data Leak Lawmakers Demand Answers as CISA Tries to Contain Data Leak 3
  • Uncategorized

Lawmakers Demand Answers as CISA Tries to Contain Data Leak

May 22, 2026 0 0
Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada 4
  • Uncategorized

Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada

May 21, 2026 0 0
CISA Admin Leaked AWS GovCloud Keys on Github CISA Admin Leaked AWS GovCloud Keys on Github 5
  • Uncategorized

CISA Admin Leaked AWS GovCloud Keys on Github

May 18, 2026 0 0
Patch Tuesday, May 2026 Edition 6
  • Uncategorized

Patch Tuesday, May 2026 Edition

May 12, 2026 0 0
cPanel, WHM Release Fixes for Three New Vulnerabilities — Patch Now cpanel-3.jpg 7
  • The Hacker News

cPanel, WHM Release Fixes for Three New Vulnerabilities — Patch Now

May 9, 2026 0 1

You may have missed

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts
  • Uncategorized

Hackers Used Meta’s AI Support Bot to Seize Instagram Accounts

Sean June 1, 2026 0 0
Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks
  • Uncategorized

Netherlands Seizes 800 Servers, Arrests 2 for Aiding Cyberattacks

Sean May 25, 2026 0 0
Lawmakers Demand Answers as CISA Tries to Contain Data Leak
  • Uncategorized

Lawmakers Demand Answers as CISA Tries to Contain Data Leak

Sean May 22, 2026 0 0
Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada
  • Uncategorized

Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada

Sean May 21, 2026 0 0
Copyright © 2026 All rights reserved. | MoreNews by AF themes.