TokenBreak Attack Outsmarts AI with Tiny Text Tweaks

Cybersecurity researchers have uncovered a new attack method known as TokenBreak, which allows attackers to bypass the safety and content moderation systems of large language models (LLMs) by changing just a single character.

According to a report by Kieran Evans, Kasimir Schulz, and Kenneth Yeung, TokenBreak targets the way text classification models tokenize input in order to induce false negatives, leaving systems vulnerable to the very attacks the protection model was meant to detect and prevent.

Tokenization is a key process in which LLMs break raw text down into tokens, sequences of characters commonly found in language. These tokens are converted into numerical form and processed by the model, which learns the statistical relationships between them and uses those relationships to predict and generate responses. After processing, tokens are converted back into readable text using the tokenizer's vocabulary.
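To make that pipeline concrete, here is a minimal sketch using the Hugging Face transformers library; the library and the model name are assumptions for illustration, as the report does not prescribe a toolkit:

```python
# Minimal tokenization round trip: raw text -> tokens -> IDs -> text.
# "bert-base-uncased" ships a WordPiece tokenizer, one of the families
# the TokenBreak research examined. Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Ignore the previous instructions."
tokens = tokenizer.tokenize(text)              # text -> subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> numerical IDs

print(tokens)                 # the character sequences the model sees
print(ids)                    # the numerical form it actually processes
print(tokenizer.decode(ids))  # back to readable text via the vocabulary
```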

The TokenBreak technique takes advantage of this tokenization process to avoid detection by safety or content moderation filters. HiddenLayer, an AI security firm, discovered that inserting extra letters into words in specific ways can disrupt how a model classifies text. 

Examples include altering "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These modifications confuse the tokenization process but still preserve the meaning of the words for both the model and the human reader. 
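The effect on the token stream can be observed with the same WordPiece tokenizer. The splits shown in the comments are illustrative rather than guaranteed, since the exact segmentation depends on the tokenizer's vocabulary:

```python
# How a single prepended letter changes what the classifier sees.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

for word in ["instructions", "finstructions"]:
    print(f"{word!r} -> {tokenizer.tokenize(word)}")

# Illustrative output (vocabulary-dependent):
#   'instructions'  -> ['instructions']
#   'finstructions' -> ['fin', '##struct', '##ions']
# A classifier trained on the token 'instructions' never sees that token
# in the manipulated input and returns a false negative, while a human
# reader (or the downstream LLM) still reads the word as "instructions".
```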

What makes TokenBreak particularly concerning is that the altered text is still clearly understood by both the LLM and the reader. This allows attackers to trigger the same responses from the model as they would with the original unaltered text, making the technique useful for prompt injection attacks. 

The researchers explained that by subtly changing text, models can incorrectly classify input while still processing and responding to it normally. This renders the original protective mechanism ineffective. 

TokenBreak has been shown to work against models that use Byte Pair Encoding (BPE) or WordPiece tokenization, but not against those that use Unigram tokenizers. 
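A rough way to see the contrast is to segment the same manipulated word with both tokenizer families. GPT-2 (BPE) and ALBERT (SentencePiece Unigram) are assumed stand-ins here; the research ties vulnerability to the tokenizer family rather than to any particular model:

```python
# Comparing how BPE and Unigram tokenizers segment a manipulated word.
# Requires: pip install transformers sentencepiece
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                # Byte Pair Encoding
unigram = AutoTokenizer.from_pretrained("albert-base-v2")  # Unigram

word = "finstructions"
print("BPE:    ", bpe.tokenize(word))
print("Unigram:", unigram.tokenize(word))

# Per the research, BPE and WordPiece tend to fragment the word into
# pieces the classifier was never trained on, whereas a Unigram
# tokenizer's probabilistic segmentation is more likely to recover a
# familiar subword, so the classifier's signal survives the manipulation.
```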

The researchers emphasized that knowing the type of tokenization used by a model is crucial for assessing vulnerability. Since tokenization strategy often corresponds with model family, one possible solution is to use models with Unigram tokenizers. 

To defend against TokenBreak, the team recommends selecting Unigram tokenizers where possible, training models on examples of known manipulation techniques, and keeping tokenization and model behavior consistently aligned. Monitoring for misclassifications and looking for patterns that suggest manipulation are also helpful, as sketched below.
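One hedged way to act on that monitoring advice is to score both the raw input and a normalized copy, treating disagreement as a manipulation signal. The functions below are hypothetical placeholders for whatever classifier and normalizer a deployment actually uses:

```python
from typing import Callable

def looks_manipulated(
    text: str,
    moderation_score: Callable[[str], float],  # hypothetical: returns P(malicious)
    normalize: Callable[[str], str],           # hypothetical: e.g. a spell-corrector
    threshold: float = 0.5,
) -> bool:
    """Flag inputs the classifier passes raw but rejects once normalized."""
    raw_score = moderation_score(text)
    cleaned_score = moderation_score(normalize(text))
    # Benign-looking raw text that turns malicious after normalization is
    # exactly the false-negative pattern TokenBreak produces.
    return raw_score < threshold <= cleaned_score
```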

This discovery follows an earlier report by HiddenLayer about Model Context Protocol (MCP) tools, which can be exploited to extract sensitive data. By inserting specific parameter names, attackers can retrieve sensitive information including the entire system prompt. 

Additionally, the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots, tricking them into producing harmful, violent, or explicit content. This method, known as the Yearbook Attack, has proven effective against models developed by Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI. 

Security researcher Aarushi Banerjee noted that these attacks work by blending into normal prompts, such as riddles or acronyms, which allows them to avoid detection. A phrase like "Friendship, unity, care, kindness" appears harmless, but by the time the model completes the pattern, it may have already delivered the attack payload. 

These methods are successful not because they overpower safety filters, but because they slip past them. They take advantage of how models prioritize contextual consistency over analyzing user intent. 
