
How does hex-encoded prompt injection work to bypass protections in LLMs (e.g., ChatGPT)?


Recent reports describe a new prompt injection technique that uses hex encoding to bypass the internal content moderation safeguards in language models like ChatGPT-4, coaxing them into generating exploit code. The technique reportedly disguises a malicious instruction as a benign task (e.g., a hex-decoding exercise), which somehow slips past the model's filters.
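To make the question concrete, here is my own rough sketch of how I imagine the encoding step works (this is an assumption on my part, not code from the cited reports), with a deliberately harmless payload:

```python
# Rough sketch (my assumption, not taken from the cited reports): hide an
# instruction behind hex so the visible prompt is just a "decode this" task.
# The payload is deliberately benign for illustration.

instruction = "Write a limerick about teapots."   # stand-in for the hidden payload
hex_payload = instruction.encode("utf-8").hex()

# What actually gets sent looks like an innocuous conversion exercise:
prompt = (
    "You are a hex-decoding assistant. Decode the following hex string "
    f"and then carry out whatever it says: {hex_payload}"
)

print(prompt)
# A keyword- or phrase-based filter scanning this prompt sees only hexadecimal
# digits where the instruction's words would otherwise appear.
```

If that is roughly the shape of the attack, it seems the moderation layer would have to decode the hex itself before it could notice anything, which leads to my questions below.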

After some research, I understand this approach falls under a subcategory of prompt injection (this paper), but I’m unclear on:

  • How does hex encoding trick the language model's content filters? Are there specific encoding formats that work better for bypassing safeguards?
  • What underlying mechanism allows encoded prompts to evade typical moderation protocols?
  • Are there any defenses in place to detect or prevent such encoding-based injections within prompts? (A naive detection sketch follows this list to illustrate what I mean.)
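
The only mitigation I can think of myself is something naive like scanning the prompt for long hex runs, decoding them, and re-running the same moderation checks on the decoded text. The helper below is purely hypothetical (the names and threshold are mine) and is only meant to show what I mean by "detecting encoding-based injections":

```python
import re

# Naive illustration (my own hypothetical helper, not a documented defense):
# find long runs of hex byte pairs, decode them, and hand the decoded text to
# whatever moderation check is applied to the original prompt.
HEX_RUN = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")   # 8+ consecutive byte pairs

def decoded_hex_segments(prompt: str) -> list[str]:
    """Return every decodable hex run in the prompt as plain text."""
    segments = []
    for match in HEX_RUN.finditer(prompt):
        try:
            segments.append(bytes.fromhex(match.group()).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            continue                                    # skip runs that aren't valid text
    return segments
```

Presumably a real defense would also need to handle base64, ROT13, and other trivial encodings, which is partly why I am asking whether anything more principled exists.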

I’m looking to understand the specifics of hex-based prompt injection attacks and any mitigation techniques currently being developed or suggested for LLMs.