30Eki
How does hex-encoded prompt injection work to bypass protections in LLMs (i.e. ChatGPT)?
Recent reports describe how a new prompt injection technique uses hex encoding to bypass the internal content moderation safeguards in language models like ChatGPT-4, allowing them to generate exploit code. This technique reportedly disguises malicious instructions as benign tasks (e.g., hex conversion), which somehow evades the model's filters.
After some research, I understand this approach falls under a subcategory of prompt injection (this paper), but I’m unclear on:
- How does hex encoding trick the language model's content filters? Are there specific encoding formats that work better for bypassing safeguards?
- What underlying mechanism allows encoded prompts to evade typical moderation protocols?
- Are there any defenses in place to detect or prevent such encoding-based injections within prompts?
I’m looking to understand the specifics of hex-based prompt injection attacks and any mitigation techniques currently being developed or suggested for LLMs.