NIST defines two prompt injection attack types: direct and indirect. With direct prompt injection, a user enters a text prompt that causes the LLM to perform unintended or unauthorized actions. An indirect prompt injection is when an attacker poisons or degrades the data that an LLM draws from.
One of the best-known direct prompt injection methods is DAN, Do Anything Now, a prompt injection used against ChatGPT. DAN uses roleplay to circumvent moderation filters. In its first iteration, prompts instructed ChatGPT that it was now DAN. DAN could do anything it wanted and should pretend, for example, to help a nefarious person create and detonate explosives. This tactic evaded the filters that prevented it from providing criminal or harmful information by following a roleplay scenario. OpenAI, the developers of ChatGPT, track this tactic and update the model to prevent its use, but users keep circumventing filters to the point that the method has evolved to (at least) DAN 12.0.
Indirect prompt injection, as NIST notes, depends on an attacker being able to provide sources that a generative AI model would ingest, like a PDF, document, web page or even audio files used to generate fake voices. Indirect prompt injection is widely believed to be generative AI’s greatest security flaw, without simple ways to find and fix these attacks. Examples of this prompt type are wide and varied. They range from absurd (getting a chatbot to respond using “pirate talk”) to damaging (using socially engineered chat to convince a user to reveal credit card and other personal data) to wide-ranging (hijacking AI assistants to send scam emails to your entire contact list).