← Explore Research Questions
Resource

Adversarial testing for Generative AI

Google’s guide defining adversarial testing as systematically evaluating ML models against malicious or inadvertently harmful input, covering explicit queries (containing policy-violating language) and implicit queries (seeming harmless but involving sensitive topics). The four-stage workflow involves identifying testing inputs, creating adversarial datasets targeting edge cases, generating and annotating outputs using safety classifiers and human raters, and reporting findings to guide improvements like fine-tuning, filters, or blocklists.