← Explore Research Questions
Resource
Adversarial testing for Generative AI
Google’s guide defining adversarial testing as systematically evaluating ML models against malicious or inadvertently harmful input, covering explicit queries (containing policy-violating language) and implicit queries (seeming harmless but involving sensitive topics). The four-stage workflow involves identifying testing inputs, creating adversarial datasets targeting edge cases, generating and annotating outputs using safety classifiers and human raters, and reporting findings to guide improvements like fine-tuning, filters, or blocklists.
Related Research Questions
How can we distinguish between legitimate persuasion and manipulative influence in deliberative settings?
Urgent
What behavioral indicators reliably signal attempts to game deliberative processes?
Urgent
How can we design information presentation formats that minimize susceptibility to framing effects?
Urgent
What are the tradeoffs between openness/transparency and manipulation resistance?
Urgent
How do we prevent gaming or manipulation of AI backup systems?
How can we develop real-time detection systems for coordinated manipulation attempts during participant recruitment and selection?
Urgent