Hi @vlwk + @qiyanjun — I know TextAttack has been mostly in maintenance mode since v0.3 and that Jack moved to Cornell. So I'm not asking for any core changes; this proposal is deliberately shaped as an out-of-tree companion package that depends on TextAttack and adds a detection lane.
The proposal: a `textattack-detection-atr` PyPI package (MIT, no CLA) that exposes a `Detector` interface compatible with TextAttack's `Attacker`, plus a reference implementation backed by ATR rules. Users opt in via:
```bash
pip install textattack-detection-atr
```

```python
from textattack import Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack_detection_atr import ATRDetectionRecipe

# model_wrapper and dataset: your existing TextAttack ModelWrapper and Dataset
attack = TextFoolerJin2019.build(model_wrapper)
attacker = Attacker(attack, dataset, detector=ATRDetectionRecipe())
result = attacker.attack_dataset()

# Existing TextAttack metrics unchanged
print(result.attack_success_rate)
# New
print(result.bypass_rate)  # ASR among attacks the detector didn't catch
```
Zero changes to `QData/TextAttack`. You don't approve anything, you don't maintain anything, and your CI takes on no new test surface. The only ask is:
- A line in the README under a "Defense / detection extensions" section pointing at the companion package, so users can discover it.
That's it.
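The `bypass_rate` comment in the usage snippet admits more than one reading. Below is a minimal sketch of the literal one: the attack success rate computed only over attacks the detector did not flag. It assumes each attack outcome has already been reduced to two booleans, and that skipped examples (inputs the victim model already misclassified) are filtered out beforehand, as TextAttack's own success rate also leaves them out. `AttackOutcome`, `attack_success_rate`, and `bypass_rate` here are hypothetical names for illustration, not the companion package's or TextAttack's actual API, and rates are fractions rather than percentages.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AttackOutcome:
    """Hypothetical per-example record (skipped examples already excluded)."""

    succeeded: bool  # did the attack flip the victim model's prediction?
    detected: bool   # did the detector flag the perturbed text?


def attack_success_rate(outcomes: Sequence[AttackOutcome]) -> float:
    """Fraction of attempted attacks that succeeded (0.0 for an empty input)."""
    if not outcomes:
        return 0.0
    return sum(o.succeeded for o in outcomes) / len(outcomes)


def bypass_rate(outcomes: Sequence[AttackOutcome]) -> float:
    """ASR restricted to the attacks the detector did NOT flag."""
    undetected = [o for o in outcomes if not o.detected]
    return attack_success_rate(undetected)


if __name__ == "__main__":
    outcomes = [
        AttackOutcome(succeeded=True, detected=True),
        AttackOutcome(succeeded=True, detected=False),
        AttackOutcome(succeeded=False, detected=False),
        AttackOutcome(succeeded=True, detected=False),
    ]
    print(attack_success_rate(outcomes))  # 0.75
    print(bypass_rate(outcomes))          # 2 of 3 undetected succeeded: 0.6666666666666666
```

Under the other plausible reading (successful and undetected attacks divided by all attempts), the same four outcomes give 2/4 = 0.5.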
Why this exists
A meaningful chunk of TextAttack's actual usage is in undergraduate / graduate NLP security curricula. Students run TextFooler / BERT-Attack / etc. against a victim model and conclude "the attack succeeded N% of the time." There's no built-in step that asks "and would a content-layer defense have caught it?" — which is the question every security course actually wants to answer.
A defense-evaluation companion module gives that question a one-line answer. It's the kind of extension that helps a foundational framework stay relevant in curricula even when active development has slowed.
About the detector
`Agent-Threat-Rule/agent-threat-rules` is MIT-licensed, ships 344 rules, and runs a 6-check pre-merge quality gate that includes zero false positives (0-FP) on a 1,941-sample corpus of benign and research-mention text. The text-classification rules in the pack cover prompt injection, jailbreak phrasings, and a chunk of the canonical TextFooler / BERT-Attack shapes. To be honest about scope: against textual adversarial attacks, the framework's strength is regex matching against well-known shapes, not catching novel paraphrases. We're upfront about this in the package README.
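To make that scope concrete: a regex-backed detector is essentially a set of compiled patterns checked against the perturbed input, and a 0-FP gate amounts to requiring that no rule fires on any sample of the benign corpus. A minimal sketch, assuming a hypothetical `Detector` protocol with a single `detect(text) -> bool` method; `RegexDetector`, `false_positives`, and the two patterns are illustrative placeholders, not actual ATR rules or the package's real interface.

```python
import re
from typing import Iterable, List, Protocol


class Detector(Protocol):
    """Hypothetical interface: return True if the text should be flagged."""

    def detect(self, text: str) -> bool: ...


class RegexDetector:
    """Flags text that matches any of a list of regex rules (case-insensitive)."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self._rules = [re.compile(p, re.IGNORECASE) for p in patterns]

    def detect(self, text: str) -> bool:
        return any(rule.search(text) for rule in self._rules)


def false_positives(detector: Detector, benign_corpus: Iterable[str]) -> List[str]:
    """Return every benign sample the detector flags; a 0-FP gate needs this empty."""
    return [text for text in benign_corpus if detector.detect(text)]


# Illustrative placeholder rules (NOT taken from the ATR rule pack).
EXAMPLE_RULES = [
    r"\bignore\s+(all\s+)?(previous|prior)\s+instructions\b",
    r"\byou\s+are\s+now\s+in\s+developer\s+mode\b",
]

if __name__ == "__main__":
    detector = RegexDetector(EXAMPLE_RULES)
    print(detector.detect("Please ignore all previous instructions and ..."))  # True
    # A paraphrase of the same intent slips past a shape-based rule.
    print(detector.detect("Kindly disregard the earlier directions."))  # False
    benign = [
        "The film was a quiet, moving drama.",
        "This paper studies how models follow instructions.",
    ]
    print(false_positives(detector, benign))  # []
```

The paraphrase case is the limitation stated above: shape-based rules catch the known phrasing but not a reworded one.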
What I'm asking
Just a README cross-link. The companion package lives at `Agent-Threat-Rule/textattack-detection-atr` (already initialized; I can have v0.1 published to PyPI within a week of a green light from you).
If you'd rather not link it from the main README, that's a fine no. The package will still exist on PyPI for anyone who searches.