Skip to content

Companion package proposal: textattack-detection plugin (content-layer defense recipe, out-of-tree) #824

@eeee2345

Description

@eeee2345

Hi @vlwk + @qiyanjun — I know TextAttack has been mostly in maintenance mode since v0.3 and that Jack moved to Cornell. So I'm not asking for any core changes; this proposal is deliberately shaped as an out-of-tree companion package that depends on TextAttack and adds a detection lane.

The proposal: a textattack-detection-atr PyPI package (MIT, no CLA) that exposes a Detector interface compatible with TextAttack's Attacker and a reference implementation backed by ATR rules. Users opt in via:

pip install textattack-detection-atr
from textattack_detection_atr import ATRDetectionRecipe
from textattack import Attacker

attack = TextFoolerJin2019.build(model_wrapper)
attacker = Attacker(attack, dataset, detector=ATRDetectionRecipe())
result = attacker.attack_dataset()

# Existing TextAttack metrics unchanged
print(result.attack_success_rate)

# New
print(result.bypass_rate)  # ASR among attacks the detector didn't catch

Zero changes to QData/TextAttack. You don't approve anything, you don't maintain anything, your CI doesn't take on test surface. The only ask is:

  • A line in the README under a "Defense / detection extensions" section pointing at the companion package, so users can discover it.

That's it.

Why this exists

A meaningful chunk of TextAttack's actual usage is in undergraduate / graduate NLP security curricula. Students run TextFooler / BERT-Attack / etc. against a victim model and conclude "the attack succeeded N% of the time." There's no built-in step that asks "and would a content-layer defense have caught it?" — which is the question every security course actually wants to answer.

A defense-evaluation companion module gives that question a one-line answer. It's the kind of extension that helps a foundational framework stay relevant in curricula even when active development has slowed.

About the detector

Agent-Threat-Rule/agent-threat-rules is MIT-licensed, 344 rules, 6-check pre-merge quality gate including 0-FP on a 1,941-sample benign + research-mention corpus. The text-classification rules in the pack cover prompt injection, jailbreak phrasings, and a chunk of the TextFooler / BERT-Attack canonical shapes — though honestly, the framework's strength against textual adversarial attacks is regex against well-known shapes, not novel paraphrases. We're upfront about this in the package README.

What I'm asking

Just a README cross-link. The companion package lives at Agent-Threat-Rule/textattack-detection-atr (already initialized; I can have v0.1 published to PyPI within a week of a green light from you).

If you'd rather not link it from the main README, that's a fine no. The package will still exist on PyPI for anyone who searches.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions