Companion package proposal: textattack-detection plugin (content-layer defense recipe, out-of-tree)

Hi @vlwk + @qiyanjun — I know TextAttack has been mostly in maintenance mode since v0.3 and that Jack moved to Cornell. So I'm not asking for any core changes; this proposal is **deliberately shaped as an out-of-tree companion package** that depends on TextAttack and adds a detection lane.

The proposal: a `textattack-detection-atr` PyPI package (MIT, no CLA) that exposes a `Detector` interface compatible with TextAttack's `Attacker` and a reference implementation backed by ATR rules. Users opt in via:

```python
pip install textattack-detection-atr
```

```python
from textattack_detection_atr import ATRDetectionRecipe
from textattack import Attacker

attack = TextFoolerJin2019.build(model_wrapper)
attacker = Attacker(attack, dataset, detector=ATRDetectionRecipe())
result = attacker.attack_dataset()

# Existing TextAttack metrics unchanged
print(result.attack_success_rate)

# New
print(result.bypass_rate)  # ASR among attacks the detector didn't catch
```

Zero changes to `QData/TextAttack`. You don't approve anything, you don't maintain anything, your CI doesn't take on test surface. The only ask is:

- **A line in the README** under a "Defense / detection extensions" section pointing at the companion package, so users can discover it.

That's it.

## Why this exists

A meaningful chunk of TextAttack's actual usage is in undergraduate / graduate NLP security curricula. Students run TextFooler / BERT-Attack / etc. against a victim model and conclude "the attack succeeded N% of the time." There's no built-in step that asks "and would a content-layer defense have caught it?" — which is the question every security course actually wants to answer.

A defense-evaluation companion module gives that question a one-line answer. It's the kind of extension that helps a foundational framework stay relevant in curricula even when active development has slowed.

## About the detector

[Agent-Threat-Rule/agent-threat-rules](https://github.com/Agent-Threat-Rule/agent-threat-rules) is MIT-licensed, 344 rules, 6-check pre-merge quality gate including 0-FP on a 1,941-sample benign + research-mention corpus. The text-classification rules in the pack cover prompt injection, jailbreak phrasings, and a chunk of the TextFooler / BERT-Attack canonical shapes — though honestly, the framework's strength against textual adversarial attacks is regex against well-known shapes, not novel paraphrases. We're upfront about this in the package README.

## What I'm asking

Just a README cross-link. The companion package lives at `Agent-Threat-Rule/textattack-detection-atr` (already initialized; I can have v0.1 published to PyPI within a week of a green light from you).

If you'd rather not link it from the main README, that's a fine no. The package will still exist on PyPI for anyone who searches.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Companion package proposal: textattack-detection plugin (content-layer defense recipe, out-of-tree) #824

Why this exists

About the detector

What I'm asking

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Companion package proposal: textattack-detection plugin (content-layer defense recipe, out-of-tree) #824

Description

Why this exists

About the detector

What I'm asking

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions