Skip to content

Add bandit security checks and fix B615 HuggingFace unsafe download warnings#1431

Draft
Copilot wants to merge 9 commits into
mainfrom
copilot/issue-1430-add-bandit-checks
Draft

Add bandit security checks and fix B615 HuggingFace unsafe download warnings#1431
Copilot wants to merge 9 commits into
mainfrom
copilot/issue-1430-add-bandit-checks

Conversation

Copilot AI commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Adds bandit to CI and resolves all B615 (HuggingFace unsafe download without revision pinning) findings across the codebase. Also moves B202, B310, B311, B614, B105, and B107 suppressions from the global skips list to location-specific # nosec comments. Adds # NOSONAR to all pseudo-random call sites to suppress SonarQube S2245 warnings.

What do these changes do

  • Adds bandit>=1.9.4 to dev dependencies and a [tool.bandit] config section in pyproject.toml
  • Adds .github/workflows/bandit.yml CI workflow running bandit -r pythainlp -c pyproject.toml on push/PR to dev and main
  • Adds revision: Optional[str] = None parameter to 12 model-loading functions so callers can pin to a commit hash
  • Adds # nosec B615 to 13 call sites where hardcoded well-known model names are loaded at module level (parameterization impractical)
  • Removes B615 from the global skips list — bandit now enforces it
  • Moves B202, B310, B311, B614, B105, and B107 from the global skips list to location-specific # nosec BXXX comments at each call site; only B110 (try/except/pass structural pattern) remains in the global skips list
  • Adds # NOSONAR to all 14 random.* call sites across 7 files to suppress SonarQube S2245 (pseudo-random number generator) warnings

What was wrong

HuggingFace from_pretrained(), hf_hub_download(), and snapshot_download() calls throughout the codebase had no revision parameter, meaning models are always fetched from the mutable tip of the default branch. This is flagged as CWE-494 (download of code without integrity check).

Additionally, B202, B310, B311, B614, B105, and B107 were suppressed globally via the skips list, making the suppressions invisible and unenforced at the relevant call sites.

The # nosec B311 (Bandit) and # noqa: S311 (Ruff/Flake8) comments at pseudo-random call sites do not suppress SonarQube warnings; SonarQube requires its own # NOSONAR token.

How this fixes it

For functions that accept a user-supplied model name, a revision parameter is threaded through to the underlying HuggingFace call. Bandit's B615 check passes when revision= is bound to a non-constant expression (a variable), even when the default is None. Callers can now pin to a full commit SHA for reproducible, tamper-evident downloads:

# Pin to a specific commit for reproducible, secure downloads
get_hf_hub("airesearch/wangchanberta-base-att-spm-uncased",
           revision="a3b4c5d...")

# Or omit for previous behaviour (latest)
get_hf_hub("airesearch/wangchanberta-base-att-spm-uncased")

For module-level singletons with hardcoded model names (e.g., phayathaibert/core.py, spell/wanchanberta_thai_grammarly.py), # nosec B615 is used since the model identity is fixed in the source.

For the remaining rules, suppressions are now co-located with the code they apply to:

  • B310 (urlopen): two call sites in corpus/core.py, both using HTTPS with SSL verification enabled
  • B202 (extractall): tar.extractall() and zip_file.extractall() in corpus/core.py, each preceded by explicit path-traversal validation
  • B311 (pseudo-random): 14 random.* call sites across 7 files — all non-security NLP uses; each line now carries # noqa: S311 # nosec B311 # NOSONAR to suppress Ruff, Bandit, and SonarQube respectively
  • B614 (torch.load): two call sites in transliterate/thai2rom.py and transliterate/thaig2p.py, loading trusted internal corpus model files
  • B105 (hardcoded password string): generate/thai2fit.py ("pad_token": 1 config value) and tokenize/_utils.py (empty string accumulator)
  • B107 (hardcoded password default): phayathaibert/core.py (space_token default) and translate/tokenization_small100.py (tokenizer special tokens such as <s>, </s>, <pad>, <unk>)

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test

Copilot AI added 2 commits June 10, 2026 09:48
- Add bandit>=1.9.4 to the dev dependency group in pyproject.toml
- Add [tool.bandit] config section with skips for false positives and
  accepted risks (B105, B107, B110, B202, B310, B311, B614, B615)
- Add .github/workflows/bandit.yml workflow triggered on push/PR to
  both dev and main branches
- Add # nosec B324 to MD5 hash line in corpus/core.py (already had
  # noqa: S324 for ruff; this suppression is for bandit)

All issues accounted for: bandit now runs cleanly with 0 findings.
Copilot AI added 2 commits June 10, 2026 11:17
Add revision parameter to all user-supplied model loaders so that
callers can pin to a specific commit hash for reproducible and secure
downloads. Use # nosec B615 for module-level singletons with
hardcoded well-known model names where parameterization is impractical.
Remove B615 from the global bandit skips list.

Files with new revision parameter:
- pythainlp/corpus/core.py (get_hf_hub)
- pythainlp/generate/wangchanglm.py (WangChanGLM.load_model)
- pythainlp/lm/qwen3.py (Qwen3.load_model)
- pythainlp/parse/transformers_ud.py (Parse.__init__)
- pythainlp/parse/ud_goeswith.py (Parse.__init__)
- pythainlp/phayathaibert/core.py (PartOfSpeechTagger, NamedEntityTagger)
- pythainlp/summarize/mt5.py (mT5Summarizer.__init__)
- pythainlp/tag/pos_tag.py (pos_tag_transformers)
- pythainlp/translate/small100.py (Small100Translator.__init__)
- pythainlp/translate/th_fr.py (ThFrTranslator.__init__)
- pythainlp/translate/zh_th.py (ThZhTranslator, ZhThTranslator)
- pythainlp/wangchanberta/core.py (NamedEntityRecognition.__init__)

Files with # nosec B615 (hardcoded model names):
- pythainlp/augment/lm/phayathaibert.py
- pythainlp/augment/lm/wangchanberta.py
- pythainlp/phayathaibert/core.py (module-level singleton)
- pythainlp/spell/wanchanberta_thai_grammarly.py
- pythainlp/wangchanberta/core.py (hardcoded revision="main")

bandit now reports 0 B615 findings (11 lines suppressed via # nosec).
The revision parameter belongs on cached_file() (to fetch the right
version from HuggingFace Hub), not on AutoConfig.from_pretrained()
which receives the resulting local path. Add # nosec B615 there since
the revision is already handled upstream by cached_file().
Copilot AI changed the title Add bandit security checks (resolves #1430) Add bandit security checks and fix B615 HuggingFace unsafe download warnings Jun 10, 2026
Copilot finished work on behalf of bact June 10, 2026 11:21
@bact bact added github_actions Pull requests that update GitHub Actions code security tests Unit test, code coverage, test case labels Jun 12, 2026
@bact

bact commented Jun 12, 2026

Copy link
Copy Markdown
Member

@copilot

When possible the B202, B310, B311, and B614 exceptions should be a location-specific exception, not a catch-all exception at the configuration level.

Copilot AI commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@copilot

When possible the B202, B310, B311, and B614 exceptions should be a location-specific exception, not a catch-all exception at the configuration level.

Done. B202, B310, B311, and B614 have been removed from the global skips list in pyproject.toml and replaced with location-specific # nosec BXXX comments at each call site:

  • B310: two urlopen(...) calls in corpus/core.py
  • B202: tar.extractall(...) and zip_file.extractall(...) in corpus/core.py
  • B311: all random.* calls across augment/lm/phayathaibert.py, generate/core.py, generate/thai2fit.py, phayathaibert/core.py, tools/misspell.py, transliterate/thai2rom.py, and transliterate/thaig2p.py
  • B614: two torch.load(...) calls in transliterate/thai2rom.py and transliterate/thaig2p.py

Bandit now reports zero issues with 33 location-specific suppressions tracked. Commit: Move B202/B310/B311/B614 from global skips to location-specific nosec comments

Copilot finished work on behalf of bact June 12, 2026 11:19
@bact

bact commented Jun 12, 2026

Copy link
Copy Markdown
Member

@copilot The Bandit B105 and B107 exceptions should be location-specific too when possible.

Copilot AI commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@copilot The Bandit B105 and B107 exceptions should be location-specific too when possible.

Done. B105 and B107 have been removed from the global skips list and replaced with location-specific # nosec comments:

  • B105: generate/thai2fit.py line 100 ("pad_token": 1), tokenize/_utils.py line 59 (empty string accumulator)
  • B107: phayathaibert/core.py line 123 (space_token: str = "<_>"), translate/tokenization_small100.py lines 152–156 (tokenizer special tokens <s>, </s>, <pad>, <unk>)

The global skips list now only retains B110 (try/except/pass), which is a broad structural pattern not tied to individual call sites.

Copilot finished work on behalf of bact June 12, 2026 11:45
@bact

bact commented Jun 13, 2026

Copy link
Copy Markdown
Member

@copilot SonarQube scan still warns about 2 locations of pseudo random number generator. Looks like the # noqa: S311 # nosec B311 comment is not enough? How to silenced it?

Copilot AI commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

@copilot SonarQube scan still warns about 2 locations of pseudo random number generator. Looks like the # noqa: S311 # nosec B311 comment is not enough? How to silenced it?

SonarQube uses # NOSONAR for inline suppression — # nosec is Bandit-only and # noqa is Ruff/Flake8-only; SonarQube ignores both. Added # NOSONAR to all 14 random.* call sites across transliterate/thai2rom.py, transliterate/thaig2p.py, phayathaibert/core.py, augment/lm/phayathaibert.py, generate/core.py, generate/thai2fit.py, and tools/misspell.py. Commit: Add # NOSONAR to all pseudo-random call sites to silence SonarQube S2245 warnings

Copilot finished work on behalf of bact June 13, 2026 05:35
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

github_actions Pull requests that update GitHub Actions code security tests Unit test, code coverage, test case

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants