Skip to content

Shared: Improvements to SensitiveDataHeuristics.qll#21806

Merged
geoffw0 merged 22 commits into
github:mainfrom
geoffw0:extsensitive
May 19, 2026
Merged

Shared: Improvements to SensitiveDataHeuristics.qll#21806
geoffw0 merged 22 commits into
github:mainfrom
geoffw0:extsensitive

Conversation

@geoffw0
Copy link
Copy Markdown
Contributor

@geoffw0 geoffw0 commented May 6, 2026

This PR consists of a series of small improvements to SensitiveDataHeuristics.qll, intended to find more true and less false sources of sensitive data. One of these changes addresses a request from a user, the rest are motivated by issues we've spotted at various points in the past. None are expected to have a big impact by themselves (but 7 changes x 5 affected languages is quite a lot of surface area).

  • more TPs: card.?no, api.?tok, security.?code patterns. We already had similar cases but no exact coverage for these.
  • less FPS: wildcard_no is not card.?no; profile is not file; coauthor is not oauth.
  • more TPs: the logic for identifying encrypted / encoded values (based on the variable name) was overly wide, excluding names such as security_code for containing code. It was also handling unencrypted incorrectly - while unencrypt was not matched due to the special case, the crypt substring was matched due to the entire unen part of the regex being optional. Copilot gets most of the credit for spotting this one.

Draft PR because I need to:

  • check CI
  • run and examine DCA (all languages, results and performance)
    • Javascript
    • Python
    • Rust
    • Swift
  • run and examine MRVA 100 runs
    • Rust --- we gain a few hundred pieces of sensitive data across the Rust MRVA-100, I looked at most of them and I’m very happy with what I saw.
    • Python --- we gain nearly five hundred pieces of sensitive data across the Python MRVA-100. I looked at a decent sample (> 40) and found mostly excellent additions, plus a few weak ones, and a few incorrectly labelled results lost. The rule for account matches a bit widely and we could potentially add a “not sensitive” rule for validator, if we see more of either of these cases.
  • add change notes

@geoffw0 geoffw0 added Python Ruby Rust Pull requests that update Rust code Swift javascript Pull requests that update Javascript code labels May 6, 2026
@github-actions github-actions Bot added the Python label May 7, 2026
@github-actions github-actions Bot added the JS label May 14, 2026
@geoffw0 geoffw0 marked this pull request as ready for review May 14, 2026 16:25
@geoffw0 geoffw0 requested review from a team as code owners May 14, 2026 16:25
Copilot AI review requested due to automatic review settings May 14, 2026 16:25
@geoffw0 geoffw0 requested a review from a team as a code owner May 14, 2026 16:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the shared sensitive-data naming heuristics used across multiple languages to improve classification of passwords/private information (increasing true positives and reducing false positives), and refreshes language-specific tests and change notes to reflect the updated behavior.

Changes:

  • Refines shared sensitive-data heuristics (regex patterns and exclusions) in SensitiveDataHeuristics.qll.
  • Updates Swift/Python/Rust tests and expected baselines to reflect newly-detected (or no-longer-detected) sensitive data sources.
  • Adds per-language change notes documenting the heuristics improvement.
Show a summary per file
File Description
shared/concepts/codeql/concepts/internal/SensitiveDataHeuristics.qll Updates shared sensitive-data heuristic patterns/exclusions used by multiple languages.
rust/ql/test/library-tests/sensitivedata/test.rs Extends Rust library test coverage for the updated sensitive-data name heuristics.
python/ql/test/query-tests/Security/CWE-312-CleartextLogging/test.py Updates Python cleartext logging test to reflect newly-classified sensitive values.
swift/ql/test/query-tests/Security/CWE-328/testCryptoKit.swift Extends Swift hashing tests to cover additional API spellings.
swift/ql/test/query-tests/Security/CWE-311/testSend.swift Updates Swift transmission test to reflect newly-detected sensitive field.
swift/ql/test/query-tests/Security/CWE-328/WeakSensitiveDataHashing.expected Updates Swift expected results baseline for weak sensitive data hashing.
swift/ql/test/query-tests/Security/CWE-328/WeakPasswordHashing.expected Updates Swift expected results baseline for weak password hashing.
swift/ql/test/query-tests/Security/CWE-311/SensitiveExprs.expected Updates Swift expected sensitive-expression baseline.
swift/ql/test/query-tests/Security/CWE-311/CleartextTransmission.expected Updates Swift expected cleartext transmission baseline.
swift/ql/lib/change-notes/2026-05-14-sensitive-data.md Adds Swift change note for the sensitive-data heuristics update.
rust/ql/lib/change-notes/2026-05-14-sensitive-data.md Adds Rust change note for the sensitive-data heuristics update.
python/ql/lib/change-notes/2026-05-14-sensitive-data.md Adds Python change note for the sensitive-data heuristics update.
javascript/ql/lib/change-notes/2026-05-14-sensitive-data.md Adds JavaScript change note for the sensitive-data heuristics update.

Copilot's findings

  • Files reviewed: 14/14 changed files
  • Comments generated: 5

Comment thread swift/ql/lib/change-notes/2026-05-14-sensitive-data.md Outdated
Comment thread rust/ql/lib/change-notes/2026-05-14-sensitive-data.md Outdated
Comment thread python/ql/lib/change-notes/2026-05-14-sensitive-data.md Outdated
Comment thread javascript/ql/lib/change-notes/2026-05-14-sensitive-data.md Outdated
result =
"(?is).*(pass(wd|word|code|.?phrase)(?!.*question)|(auth(entication|ori[sz]ation)?).?key|oauth|"
+ "api.?(key|token)|([_-]|\\b)mfa([_-]|\\b)).*"
+ "api.?(key|tok)|([_-]|\\b)mfa([_-]|\\b)).*"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This no longer accepts token, e.g. api-token but does accept accepts api-tok, which seems somewhat strange.

Should tok be tok(en)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will accept api-token because the regex is followed by .*, so we're effectively matching a substring here. There are a couple of test cases for rust that examine this:

	sink(api_token); // $ sensitive=password
	sink(api_tok); // $ sensitive=password

// Financial data - such as credit card numbers, salary, bank accounts, and debts
"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)|acc(ou)?nt.?(no|num|credit)|routing.?num|"
"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)|(card|acc(ou)?nt).?(no|num|credit)|routing.?num|"
+ "salary|billing|beneficiary|credit.?(rating|score)|([_-]|\\b)(ccn|cvv|iban)([_-]|\\b)|" +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The new regex accepts strings like cardCredit, which the old one did not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this case, and decided (1) it's unlikely to come up but more importantly (2) if someone has a variable called cardCredit, there's a very good chance that's sensitive data anyway (e.g. the amount of credit someone has on a card?).

"(?is).*([^\\w$.-]|redact|censor|obfuscate|hash|md5|sha|random|((?<!un)(en))?(crypt|(?<!pass)code)|"
+ "certain|concert|secretar|account(ant|ab|ing|ed)|file|path|([_-]|\\b)url).*"
"(?is).*([^\\w$.-]|redact|censor|obfuscate|hash|md5|sha|random|(?<!unen)crypt|(?<!un)encode|" +
"certain|concert|secretar|wildcard|coauthor|account(ant|ab|ing|ed)|(?<!pro)file|path|([_-]|\\b)url).*"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new regex no longer accepts unencrypt. Don't know if that's on purpose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's on purpose - the original was supposed to match crypt and encrypt but not unencrypt, but it actually did match unencrypt (via ignoring the optional bit and just matching .*crypt.*). The new version matches encrypt and crypt but not unencrypt.

Copy link
Copy Markdown
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 👍

@geoffw0
Copy link
Copy Markdown
Contributor Author

geoffw0 commented May 19, 2026

Thanks for the reviews, I'm going to merge this now but I'm happy to respond to any further comments post-merge.

@geoffw0 geoffw0 merged commit 3aa6606 into github:main May 19, 2026
140 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation javascript Pull requests that update Javascript code JS Python Rust Pull requests that update Rust code Swift

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants