Shared: Improvements to SensitiveDataHeuristics.qll #21806
Pull request overview
This PR updates the shared sensitive-data naming heuristics used across multiple languages to improve classification of passwords/private information (increasing true positives and reducing false positives), and refreshes language-specific tests and change notes to reflect the updated behavior.
Changes:
- Refines shared sensitive-data heuristics (regex patterns and exclusions) in SensitiveDataHeuristics.qll.
- Updates Swift/Python/Rust tests and expected baselines to reflect newly-detected (or no-longer-detected) sensitive data sources.
- Adds per-language change notes documenting the heuristics improvement.
| File | Description |
|---|---|
| shared/concepts/codeql/concepts/internal/SensitiveDataHeuristics.qll | Updates shared sensitive-data heuristic patterns/exclusions used by multiple languages. |
| rust/ql/test/library-tests/sensitivedata/test.rs | Extends Rust library test coverage for the updated sensitive-data name heuristics. |
| python/ql/test/query-tests/Security/CWE-312-CleartextLogging/test.py | Updates Python cleartext logging test to reflect newly-classified sensitive values. |
| swift/ql/test/query-tests/Security/CWE-328/testCryptoKit.swift | Extends Swift hashing tests to cover additional API spellings. |
| swift/ql/test/query-tests/Security/CWE-311/testSend.swift | Updates Swift transmission test to reflect newly-detected sensitive field. |
| swift/ql/test/query-tests/Security/CWE-328/WeakSensitiveDataHashing.expected | Updates Swift expected results baseline for weak sensitive data hashing. |
| swift/ql/test/query-tests/Security/CWE-328/WeakPasswordHashing.expected | Updates Swift expected results baseline for weak password hashing. |
| swift/ql/test/query-tests/Security/CWE-311/SensitiveExprs.expected | Updates Swift expected sensitive-expression baseline. |
| swift/ql/test/query-tests/Security/CWE-311/CleartextTransmission.expected | Updates Swift expected cleartext transmission baseline. |
| swift/ql/lib/change-notes/2026-05-14-sensitive-data.md | Adds Swift change note for the sensitive-data heuristics update. |
| rust/ql/lib/change-notes/2026-05-14-sensitive-data.md | Adds Rust change note for the sensitive-data heuristics update. |
| python/ql/lib/change-notes/2026-05-14-sensitive-data.md | Adds Python change note for the sensitive-data heuristics update. |
| javascript/ql/lib/change-notes/2026-05-14-sensitive-data.md | Adds JavaScript change note for the sensitive-data heuristics update. |
Copilot's findings
- Files reviewed: 14/14 changed files
- Comments generated: 5
result =
"(?is).*(pass(wd|word|code|.?phrase)(?!.*question)|(auth(entication|ori[sz]ation)?).?key|oauth|"
+ "api.?(key|token)|([_-]|\\b)mfa([_-]|\\b)).*"
+ "api.?(key|tok)|([_-]|\\b)mfa([_-]|\\b)).*"
This no longer accepts `token`, e.g. `api-token`, but does accept `api-tok`, which seems somewhat strange.
Should `tok` be `tok(en)?`?
It will still accept `api-token`, because the alternative is followed by `.*`, so we're effectively matching a substring here. There are a couple of test cases for Rust that examine this:
sink(api_token); // $ sensitive=password
sink(api_tok); // $ sensitive=password
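The substring behaviour described in this reply can be checked outside CodeQL. The following is a minimal Python sketch, not the library's implementation: it copies only the password pattern quoted in the diff, assumes whole-string matching (which the leading and trailing `.*` suggest), and uses a hypothetical helper name `is_match`.

```python
import re

# Password pattern as quoted in the diff, with QL's "\\b" written as regex \b.
# OLD ends the alternative with api.?(key|token); NEW shortens it to api.?(key|tok).
PREFIX = (
    r"(?is).*(pass(wd|word|code|.?phrase)(?!.*question)"
    r"|(auth(entication|ori[sz]ation)?).?key|oauth|"
)
OLD_PASSWORD = PREFIX + r"api.?(key|token)|([_-]|\b)mfa([_-]|\b)).*"
NEW_PASSWORD = PREFIX + r"api.?(key|tok)|([_-]|\b)mfa([_-]|\b)).*"


def is_match(pattern: str, name: str) -> bool:
    """Whole-string match; the surrounding .* turns it into a substring test."""
    return re.fullmatch(pattern, name) is not None


for name in ["api_token", "api-token", "api_tok"]:
    print(name, is_match(OLD_PASSWORD, name), is_match(NEW_PASSWORD, name))
```

Because `tok` is a prefix of `token`, the trailing `.*` absorbs the remaining `en`, so the shorter alternative covers both spellings.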
// Financial data - such as credit card numbers, salary, bank accounts, and debts
"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)|acc(ou)?nt.?(no|num|credit)|routing.?num|"
"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)|(card|acc(ou)?nt).?(no|num|credit)|routing.?num|"
+ "salary|billing|beneficiary|credit.?(rating|score)|([_-]|\\b)(ccn|cvv|iban)([_-]|\\b)|" +
Nit: The new regex accepts strings like cardCredit, which the old one did not.
Yeah, I thought about this case, and decided (1) it's unlikely to come up but more importantly (2) if someone has a variable called cardCredit, there's a very good chance that's sensitive data anyway (e.g. the amount of credit someone has on a card?).
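To make the `cardCredit` observation concrete, here is a small Python sketch that compares only the first line of the financial-data alternation before and after the change. It is an illustration under assumptions: the fragment is taken out of its enclosing regex, so it is applied with a case-insensitive `re.search`, and the helper name `hits` is hypothetical.

```python
import re

# First line of the financial-data alternation, before and after the change.
OLD_FIN = (
    r"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)"
    r"|acc(ou)?nt.?(no|num|credit)|routing.?num"
)
NEW_FIN = (
    r"(credit|debit|bank|visa).?(card|num|no|acc(ou)?nt)"
    r"|(card|acc(ou)?nt).?(no|num|credit)|routing.?num"
)


def hits(fragment: str, name: str) -> bool:
    """Substring search, standing in for the .* that wraps the full regex."""
    return re.search(fragment, name, re.IGNORECASE | re.DOTALL) is not None


for name in ["card_no", "cardCredit", "credit_card", "account_no"]:
    print(name, hits(OLD_FIN, name), hits(NEW_FIN, name))
```

Adding `card` to the left-hand group is what picks up `card_no`; `cardCredit` comes along because `credit` was already on the right-hand side for `account_credit`-style names.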
"(?is).*([^\\w$.-]|redact|censor|obfuscate|hash|md5|sha|random|((?<!un)(en))?(crypt|(?<!pass)code)|"
+ "certain|concert|secretar|account(ant|ab|ing|ed)|file|path|([_-]|\\b)url).*"
"(?is).*([^\\w$.-]|redact|censor|obfuscate|hash|md5|sha|random|(?<!unen)crypt|(?<!un)encode|" +
"certain|concert|secretar|wildcard|coauthor|account(ant|ab|ing|ed)|(?<!pro)file|path|([_-]|\\b)url).*"
The new regex no longer accepts unencrypt. Don't know if that's on purpose.
Yes that's on purpose - the original was supposed to match crypt and encrypt but not unencrypt, but it actually did match unencrypt (via ignoring the optional bit and just matching .*crypt.*). The new version matches encrypt and crypt but not unencrypt.
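The optional-group problem described in this reply can be reproduced with just the `crypt`/`code` fragment of each regex. This is a hedged Python sketch, assuming the fragment behaves the same in isolation as inside the full pattern; `found` is a hypothetical helper.

```python
import re

# Only the crypt/code part of the not-sensitive regex, old versus new.
OLD_CRYPT = r"((?<!un)(en))?(crypt|(?<!pass)code)"
NEW_CRYPT = r"(?<!unen)crypt|(?<!un)encode"


def found(fragment: str, name: str) -> bool:
    """Substring search, standing in for the .* that wraps the full regex."""
    return re.search(fragment, name, re.IGNORECASE | re.DOTALL) is not None


# In OLD_CRYPT the whole "((?<!un)(en))?" group is optional, so the engine can
# skip it and match the bare "crypt" inside "unencrypt".
for name in ["crypt", "encrypt", "unencrypt", "security_code", "passcode"]:
    print(name, found(OLD_CRYPT, name), found(NEW_CRYPT, name))
```

In the new fragment the lookbehind is attached directly to `crypt`, so there is no way to reach `crypt` without passing the `unen` check. Dropping the bare `code` alternative is also what stops `security_code` from matching.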
Thanks for the reviews, I'm going to merge this now but I'm happy to respond to any further comments post-merge.
This PR consists of a series of small improvements to SensitiveDataHeuristics.qll, intended to find more true and fewer false sources of sensitive data. One of these changes addresses a request from a user; the rest are motivated by issues we've spotted at various points in the past. None are expected to have a big impact by themselves (but 7 changes x 5 affected languages is quite a lot of surface area).
- Added `card.?no`, `api.?tok`, `security.?code` patterns. We already had similar cases but no exact coverage for these.
- `wildcard_no` is not `card.?no`; `profile` is not `file`; `coauthor` is not `oauth`.
- The not-sensitive regex was excluding `security_code` for containing `code`. It was also handling `unencrypted` incorrectly - while `unencrypt` was not matched due to the special case, the `crypt` substring was matched due to the entire `unen` part of the regex being optional. Copilot gets most of the credit for spotting this one.

Draft PR because I need to:
`account` matches a bit widely and we could potentially add a "not sensitive" rule for `validator`, if we see more of either of these cases.
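The `wildcard_no`, `profile`, `coauthor` and `security_code` cases from the description can be tried against the two versions of the not-sensitive regex quoted in the review thread. The Python sketch below is illustrative only: it assumes whole-string matching, covers just this one exclusion regex rather than the full classification logic, and `excluded` is a hypothetical helper name.

```python
import re

# Not-sensitive regex as quoted in the review thread (QL "\\w"/"\\b" unescaped).
OLD_NOT_SENSITIVE = (
    r"(?is).*([^\w$.-]|redact|censor|obfuscate|hash|md5|sha|random"
    r"|((?<!un)(en))?(crypt|(?<!pass)code)"
    r"|certain|concert|secretar|account(ant|ab|ing|ed)|file|path|([_-]|\b)url).*"
)
NEW_NOT_SENSITIVE = (
    r"(?is).*([^\w$.-]|redact|censor|obfuscate|hash|md5|sha|random"
    r"|(?<!unen)crypt|(?<!un)encode"
    r"|certain|concert|secretar|wildcard|coauthor|account(ant|ab|ing|ed)"
    r"|(?<!pro)file|path|([_-]|\b)url).*"
)


def excluded(pattern: str, name: str) -> bool:
    """True when the name matches the not-sensitive regex as a whole string."""
    return re.fullmatch(pattern, name) is not None


for name in ["wildcard_no", "profile", "coauthor", "security_code"]:
    print(name, excluded(OLD_NOT_SENSITIVE, name), excluded(NEW_NOT_SENSITIVE, name))
```

Under these assumptions, the new regex excludes `wildcard_no` and `coauthor`, which would otherwise be caught by the `card.?no` and `oauth` patterns, and stops excluding `profile` and `security_code`.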