Skip to content

update remote config polling to be faster and centralized#6851

Closed
rochdev wants to merge 2 commits into
mainfrom
rochdev/faster-rc-poll
Closed

update remote config polling to be faster and centralized#6851
rochdev wants to merge 2 commits into
mainfrom
rochdev/faster-rc-poll

Conversation

@rochdev
Copy link
Copy Markdown
Member

@rochdev rochdev commented May 4, 2026

Motivation

The main motivation behind this PR was to fix the APPSEC_RUNTIME_ACTIVATION, APPSEC_AUTO_EVENTS_RC and APPSEC_API_SECURITY_RC tests which are extremely slow. Turned out that most of the time was spent waiting on RC polling since it happens infrequently in production to avoid overloading our backend. This limitation does not exist for tests, so they should instead poll as quickly as possible. This makes for example APPSEC_RUNTIME_ACTIVATION go from ~10 minutes to ~2 minutes.

While working on this, I also realized that all tests using RC are already overriding the value, but each test does so on their own. This makes it very easy to forget to do it for a specific test (like the above) which ends up slowing the entire workflow. So I instead removed all the overrides and add a fixed value in a centralized location. This will thus benefit any future tests as well.

Changes

Update remote config polling to be faster and centralized.

I had originally opened #6808 which combined multiple fixes and didn't quite get to the root cause of remote config specifically, so it's superseded by this one and the other fixes will be later split up as well.

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on you PR until the CI passes
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from RFC owner.
    • Framework is modified, or non obvious usage of it -> get a review from R&P team

🚀 Once your PR is reviewed and the CI green, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • Anything but tests/ or manifests/ is modified ? I have the approval from R&P team
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added, removed or renamed?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

CODEOWNERS have been resolved as:

tests/parametric/test_dynamic_configuration.py                          @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
tests/parametric/test_ffe/test_dynamic_evaluation.py                    @DataDog/feature-flagging-and-experimentation-sdk @DataDog/system-tests-core
tests/parametric/test_tracer_flare.py                                   @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
utils/_context/_scenarios/__init__.py                                   @DataDog/system-tests-core
utils/_context/_scenarios/debugger.py                                   @DataDog/system-tests-core
utils/_context/_scenarios/endtoend.py                                   @DataDog/system-tests-core
utils/_context/_scenarios/parametric.py                                 @DataDog/system-tests-core

@datadog-datadog-prod-us1
Copy link
Copy Markdown

datadog-datadog-prod-us1 Bot commented May 4, 2026

Tests

Fix all issues with BitsAI or with Cursor

⚠️ Warnings

🧪 3 Tests failed

tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig.test_update_rules[poc] from system_tests_suite   View in Datadog   (Fix with Cursor)
AssertionError: assert '1.15.0' == '2.71.8182'
  - 2.71.8182
  + 1.15.0

self = <tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig object at 0x7f834d07a900>

    def test_update_rules(self):
        expected_rules_version_tag = "_dd.appsec.event_rules.version"
        expected_version_regex = r"[0-9]+\.[0-9]+\.[0-9]+"
    
...
tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig.test_update_rules[poc] from system_tests_suite   View in Datadog   (Fix with Cursor)
AssertionError: missing span meta tag \`_dd.appsec.event_rules.version\` in meta
assert '_dd.appsec.event_rules.version' in {'_dd.appsec.fp.http.endpoint': 'http-get-ce4235d7--', '_dd.appsec.fp.http.header': 'hdr-0010000000-dfb9a587-1-4740ae63', '_dd.appsec.fp.http.network': 'net-0-0000000000', '_dd.appsec.fp.session': 'ssn----b2d936d9', ...}

self = <tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig object at 0x7fa8b827c410>

    def test_update_rules(self):
        expected_rules_version_tag = "_dd.appsec.event_rules.version"
        expected_version_regex = r"[0-9]+\.[0-9]+\.[0-9]+"
    
        def validate_waf_rule_version_tag(span: DataDogLibrarySpan, appsec_data: dict):  # noqa: ARG001
...
tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig.test_update_rules[uds] from system_tests_suite   View in Datadog   (Fix with Cursor)
AssertionError: assert '1.15.0' == '2.71.8182'
  - 2.71.8182
  + 1.15.0

self = <tests.appsec.test_remote_config_rule_changes.Test_UpdateRuleFileWithRemoteConfig object at 0x7f7e4c2174d0>

    def test_update_rules(self):
        expected_rules_version_tag = "_dd.appsec.event_rules.version"
        expected_version_regex = r"[0-9]+\.[0-9]+\.[0-9]+"
    
...

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a871fab | Docs | Datadog PR Page | Give us feedback!

@rochdev
Copy link
Copy Markdown
Member Author

rochdev commented May 4, 2026

PHP is failing everywhere, not related to this PR.

@rochdev rochdev marked this pull request as ready for review May 4, 2026 19:18
@rochdev rochdev requested review from a team as code owners May 4, 2026 19:18
@rochdev rochdev requested review from dd-oleksii, greghuels and rachelyangdog and removed request for a team May 4, 2026 19:18
Comment on lines +241 to +245
if rc_api_enabled:
# Low poll interval ensures RC config changes are picked up quickly during tests.
# Polling aggressively is safe because the RC backend is a mock, not a
# real Datadog backend that could be affected by high request rates.
self._weblog_env.setdefault("DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS", "0.2")
Copy link
Copy Markdown
Collaborator

@cbeauchesne cbeauchesne May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a QA point of view, I advise to not change this globally : it set all end-to-end scenarios to a configuration different from the default configuration, which is a violation of a key principle of end-to-end testing : stay as close as possible from what your customers are using ; and triple check that every trade-off on this rule is more than worthy.

While working on this, I also realized that all tests using RC are already overriding the value, but each test does so on their own. This makes it very easy to forget to do it for a specific test (like the above) which ends up slowing the entire workflow. So I instead removed all the overrides and add a fixed value in a centralized location. This will thus benefit any future tests as well.

I'd say it was made on purpose, either to test this specific value, or to optimize the runtime, making a one-of compromise with the above rule.

This makes for example APPSEC_RUNTIME_ACTIVATION go from ~10 minutes to ~2 minutes.

Instead, I (strongly) advise to set this value for specific scenarios like this one where it makes a real difference, justifying a good tradeoff, with an explicit aproval from the team owning product impacted by this change.

Setting this as a blocking comment, until we have the approval form all impacted teams.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a QA point of view, I advise to not change this globally : it set all end-to-end scenarios to a configuration different from the default configuration, which is a violation of a key principle of end-to-end testing : stay as close as possible from what your customers are using ; and triple check that every trade-off on this rule is more than worthy.

That would be true for a configuration that changes behaviour entirely, but not for something like polling. The only reason the default value is high is because we use short-polling, and you can't have too short of an interval with short-polling, otherwise you would overload the server. This is not an issue here, so we can use a much lower value. It still polls exactly the same way, but faster.

I'd say it was made on purpose, either to test this specific value, or to optimize the runtime, making a one-of compromise with the above rule.

It's because it makes everything slow, same issue.

Instead, I (strongly) advise to set this value for specific scenarios like this one where it makes a real difference, justifying a good tradeoff, with an explicit aproval from the team owning product impacted by this change.

The value is already set in many places, and would need to be set in many more now, and implementors of any future test would need to know to do it or we'd be back to square one. If something is needed everywhere always, it should be set once in a centralized location. There is absolutely no benefit to your suggestion. Polling should always be faster in test. That's true not only for RC but for any other polling we do.

self._weblog_env = dict(weblog_env) if weblog_env else {}
if rc_api_enabled:
# Low poll interval ensures RC config changes are picked up quickly during tests.
# Polling aggressively is safe because the RC backend is a mock, not a
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default it's mocked, but it's an option. if mocked_backend is set to false, then it's not mocked. Could you add a condition on this value ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still poll faster, we don't have nearly as many hosts sending requests than in production.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very probably, But then the comment is not accurate, could you update it ?

Comment on lines +193 to +195
# Low poll interval ensures RC config changes are picked up quickly during tests.
# Polling aggressively is safe because the RC backend is a mock, not a
# real Datadog backend that could be affected by high request rates.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not wrong, the parametric scenario does not use nay backend.

Suggested change
# Low poll interval ensures RC config changes are picked up quickly during tests.
# Polling aggressively is safe because the RC backend is a mock, not a
# real Datadog backend that could be affected by high request rates.
# Low poll interval ensures RC config changes are picked up quickly during tests.
# Polling aggressively is safe because parametric scenario does not use any backend

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that was worded properly. The reason we can't poll aggressively by default is two-fold:

  1. Way too many hosts in the real world, and we can't have all of them poll every 200ms.
  2. Real application are long-lived, and the config shouldn't change often. This is not true in tests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I get it. It's just to precise that it's safe to use this in the parametric scenario, not because we use a mocked backend, but because so it's safe to use this does not use any backend.

@rochdev
Copy link
Copy Markdown
Member Author

rochdev commented May 4, 2026

Looks like there is a race condition in .NET, should it just be excluded for now?

@rochdev
Copy link
Copy Markdown
Member Author

rochdev commented May 12, 2026

Superseded by #6879

@rochdev rochdev closed this May 12, 2026
@cbeauchesne cbeauchesne deleted the rochdev/faster-rc-poll branch May 12, 2026 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants