update remote config polling to be faster and centralized#6851
Conversation
|
|
|
✨ Fix all issues with BitsAI or with Cursor
|
|
PHP is failing everywhere, not related to this PR. |
| if rc_api_enabled: | ||
| # Low poll interval ensures RC config changes are picked up quickly during tests. | ||
| # Polling aggressively is safe because the RC backend is a mock, not a | ||
| # real Datadog backend that could be affected by high request rates. | ||
| self._weblog_env.setdefault("DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS", "0.2") |
There was a problem hiding this comment.
From a QA point of view, I advise to not change this globally : it set all end-to-end scenarios to a configuration different from the default configuration, which is a violation of a key principle of end-to-end testing : stay as close as possible from what your customers are using ; and triple check that every trade-off on this rule is more than worthy.
While working on this, I also realized that all tests using RC are already overriding the value, but each test does so on their own. This makes it very easy to forget to do it for a specific test (like the above) which ends up slowing the entire workflow. So I instead removed all the overrides and add a fixed value in a centralized location. This will thus benefit any future tests as well.
I'd say it was made on purpose, either to test this specific value, or to optimize the runtime, making a one-of compromise with the above rule.
This makes for example APPSEC_RUNTIME_ACTIVATION go from ~10 minutes to ~2 minutes.
Instead, I (strongly) advise to set this value for specific scenarios like this one where it makes a real difference, justifying a good tradeoff, with an explicit aproval from the team owning product impacted by this change.
Setting this as a blocking comment, until we have the approval form all impacted teams.
There was a problem hiding this comment.
From a QA point of view, I advise to not change this globally : it set all end-to-end scenarios to a configuration different from the default configuration, which is a violation of a key principle of end-to-end testing : stay as close as possible from what your customers are using ; and triple check that every trade-off on this rule is more than worthy.
That would be true for a configuration that changes behaviour entirely, but not for something like polling. The only reason the default value is high is because we use short-polling, and you can't have too short of an interval with short-polling, otherwise you would overload the server. This is not an issue here, so we can use a much lower value. It still polls exactly the same way, but faster.
I'd say it was made on purpose, either to test this specific value, or to optimize the runtime, making a one-of compromise with the above rule.
It's because it makes everything slow, same issue.
Instead, I (strongly) advise to set this value for specific scenarios like this one where it makes a real difference, justifying a good tradeoff, with an explicit aproval from the team owning product impacted by this change.
The value is already set in many places, and would need to be set in many more now, and implementors of any future test would need to know to do it or we'd be back to square one. If something is needed everywhere always, it should be set once in a centralized location. There is absolutely no benefit to your suggestion. Polling should always be faster in test. That's true not only for RC but for any other polling we do.
| self._weblog_env = dict(weblog_env) if weblog_env else {} | ||
| if rc_api_enabled: | ||
| # Low poll interval ensures RC config changes are picked up quickly during tests. | ||
| # Polling aggressively is safe because the RC backend is a mock, not a |
There was a problem hiding this comment.
By default it's mocked, but it's an option. if mocked_backend is set to false, then it's not mocked. Could you add a condition on this value ?
There was a problem hiding this comment.
We can still poll faster, we don't have nearly as many hosts sending requests than in production.
There was a problem hiding this comment.
Very probably, But then the comment is not accurate, could you update it ?
| # Low poll interval ensures RC config changes are picked up quickly during tests. | ||
| # Polling aggressively is safe because the RC backend is a mock, not a | ||
| # real Datadog backend that could be affected by high request rates. |
There was a problem hiding this comment.
If I'm not wrong, the parametric scenario does not use nay backend.
| # Low poll interval ensures RC config changes are picked up quickly during tests. | |
| # Polling aggressively is safe because the RC backend is a mock, not a | |
| # real Datadog backend that could be affected by high request rates. | |
| # Low poll interval ensures RC config changes are picked up quickly during tests. | |
| # Polling aggressively is safe because parametric scenario does not use any backend |
There was a problem hiding this comment.
Maybe that was worded properly. The reason we can't poll aggressively by default is two-fold:
- Way too many hosts in the real world, and we can't have all of them poll every 200ms.
- Real application are long-lived, and the config shouldn't change often. This is not true in tests.
There was a problem hiding this comment.
Yep, I get it. It's just to precise that it's safe to use this in the parametric scenario, not because we use a mocked backend, but because so it's safe to use this does not use any backend.
|
Looks like there is a race condition in .NET, should it just be excluded for now? |
|
Superseded by #6879 |
Motivation
The main motivation behind this PR was to fix the
APPSEC_RUNTIME_ACTIVATION,APPSEC_AUTO_EVENTS_RCandAPPSEC_API_SECURITY_RCtests which are extremely slow. Turned out that most of the time was spent waiting on RC polling since it happens infrequently in production to avoid overloading our backend. This limitation does not exist for tests, so they should instead poll as quickly as possible. This makes for exampleAPPSEC_RUNTIME_ACTIVATIONgo from ~10 minutes to ~2 minutes.While working on this, I also realized that all tests using RC are already overriding the value, but each test does so on their own. This makes it very easy to forget to do it for a specific test (like the above) which ends up slowing the entire workflow. So I instead removed all the overrides and add a fixed value in a centralized location. This will thus benefit any future tests as well.
Changes
Update remote config polling to be faster and centralized.
I had originally opened #6808 which combined multiple fixes and didn't quite get to the root cause of remote config specifically, so it's superseded by this one and the other fixes will be later split up as well.
Workflow
🚀 Once your PR is reviewed and the CI green, you can merge it!
🛟 #apm-shared-testing 🛟
Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P teambuild-XXX-imagelabel is present