
{AKS} Add provisioningState retry logic with two-phase check validation #32939

Open
deveshdama wants to merge 1 commit into Azure:dev from deveshdama:provisioningstate-retry

Conversation

@deveshdama
Contributor

@deveshdama deveshdama commented Mar 9, 2026

Add opt-in retry logic to the Azure CLI test SDK for handling provisioningState race conditions caused by external modifications (e.g., Azure Policy) during live tests.

Related command
N/A

Description

Problem:

A provisioningState race condition caused by Azure Policy is seen in AKS CLI runners that use the Azure CLI test SDK. Live tests in Microsoft-managed subscriptions intermittently fail with:

Query 'provisioningState' doesn't yield expected value 'Succeeded', instead the actual value is 'Updating'

| Time | Actor | Action | provisioningState |
| --- | --- | --- | --- |
| 20:46:01 | Azure CLI | PUT managed cluster (cli test) | - |
| 20:56:52 | AKS | PUT completes successfully | Succeeded |
| 20:57:08 | Azure Policy | PUT managed cluster (azure policy compliance) | Updating |
| 20:57:14 | Azure CLI SDK | GET managed cluster (test assertion) | Updating (cli test failure) |

Root cause (thanks @FumingZhang for the thorough investigation):

The failure occurs due to a race condition between the Azure CLI SDK's internal GET request and an Azure Policy PUT request (in our case).
Sequence of events causing this race:

1. The CLI sends a PUT to create/update a managed cluster; the operation succeeds (provisioningState = Succeeded at T+0).
2. Azure Policy sends its own PUT to the same resource at T+6s (user agent = ARM), re-applying policy compliance. This resets provisioningState to Updating.
3. The CLI SDK sends a GET at T+8s to read the resource and run test assertions. It sees provisioningState = Updating instead of Succeeded, and the test fails even though the CLI operation itself succeeded.

Proposed Fix in the PR:

When the env var AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true is set, the test framework's cmd() method splits the checks into two sequential phases:

- Phase 1 (provisioningState validation only): if provisioningState != Succeeded, poll the resource via az resource show --ids <id> with exponential backoff until it reaches a terminal state (Succeeded, Failed, or Canceled).
- Phase 2 (remaining validations): validate all non-provisioningState checks against the original command result, ensuring the CLI operation itself produced correct output.
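The phase split described above can be sketched roughly as follows. This is illustrative only: the function name, the `(query, expected)` check representation, and the `poll_until_terminal` callback are assumptions for the sketch, not the PR's actual API.

```python
def cmd_with_two_phase_checks(result, checks, poll_until_terminal):
    """Illustrative sketch of the two-phase check split (not the PR's real code).

    result: parsed output of the original command (a plain dict here for simplicity)
    checks: list of (query, expected_value) pairs
    poll_until_terminal: callable that re-reads the resource by id until
        provisioningState reaches Succeeded/Failed/Canceled, returning that state
    """
    ps_checks = [(q, v) for q, v in checks if q.endswith("provisioningState")]
    other_checks = [(q, v) for q, v in checks if not q.endswith("provisioningState")]

    # Phase 1: provisioningState only -- retried, because an external actor
    # (e.g. Azure Policy) may have flipped it back to 'Updating'.
    for query, expected in ps_checks:
        if result.get(query) != expected:
            final = poll_until_terminal(result["id"])
            assert final == expected, f"terminal state {final!r} != {expected!r}"

    # Phase 2: every other check runs against the ORIGINAL result, so the
    # CLI operation's own output is still what gets validated.
    for query, expected in other_checks:
        assert result.get(query) == expected
```

Note how Phase 2 deliberately re-uses the original command result rather than the polled one, so an external modification cannot mask a genuinely wrong CLI output.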

Usage of etag:

When provisioningState is not Succeeded, poll the resource until it reaches a terminal state or times out. The expectation is that the external operation completes and provisioningState returns to Succeeded, allowing the tests to pass.

The etag is used for observability only; it does not change the flow of the code.

  1. The original etag is captured from the initial command's execution result.
  2. If provisioningState does not match the expected value, polling begins:
    • On each poll, compare the resource's current etag against the last seen value.
    • If they differ, the etag has been changed by an external actor (in our case, Azure Policy). Log a warning.
  3. On timeout, both the original and current etags are logged in the error message for debugging.
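The observability-only etag comparison in steps 1–2 can be sketched like this (a hypothetical helper; the PR's actual function and logger names may differ):

```python
import logging

logger = logging.getLogger("provisioning_retry")

def track_etag(previous_etag, current_etag):
    """Compare etags across polls for observability only.

    Logs a warning when the etag changes (meaning an external actor such as
    Azure Policy modified the resource) and returns the etag to carry into
    the next poll. Never alters the retry control flow.
    """
    if previous_etag is not None and current_etag != previous_etag:
        logger.warning("etag changed from %s to %s: external modification detected",
                       previous_etag, current_etag)
    return current_etag
```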

Testing Guide
Attaching testing script. #32939 (comment)

History Notes

[Component Name 1] BREAKING CHANGE: az command a: Make some customer-facing breaking change
[Component Name 2] az command b: Add some customer-facing feature



Copilot AI review requested due to automatic review settings March 9, 2026 08:41
@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Mar 9, 2026

️✔️AzureCLI-FullTest (all modules passed on latest, 3.12, and 3.13)

@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Mar 9, 2026

️✔️AzureCLI-BreakingChangeTest
️✔️Non Breaking Changes

@yonzhan
Collaborator

yonzhan commented Mar 9, 2026

Thank you for your contribution! We will review the pull request and get back to you soon.

@github-actions

github-actions bot commented Mar 9, 2026

The git hooks are available for azure-cli and azure-cli-extensions repos. They could help you run required checks before creating the PR.

Please sync the latest code with the latest dev branch (for azure-cli) or main branch (for azure-cli-extensions).
After that please run the following commands to enable git hooks:

pip install azdev --upgrade
azdev setup -c <your azure-cli repo path> -r <your azure-cli-extensions repo path>

@microsoft-github-policy-service microsoft-github-policy-service bot added the Auto-Assign Auto assign by bot label Mar 9, 2026
@microsoft-github-policy-service microsoft-github-policy-service bot added the ARM az resource/group/lock/tag/deployment/policy/managementapp/account management-group label Mar 9, 2026
@microsoft-github-policy-service microsoft-github-policy-service bot added the AKS az aks/acs/openshift label Mar 9, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR adds an opt-in two-phase provisioningState retry mechanism to the Azure CLI test SDK's CheckerMixin, designed to handle race conditions where external actors (e.g., Azure Policy) may modify a resource's provisioningState between when a command completes and when the test checks it.

Changes:

  • New methods on CheckerMixin (_is_provisioning_state_check, _should_retry_for_provisioning_state, _cmd_with_retry) implementing the two-phase check logic.
  • cmd() overrides in ScenarioTest (live-mode only) and LiveScenarioTest to invoke the retry path when enabled via AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true.
  • New test file covering both _should_retry_for_provisioning_state and _cmd_with_retry across multiple scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/azure-cli-testsdk/azure/cli/testsdk/base.py | Core retry logic: new CheckerMixin methods and cmd() overrides for ScenarioTest/LiveScenarioTest |
| src/azure-cli-testsdk/azure/cli/testsdk/scenario_tests/tests/test_provisioning_retry.py | Unit tests covering all retry scenarios including happy path, polling, etag change, timeout, and two-phase validation |


@yonzhan yonzhan removed the ARM az resource/group/lock/tag/deployment/policy/managementapp/account management-group label Mar 9, 2026
@deveshdama deveshdama force-pushed the provisioningstate-retry branch 6 times, most recently from 6111c61 to d1764bc Compare March 9, 2026 17:25
Add opt-in retry logic to the Azure CLI test SDK for handling
provisioningState race conditions caused by external modifications
(e.g., Azure Policy) during live tests.

When enabled via AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true, the
test framework's cmd() method uses a two-phase check approach:

Phase 1: Validate provisioningState == Succeeded. If not, poll the
resource via 'az resource show --ids' with exponential backoff until
it reaches a terminal state (Succeeded, Failed, or Canceled).

Phase 2: Validate all remaining checks against the original command
result, ensuring the operation itself produced correct output.

Features:
- Opt-in via environment variable (zero risk to existing tests)
- Only activates in live mode (playback tests unaffected)
- Exponential backoff with jitter (configurable via env vars)
- ETag change detection for external modification warnings
- Terminal state short-circuit (Failed/Canceled stop polling)
- TimeoutError after configurable max retries
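The backoff-with-jitter polling described in the commit message could look roughly like this. It is a sketch under assumed defaults; the real implementation's parameter names and env-var plumbing are not shown here.

```python
import random
import time

TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}

def poll_provisioning_state(get_state, base_delay=2.0, cap=30.0, max_retries=6):
    """Poll until provisioningState reaches a terminal state.

    get_state: callable returning the current provisioningState string.
    Uses exponential backoff (base_delay * 2**attempt, capped) with full
    jitter; raises TimeoutError after max_retries non-terminal polls.
    """
    state = None
    for attempt in range(max_retries):
        state = get_state()
        if state in TERMINAL_STATES:  # short-circuit: Failed/Canceled also stop polling
            return state
        delay = min(cap, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))  # full jitter avoids thundering herds
    raise TimeoutError(f"provisioningState still {state!r} after {max_retries} polls")
```

Full jitter (a uniform draw over the backoff window) is a common choice here because synchronized retries from parallel test runners would otherwise hit ARM in lockstep.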
@deveshdama deveshdama force-pushed the provisioningstate-retry branch from d1764bc to df2d5cf Compare March 9, 2026 17:30
@deveshdama
Contributor Author

repro_provisioning_retry.py

Attaching repro script for validations.

Member

@FumingZhang FumingZhang left a comment


The changes look good to me based on testing in live mode. However, will the recording file generated during live mode execution, especially when the provisioning state check is retried, still work correctly in replay mode? I’m concerned that there could be additional resource get operations recorded, which might cause a mismatch in replay mode.

@deveshdama
Contributor Author

The changes look good to me based on testing in live mode. However, will the recording file generated during live mode execution, especially when the provisioning state check is retried, still work correctly in replay mode? I’m concerned that there could be additional resource get operations recorded, which might cause a mismatch in replay mode.

The retry logic is gated behind self.is_live, so it never runs during playback.
Also, since the feature is opt-in and only activates when the env var AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true is set, test recording/playback workflows are unaffected.

cc: @FumingZhang

@FumingZhang
Member

FumingZhang commented Mar 11, 2026

The retry logic is gated behind self.is_live, so it never runs during playback.

This is exactly what concerns me. In replay mode, it shouldn't need to send a resource get request using az resource show after the resource is created. However, that request would be recorded in the cassette file if the test running in live mode encountered an issue and retried. In such cases, when the vcr library tries to match the request and response from the cassette file during replay testing, there could be a mismatch. Since some request/response pairs can be ignored, I would like your help to confirm whether this change could cause the newly generated cassette file to fail a subsequent replay test, @deveshdama

@deveshdama
Contributor Author

deveshdama commented Mar 15, 2026

The retry logic is gated behind self.is_live, so it never runs during playback.

This is exactly what concerns me. In replay mode, it shouldn't need to send a resource get request using az resource show after the resource is created. However, that request would be recorded in the cassette file if the test running in live mode encountered an issue and retried. In such cases, when the vcr library tries to match the request and response from the cassette file during replay testing, there could be a mismatch. Since some request/response pairs can be ignored, I would like your help to confirm whether this change could cause the newly generated cassette file to fail a subsequent replay test, @deveshdama

The retry is gated behind if self.is_live. During live recording, extra poll requests get recorded into the cassette. During replay, the retry is skipped, so those entries sit unconsumed. VCR is strict on missing entries (fails if a request has no match) but lenient on extra entries (silently ignores unconsumed ones). The only side effect is a slightly larger YAML file.

I verified this by tracing through vcrpy's source:

This script demonstrates the behavior:
https://gist.github.com/deveshdama/4a94fb462660731170d9548cd1df5c1b

pip install vcrpy && python3 vcr_proof_demo.py

Output:

PHASE 2: REPLAY  (record_mode='once', only 1 request, 2 skipped)
   Consumed: 1/3 interactions
   Unconsumed: 2  (the retry poll entries)
   cassette.all_played = False
   >>> VCR exited cleanly. No error from unconsumed entries. <<<

cc: @FumingZhang

@jiasli
Member

jiasli commented Mar 17, 2026

  1. It's better to explain the provisioningState race conditions in detail (in a step by step manner) to help others understand what the problem actually is.

  2. If I understand correctly, nowadays, running live tests in Microsoft-managed subscriptions is not a standard approach. This PR may mitigate some problems for a while, but when new policies are deployed, other issues may emerge. The standard is to use BAMI tenants and subscriptions.

  3. Minor: It's better to use the correct Markdown syntax in the PR description to make it more readable.

@jiasli
Member

jiasli commented Mar 17, 2026

Claude Opus 4.6 states this PR shows strong AI-aiding signals. If true, it may be worth considering whether we really need this level of complexity (over-engineering?), as it will make reviewing and maintaining harder and require more resources for testing.

Pasting Claude Opus 4.6's answer below:


Based on analyzing the PR's code, commit messages, PR description, and overall patterns, there are several strong signals suggesting this PR was significantly aided by AI:


🔍 Evidence

1. Highly polished commit message structure

The single commit message reads like a well-formatted AI-generated summary — bullet points with marketing-style feature descriptions:

  • "Opt-in via environment variable (zero risk to existing tests)"
  • "Exponential backoff with jitter (configurable via env vars)"
  • "Terminal state short-circuit (Failed/Canceled stop polling)"

This level of structured, polished prose in a commit message is uncommon for typical human-written commits, which tend to be terser.

2. PR description ≈ commit message ≈ code docstrings

The PR description, commit message body, and inline code docstrings all repeat nearly identical phrasing word-for-word (e.g., "two-phase check approach", "Phase 1/Phase 2" descriptions). AI tools tend to generate consistent descriptions across artifacts from the same prompt.

3. Test file is suspiciously comprehensive for a single commit

The test file (test_provisioning_retry.py) is 406 lines with 13 well-structured test cases covering every edge case (happy path, timeout, terminal failure, no resource ID, ETag change, two-phase combinations). This level of test coverage in a single commit — with consistent naming conventions, detailed docstrings for every test, and clean mock patterns — is characteristic of AI-generated test code.

4. Boilerplate docstring style

Every method has a precisely formatted docstring:

def _cmd_with_retry(self, command, checks, cli_ctx):
    """Execute command with two-phase check validation.

    Phase 1: Validate provisioningState == Succeeded, retrying with poll if needed.
    Phase 2: Validate all remaining checks on the original result.

    Uses etag to detect external modifications (e.g. Azure Policy).
    """

This is clean but feels templated — the style is more consistent than typical human code where docstring quality tends to vary.

5. PR description has unfilled boilerplate

The history notes section was left as template text:

[Component Name 1] BREAKING CHANGE: az command a: Make some customer-facing breaking change

This is common when AI generates the PR body from a template prompt but the author doesn't review/clean up the parts that need manual input.

6. Copilot reviewer suggestion was adopted verbatim

The _is_provisioning_state_check method in the latest commit matches the Copilot reviewer's suggested code almost exactly, which is fine (that's the purpose of suggestions), but combined with other signals, it reinforces the AI-assisted pattern.

7. Code and test style uniformity

The production code and test code show unusually consistent style — same commenting patterns, same level of detail, same structural approach. Human-written code typically shows subtle style shifts between production and test code.


📊 Assessment

| Signal | Strength |
| --- | --- |
| Commit message polish | 🟡 Medium-High |
| Description/docstring repetition | 🟡 Medium |
| Test coverage completeness in single commit | 🔴 High |
| Boilerplate left unfilled | 🟡 Medium |
| Overall code uniformity | 🟡 Medium |

Verdict: Likely AI-assisted — The code quality is decent and the logic is sound, but the uniformity of style, the comprehensiveness of tests in a single pass, and the polished-but-template-like documentation strongly suggest significant AI involvement in generating the code and tests. The author likely used an AI tool (possibly GitHub Copilot or ChatGPT) to scaffold the implementation and tests, then made targeted adjustments.

Note: AI-assisted code isn't inherently bad — the code itself is functional and well-tested. The concern is more about whether the author fully understands every nuance (e.g., the replay-mode question raised by @FumingZhang, or your question about the ETag logic explanation).

@deveshdama
Contributor Author

> 1. It's better to explain the provisioningState race conditions in detail (in a step-by-step manner) to help others understand what the problem actually is.
> 2. If I understand correctly, nowadays, running live tests in Microsoft-managed subscriptions is not a standard approach. This PR may mitigate some problems for a while, but when new policies are deployed, other issues may emerge. The standard is to use BAMI tenants and subscriptions.
> 3. Minor: It's better to use the correct Markdown syntax in the PR description to make it more readable.

1. I've tried to explain the issue and the fix in detail in the PR description, and added a table showing the exact sequence from one of the failures we investigated.
2. @FumingZhang can help clarify this one.
3. Done, thanks for pointing it out.

@deveshdama
Contributor Author

Claude Opus 4.6 states this PR shows strong AI aiding signals. If it is true, it may be worth considering if we really need this level of complexity (over engineering?), as this will make reviewing and maintaining harder and require more resources for testing.


@jiasli
Copilot has assisted with most of the work. We've been looking at this issue for some time now, and it took some thorough investigation from @FumingZhang to get to the root cause. I can share some Kusto query results if that helps. This is an intermittent issue that generates a lot of false positives in our test runs.
Last 30 days stats:
FailureCount : 204
DistinctTests : 85

On the over-engineering concern: _cmd_with_retry is about 60 lines; the rest is just tests and guard checks. Since this touches the test framework itself, we wanted good coverage and don't want to introduce a bug here. Happy to make changes if you have additional feedback on improving this. This is an opt-in feature and is turned off by default.

On whether we need this: @FumingZhang can comment on the BAMI tenant migration as the long-term fix here. For now, my team has been ignoring all failures with this error message, which could cause us to miss actual failures.
