Fix silent taint loss under concurrent spot interruptions#1278
Open
mcornea wants to merge 1 commit into
Open
Conversation
addTaint uses a full-object Nodes().Update() that conflicts with concurrent cordon operations (which also modify the node object) due to resourceVersion mismatch (HTTP 409 Conflict). Under high WORKERS values (>=30), the 5-second retry budget with 750ms fixed intervals is exhausted for ~6-13% of taint attempts. Additionally, the PreDrainTask closure in spot-itn-event.go swallows the taint error (returns nil), so the caller sees success, processing continues to cordon and drain, and the SQS message is deleted. No retry ever occurs for the failed taint. Fix by: - Replacing Nodes().Update() with Nodes().Patch() using StrategicMergePatchType so taint patches don't conflict with concurrent cordon patches (they target different fields) - Always fetching fresh node state before building the patch (removes the wasted first retry on a stale DeepCopy) - Increasing retry budget from 5s/750ms to 15s/500ms - Propagating the taint error from PreDrainTask so the SQS message is not deleted and the event is retried on the next poll cycle Signed-off-by: Marius Cornea <mcornea@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue #, if available:
#1277
Description of changes:
addTaint uses a full-object Nodes().Update() that conflicts with concurrent cordon operations due to resourceVersion mismatch (HTTP 409 Conflict). Under high WORKERS values (>=30), the 5-second retry budget with 750ms fixed intervals is exhausted for ~6-13% of taint attempts. Additionally, the PreDrainTask closure in spot-itn-event.go swallows the taint error (returns nil), so the caller sees success, processing continues to cordon and drain, and the SQS message is deleted. No retry ever occurs for the failed taint.
This fix:
How you tested your changes:
Environment (Linux / Windows): Linux (RHEL CoreOS 9.8)
Kubernetes Version: v1.35.4 (OpenShift 4.22)
Tested on a ROSA HCP cluster with 100 spot instances (c5a.xlarge). Injected 100 simultaneous synthetic SQS spot interruption messages and verified that all 100 nodes were tainted at WORKERS=50. Before the fix, only 87-91 out of 100 taints succeeded at that concurrency level.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.