Fix silent taint loss under concurrent spot interruptions by mcornea · Pull Request #1278 · aws/aws-node-termination-handler

mcornea · 2026-05-20T14:29:24Z

Issue #, if available:
#1277

Description of changes:
addTaint uses a full-object Nodes().Update() that conflicts with concurrent cordon operations due to resourceVersion mismatch (HTTP 409 Conflict). Under high WORKERS values (>=30), the 5-second retry budget with 750ms fixed intervals is exhausted for ~6-13% of taint attempts. Additionally, the PreDrainTask closure in spot-itn-event.go swallows the taint error (returns nil), so the caller sees success, processing continues to cordon and drain, and the SQS message is deleted. No retry ever occurs for the failed taint.

This fix:

- Replaces Nodes().Update() with Nodes().Patch() using StrategicMergePatchType so taint patches don't conflict with concurrent cordon patches (they target different fields: spec.taints vs spec.unschedulable)
- Always fetches fresh node state before building the patch (removes the wasted first retry on a stale DeepCopy)
- Increases retry budget from 5s/750ms to 15s/500ms
- Propagates the taint error from PreDrainTask so the SQS message is not deleted and the event is retried on the next poll cycle
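As a rough illustration of the first two items, the sketch below builds a patch body that touches only spec.taints, starting from a freshly read taint list. It uses only the standard library: the `taint` struct is a local mirror of the key/value/effect fields rather than the client-go type, `buildTaintPatch` is a hypothetical helper, and the taint keys are made up. It also assumes the patch carries the complete desired list (existing taints plus the new one), which is one reason to read the node fresh before building it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taint is a local mirror of the three node taint fields used here.
type taint struct {
	Key    string `json:"key"`
	Value  string `json:"value,omitempty"`
	Effect string `json:"effect"`
}

// buildTaintPatch returns a JSON body that sets spec.taints and nothing else.
// existing is assumed to come from a fresh read of the node.
func buildTaintPatch(existing []taint, add taint) ([]byte, error) {
	desired := make([]taint, 0, len(existing)+1)
	found := false
	for _, t := range existing {
		if t.Key == add.Key && t.Effect == add.Effect {
			t = add // already present: refresh the value instead of duplicating
			found = true
		}
		desired = append(desired, t)
	}
	if !found {
		desired = append(desired, add)
	}
	// spec.unschedulable is deliberately absent, so a concurrent cordon
	// patch on that field is not overwritten by this body.
	patch := map[string]any{
		"spec": map[string]any{"taints": desired},
	}
	return json.Marshal(patch)
}

func main() {
	existing := []taint{{Key: "example.com/dedicated", Value: "batch", Effect: "NoSchedule"}}
	add := taint{Key: "example.com/spot-itn", Value: "i-0123", Effect: "NoSchedule"}

	body, err := buildTaintPatch(existing, add)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

In the real handler a body like this would be passed to Nodes().Patch() with StrategicMergePatchType inside the retry loop; that call is omitted here to keep the sketch free of external dependencies.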

How you tested your changes:
Environment (Linux / Windows): Linux (RHEL CoreOS 9.8)
Kubernetes Version: v1.35.4 (OpenShift 4.22)

Tested on a ROSA HCP cluster with 100 spot instances (c5a.xlarge). Injected 100 simultaneous synthetic SQS spot interruption messages and verified that all 100 nodes were tainted at WORKERS=50. Before the fix, only 87-91 out of 100 taints succeeded at that concurrency level.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

addTaint uses a full-object Nodes().Update() that conflicts with concurrent cordon operations (which also modify the node object) due to resourceVersion mismatch (HTTP 409 Conflict). Under high WORKERS values (>=30), the 5-second retry budget with 750ms fixed intervals is exhausted for ~6-13% of taint attempts. Additionally, the PreDrainTask closure in spot-itn-event.go swallows the taint error (returns nil), so the caller sees success, processing continues to cordon and drain, and the SQS message is deleted. No retry ever occurs for the failed taint.

Fix by:
- Replacing Nodes().Update() with Nodes().Patch() using StrategicMergePatchType so taint patches don't conflict with concurrent cordon patches (they target different fields)
- Always fetching fresh node state before building the patch (removes the wasted first retry on a stale DeepCopy)
- Increasing retry budget from 5s/750ms to 15s/500ms
- Propagating the taint error from PreDrainTask so the SQS message is not deleted and the event is retried on the next poll cycle

Signed-off-by: Marius Cornea <mcornea@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mcornea requested a review from a team as a code owner May 20, 2026 14:29

mcornea wants to merge 1 commit into aws:main from mcornea:fix-taint-concurrency
