[FLINK-39513][checkpointing] Parameterize allowNonRestoredState in restoreInitialCheckpointIfPresent by hemmatio · Pull Request #27991 · apache/flink

hemmatio · 2026-04-21T19:34:34Z

What is the purpose of the change

CheckpointCoordinator.restoreInitialCheckpointIfPresent hardcodes allowNonRestoredState=false, so checkpoint restore on JobManager startup rejects state that cannot be mapped to any operator in the current JobGraph, regardless of execution.savepoint.ignore-unclaimed-state / execution.state-recovery.ignore-unclaimed-state. CheckpointCoordinator.restoreSavepoint already honors the flag via a parameter; this PR does the same for the checkpoint-restore path.

Addresses FLINK-39513. See the ticket for history and full context.

Brief change log

CheckpointCoordinator.restoreInitialCheckpointIfPresent takes allowNonRestoredState as a parameter and forwards it to restoreLatestCheckpointedStateInternal instead of hardcoding false.
DefaultExecutionGraphFactory.createAndRestoreExecutionGraph reads StateRecoveryOptions.SAVEPOINT_IGNORE_UNCLAIMED_STATE from its Configuration and passes it in. Configuration is the source (rather than jobGraph.getSavepointRestoreSettings()) because SavepointRestoreSettings.fromConfiguration returns .none() when no savepoint path is set, which is always the case on this path.
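
The reason for reading the flag from Configuration can be illustrated with a small stand-alone model. The sketch below does not use any Flink classes: `Settings`, `fromConfiguration`, `readFlag`, and the savepoint-path key string are simplified, hypothetical stand-ins, and only the ignore-unclaimed-state key is taken from this description. It shows how a settings object that collapses to "none" when no savepoint path is set would hide the flag on the checkpoint-restore path.

```java
import java.util.Collections;
import java.util.Map;

/** Stand-alone model; none of these types are Flink classes. */
final class FlagSourceSketch {
    // Option key as named in the PR description.
    static final String IGNORE_UNCLAIMED_KEY = "execution.state-recovery.ignore-unclaimed-state";
    // Assumed key for the savepoint path; used here only for illustration.
    static final String SAVEPOINT_PATH_KEY = "execution.state-recovery.path";

    /** Hypothetical, simplified stand-in for SavepointRestoreSettings. */
    static final class Settings {
        static final Settings NONE = new Settings(null, false);

        final String restorePath;
        final boolean allowNonRestoredState;

        Settings(String restorePath, boolean allowNonRestoredState) {
            this.restorePath = restorePath;
            this.allowNonRestoredState = allowNonRestoredState;
        }

        // Mirrors the behavior described above: without a savepoint path the result is "none".
        static Settings fromConfiguration(Map<String, String> conf) {
            String path = conf.get(SAVEPOINT_PATH_KEY);
            return path == null ? NONE : new Settings(path, readFlag(conf));
        }
    }

    /** Reads the flag directly from the configuration map, defaulting to false. */
    static boolean readFlag(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(IGNORE_UNCLAIMED_KEY, "false"));
    }

    public static void main(String[] args) {
        // Checkpoint-restore path: the flag is set, but no savepoint path is configured.
        Map<String, String> conf = Collections.singletonMap(IGNORE_UNCLAIMED_KEY, "true");
        System.out.println("via settings: " + Settings.fromConfiguration(conf).allowNonRestoredState);
        System.out.println("via configuration: " + readFlag(conf));
    }
}
```

In this model the settings object reports false while the direct configuration lookup reports true, which is the situation the change log describes.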

Verifying this change

This change added tests and can be verified as follows:

Added a pair of tests in CheckpointCoordinatorRestoringTest covering both allowNonRestoredState=true (orphaned state silently skipped) and allowNonRestoredState=false (throws IllegalStateException, preserving existing strict behavior).
Updated two existing CheckpointCoordinatorRestoringTest call sites to pass false explicitly.
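
The two new cases can be pictured with a plain-Java stand-in. This is not the code added to CheckpointCoordinatorRestoringTest: `restore` is a hypothetical helper that mimics only the contract stated above (skip unmapped state when the flag is true, throw IllegalStateException when it is false), and the operator names and exception message are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Stand-alone model of the restore contract; not Flink code. */
final class RestoreContractSketch {
    /** Hypothetical restore helper. Returns the operator IDs whose state was restored. */
    static Set<String> restore(
            Set<String> checkpointOps, Set<String> graphOps, boolean allowNonRestoredState) {
        Set<String> restored = new HashSet<>();
        for (String op : checkpointOps) {
            if (graphOps.contains(op)) {
                restored.add(op);
            } else if (!allowNonRestoredState) {
                // Strict mode: state without a matching operator is an error.
                throw new IllegalStateException("No operator in the graph for state of " + op);
            }
            // Otherwise the orphaned state is skipped.
        }
        return restored;
    }

    public static void main(String[] args) {
        Set<String> checkpointOps = new HashSet<>(Arrays.asList("source", "removed-map"));
        Set<String> graphOps = new HashSet<>(Arrays.asList("source", "sink"));

        // allowNonRestoredState=true: the orphaned "removed-map" state is skipped.
        System.out.println(restore(checkpointOps, graphOps, true)); // prints [source]

        // allowNonRestoredState=false: the strict behavior is preserved.
        try {
            restore(checkpointOps, graphOps, false);
            throw new AssertionError("expected IllegalStateException");
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```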

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes (Checkpointing)
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

Yes (Claude Code)

flinkbot · 2026-04-21T19:42:28Z

CI report:

65c38c0 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

spuru9 · 2026-04-21T20:19:38Z

@@ -186,7 +187,8 @@ public ExecutionGraph createAndRestoreExecutionGraph(
        if (checkpointCoordinator != null) {
            // check whether we find a valid checkpoint
            if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
-                    new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
+                    new HashSet<>(newExecutionGraph.getAllVertices().values()),
+                    configuration.get(SavepointConfigOptions.SAVEPOINT_IGNORE_UNCLAIMED_STATE))) {


SavepointConfigOptions was removed as part of the 2.0 release:

flink/docs/content/release-notes/flink-2.0.md, line 373 in 9373256

- `org.apache.flink.runtime.jobgraph.SavepointConfigOptions`

Can you check for an alternative?

hemmatio (Apr 22, 2026): I've moved to using StateRecoveryOptions instead, which is the replacement as of 2.0.

spuru9 · 2026-04-21T20:29:20Z

@@ -1113,7 +1113,7 @@ void testRestoreFinishedStateWithoutInFlightData() throws Exception {
                        .build(graph);

        ExecutionJobVertex vertex = graph.getJobVertex(jobVertexID);
-        coord.restoreInitialCheckpointIfPresent(Collections.singleton(vertex));
+        coord.restoreInitialCheckpointIfPresent(Collections.singleton(vertex), false);


Can you add tests for allowNonRestoredState=true (mentioned in the PR description)?

hemmatio (Apr 21, 2026): I forgot to push my amended commit, which contains the tests.

gaborgsomogyi (Apr 23, 2026): Are we waiting on something here?

gaborgsomogyi · 2026-04-22T10:23:03Z

I'm not sure under which circumstances this could happen from the user's perspective. IIUC, the job fails and tries to restore from a checkpoint, right? In that situation, the only case I can imagine where the operators differ is when the SQL plan changes for who knows what reason. We've seen the same SQL end up with a different plan. Can you elaborate, please?

gaborgsomogyi (review, changes requested): Until this is answered, I'm leaving my possible objection here to avoid accidents.

hemmatio · 2026-04-22T14:07:53Z

@gaborgsomogyi: Can you elaborate please?

The most common trigger (from what we have seen internally) is user code changes across deployments. This happens with the Flink Kubernetes Operator's last-state upgrade mode as follows:

The user modifies their job (adds, removes, or renames an operator, which is the standard use case for allowNonRestoredState).
The Kubernetes Operator performs a last-state upgrade: it tears down the cluster without taking a fresh savepoint and relies on checkpoint metadata for the restore.
If the prior JM's shutdown was not graceful (e.g., a crash during the upgrade, OOM, or pod eviction), the per-job HA ConfigMap survives. The new JM rebuilds the graph from the modified user code, finds the old checkpoint, and hits state that doesn't map to any operator in the new graph.
restoreInitialCheckpointIfPresent rejects it via the hardcoded false, regardless of the value of execution.state-recovery.ignore-unclaimed-state.

The asymmetry is the main issue: the same job, with the same allowNonRestoredState: true, has two different outcomes depending on the upgrade mode:

upgradeMode: savepoint: works, orphaned state skipped, job restores
upgradeMode: last-state: fails with IllegalStateException: There is no operator for the state ...
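
The asymmetry can be reduced to a tiny model. The sketch below is not Flink code: the method names echo the ones discussed in this PR, but the bodies, the `Outcome` enum, and the `hasOrphanedState` input are hypothetical simplifications that keep only the flag handling.

```java
/** Stand-alone model of the two restore paths; not Flink code. */
final class UpgradeModeAsymmetrySketch {
    enum Outcome { RESTORED, REJECTED }

    /** Savepoint path: the caller's flag is honored. */
    static Outcome restoreSavepoint(boolean hasOrphanedState, boolean allowNonRestoredState) {
        return check(hasOrphanedState, allowNonRestoredState);
    }

    /** Checkpoint path before this PR: the flag is hardcoded to false. */
    static Outcome restoreInitialCheckpointBefore(boolean hasOrphanedState) {
        return check(hasOrphanedState, false);
    }

    /** Checkpoint path after this PR: the flag is a parameter. */
    static Outcome restoreInitialCheckpointAfter(
            boolean hasOrphanedState, boolean allowNonRestoredState) {
        return check(hasOrphanedState, allowNonRestoredState);
    }

    private static Outcome check(boolean hasOrphanedState, boolean allowNonRestoredState) {
        return (hasOrphanedState && !allowNonRestoredState) ? Outcome.REJECTED : Outcome.RESTORED;
    }

    public static void main(String[] args) {
        boolean allow = true; // the user has set allowNonRestoredState: true
        System.out.println("savepoint: " + restoreSavepoint(true, allow));
        System.out.println("last-state (before): " + restoreInitialCheckpointBefore(true));
        System.out.println("last-state (after): " + restoreInitialCheckpointAfter(true, allow));
    }
}
```

With orphaned state present and the flag set, only the "before" checkpoint path rejects the restore in this model.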

This problem was also reported in FLINK-30638, where Gyula Fora correctly diagnosed it as a runtime issue in the comments. This PR finally addresses the bug.

The SQL plan drift case would also hit the same code path. Essentially, anything that produces a mismatch between the checkpoint's operator ID set and the new JobGraph's operator IDs will trigger it.
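
The mismatch condition amounts to a set difference between the operator IDs recorded in the checkpoint and those in the new JobGraph. A minimal sketch, assuming operator IDs can be represented as plain strings (the helper name and the sample IDs are hypothetical and not taken from Flink):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Stand-alone illustration of the operator ID mismatch check; not Flink code. */
final class UnmappedStateSketch {
    /** Returns the checkpointed operator IDs that have no counterpart in the job graph. */
    static SortedSet<String> unmappedOperatorIds(
            Set<String> checkpointOperatorIds, Set<String> jobGraphOperatorIds) {
        SortedSet<String> unmapped = new TreeSet<>(checkpointOperatorIds);
        unmapped.removeAll(jobGraphOperatorIds);
        return unmapped;
    }

    public static void main(String[] args) {
        Set<String> checkpointIds = new HashSet<>(Arrays.asList("a1", "b2", "c3"));
        // "b2" was removed from the job and "d4" was added.
        Set<String> jobGraphIds = new HashSet<>(Arrays.asList("a1", "c3", "d4"));

        // A non-empty result is what triggers the rejection when the flag is false.
        System.out.println(unmappedOperatorIds(checkpointIds, jobGraphIds)); // prints [b2]
    }
}
```

Operators that exist only in the new graph ("d4" here) do not count as a mismatch in this model; they start without restored state.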

gaborgsomogyi · 2026-04-23T08:09:39Z

We've had a discussion with Gyula and taken a look at the surrounding operator code. Based on that, I now see the picture, and directionally this is good to go. I'll take a look at the details and take care of this PR.

Commit 65c38c0: [FLINK-39513][checkpointing] Parameterize allowNonRestoredState in restoreInitialCheckpointIfPresent

hemmatio changed the title ~~Parameterize allowNonRestoredState in restoreInitialCHeckpointIfPresent~~ [FLINK-39513][checkpointing] Parameterize allowNonRestoredState in restoreInitialCheckpointIfPresent Apr 21, 2026

spuru9 suggested changes Apr 21, 2026

hemmatio force-pushed the hemmatio-fix-allow-non-restored-state branch from 2b39a69 to 65c38c0 on April 21, 2026 20:43

github-actions bot added the community-reviewed label (PR has been reviewed by the community) on Apr 22, 2026

gaborgsomogyi requested changes Apr 22, 2026

hemmatio wants to merge 1 commit into apache:master from Shopify:hemmatio-fix-allow-non-restored-state
