fix: fix confignode and ainode restart function#17862
Open
Alchuang22-dev wants to merge 7 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves restart robustness for ConfigNode and AINode by explicitly modeling partial startup states, tightening restart validation, and ensuring startup failures are surfaced rather than allowing unsafe recovery paths.
Changes:
- Add explicit ConfigNode startup state classification (first start / restart / partial / corrupted) and enforce stricter startup gating.
- Propagate consensus initialization failures and delay ConfigNode
system.propertiespersistence until post-consensus + leader election + seed apply succeeds. - Harden AINode restart behavior to preserve local
ainode_id, avoid implicit re-registration, and back up localsystem.propertieson failure; add unit tests across these behaviors.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/service/ConfigNodeStartupPersistenceTest.java | New tests ensuring system.properties is only persisted after successful seed apply. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/node/NodeManagerTest.java | New tests for applyConfigNode failure propagation and AINode restart update error handling. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/node/ClusterNodeStartUtilsTest.java | New tests rejecting AINode restart when endpoint changes. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/consensus/ConsensusManagerTest.java | Expanded tests for consensus start behavior on first start vs restart and failure cases. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/conf/SystemPropertiesUtilsTest.java | New tests for startup state classification (first/restart/partial/corrupted). |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/service/ConfigNode.java | Persist system properties only after successful seed apply via consensus. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/node/NodeManager.java | Return/propagate failure statuses for config node apply and harden AINode restart updates. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/node/ClusterNodeStartUtils.java | Add AINode endpoint-change detection during restart confirmation. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/consensus/ConsensusManager.java | Make consensus start fail fast for invalid startup states and on local peer creation failures. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/SystemPropertiesUtils.java | Introduce explicit startup states and stricter restart detection across system+consensus storage. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/ConfigNodeStartupCheck.java | Enforce startup gating based on explicit startup state classification. |
| iotdb-core/ainode/tests/test_ai_node_restart.py | New unit tests for AINode restart failure handling and local state preservation. |
| iotdb-core/ainode/iotdb/ainode/core/rpc/client.py | Make node_restart return status and avoid implicit re-registration messaging on rejection. |
| iotdb-core/ainode/iotdb/ainode/core/config.py | Load system.properties after applying configured system dir and preserve unregistered id default. |
| iotdb-core/ainode/iotdb/ainode/core/ai_node.py | Add backup + atomic write helpers for system.properties and harden restart path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+126
to
+128
| if (file.isFile() || !file.isDirectory()) { | ||
| return LocalStorageState.PRESENT; | ||
| } |
Comment on lines
+338
to
+341
| if "ain_inference_model_mem_usage_map" in config_keys: | ||
| self._config.set_ain_inference_model_mem_usage_map( | ||
| eval(file_configs["ain_inference_model_mem_usage_map"]) | ||
| ) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Improve restart robustness for ConfigNode and AINode by making partial startup states explicit and preventing unsafe recovery paths.
Content
Here are the changes:
ConsensusManager.start()so consensus initialization failures are propagated instead of being silently ignored.system.propertiesonly after consensus group creation, leader election, and seed ConfigNode application all succeed.ainode_id, back up oldsystem.propertieson failure, and avoid implicit re-registration with a new AINode id.Added unit tests for:
This PR has:
for an unfamiliar reader.
for code coverage.
PTAL.