Implement graceful shutdown procedure #8851
whitslack wants to merge 2 commits into ElementsProject:master
I just used this again to gracefully shut down my node, and this time my established channels count (but not my outstanding HTLCs count) reached zero, but then the node apparently restarted itself! I tried again, and the same thing happened again: the script works down the number of connected peers to zero, and then the daemon spontaneously restarts itself. I don't have any kind of service supervisor managing it, so it has to be something
@whitslack are you able to try and resolve the CI errors so we can review and get this into 26.06? Planning for a release candidate on 11 May.
@madelinevibes: Honestly, probably not. No PR that I have ever opened on this project has passed CI checks, so I can't imagine they were ever working to begin with. It has been a continual source of frustration for me, as obviously I would like to submit PRs that pass all the checks, but the failures don't ever seem related to anything I've touched. In the present case I am no longer able to view the CI logs for this PR, so I wouldn't know where to begin.
Very fair.
Heh, we will restart if we spot the versions of subdaemons have changed underneath us. This usually is because you are running in place and do a
When "snub-idle-channels" is set to true, lightningd will no longer spawn channeld subdaemons for channels that have no outstanding HTLCs, and it will cease trying to auto-reconnect to peers with whom we have no outstanding HTLCs. Incoming channel_reestablish messages for these idle channels will cause lightningd to reply to the peer with a warning explaining that we are temporarily declining to reestablish the channel. Since we do not send our own channel_reestablish, the peer is unable to add any HTLCs to the channel (or make any other updates to the channel).

The reason we might want to do this is so we can halt a node gracefully by progressively snubbing more and more channels as they become idle until eventually we have no outstanding HTLCs whatsoever and also no possibility of any new HTLCs being added. At that point, we can safely take our node offline for an extended duration with no possibility that any of our channels will be unilaterally closed due to HTLC deadlines while we are offline.

Changelog-Added: New `snub-idle-channels` dynamic config variable makes CLN temporarily stop spawning channeld subdaemons for channels with no HTLCs, as a means to achieve a safe node shutdown.

Issue: ElementsProject#4842
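The behaviour described in this commit message can be sketched as two small decision functions. This is only a model in Python, not the C code in lightningd: `Channel`, `Action`, `on_channel_reestablish`, and `should_auto_reconnect` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    REESTABLISH = "reestablish"  # spawn channeld and reestablish as usual
    SNUB = "snub"                # reply with a warning, spawn no channeld


@dataclass
class Channel:
    # Hypothetical stand-in for lightningd's per-channel state.
    outstanding_htlcs: int


def on_channel_reestablish(channel: Channel, snub_idle_channels: bool) -> Action:
    """Decide how to react to an incoming channel_reestablish."""
    if snub_idle_channels and channel.outstanding_htlcs == 0:
        # Idle channel while snubbing: decline, so no new HTLCs can be added.
        return Action.SNUB
    return Action.REESTABLISH


def should_auto_reconnect(channels: Sequence[Channel], snub_idle_channels: bool) -> bool:
    """True means 'no change to the usual reconnect behaviour'."""
    if not snub_idle_channels:
        return True
    # While snubbing, only chase peers with whom we still have HTLCs in flight.
    return any(c.outstanding_htlcs > 0 for c in channels)
```

With the flag set, a channel with zero HTLCs is snubbed, while a channel that still carries HTLCs is reestablished so that those HTLCs can resolve.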
This script utilizes the new "snub-idle-channels" knob to attempt to stop a CLN node gracefully. The script sets the snub flag and then starts forcibly disconnecting peers that have one or more reestablished channels but no outstanding HTLCs. When both the number of reestablished channels and the number of outstanding HTLCs reach zero, the script stops the node. If this does not occur before a user-specified timeout, then the script exits with an error and reports the block height and approximate time until the next outstanding HTLC expires.

Changelog-Added: `contrib/lightning-graceful-stop.sh` attempts to stop a node without leaving any outstanding HTLCs.

Closes: ElementsProject#4842
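The control flow of the script, as described above, might look roughly like the following. This is a Python model rather than the shell script itself: the node is reached only through injected callables (`set_snub`, `list_peers`, `disconnect`, `stop_node`), all of which are hypothetical and do not correspond to real CLN RPC signatures. Whether the real script clears the flag on timeout is not stated here, so this sketch leaves it set.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PeerState:
    # Hypothetical snapshot of one peer, not a real CLN RPC result.
    reestablished_channels: int
    htlc_expiries: List[int]  # expiry block heights of outstanding HTLCs


@dataclass
class Result:
    stopped: bool
    next_expiry: Optional[int]  # earliest HTLC expiry height if we gave up


def graceful_stop(
    set_snub: Callable[[bool], None],
    list_peers: Callable[[], Dict[str, PeerState]],
    disconnect: Callable[[str], None],
    stop_node: Callable[[], None],
    timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Result:
    set_snub(True)  # from now on, idle channels are not reestablished
    deadline = clock() + timeout
    while True:
        peers = list_peers()
        channels = sum(p.reestablished_channels for p in peers.values())
        expiries = [h for p in peers.values() for h in p.htlc_expiries]
        if channels == 0 and not expiries:
            stop_node()  # nothing in flight and nothing can be added
            return Result(True, None)
        if clock() >= deadline:
            # Give up and report the earliest outstanding HTLC expiry, if any.
            return Result(False, min(expiries, default=None))
        for peer_id, p in peers.items():
            # Disconnect peers whose channels are up but carry no HTLCs.
            if p.reestablished_channels > 0 and not p.htlc_expiries:
                disconnect(peer_id)
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a caller would treat `Result(stopped=False, ...)` as the error exit.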
rustyrussell left a comment
OK, no. This is very clever, but instead I have based a different approach on this: implementing a `graceful` command which does a similar thing to your script (only it's up to you to call `stop` when it returns).
Not because this is a bad idea, but because using a transient setconfig is dangerous: if you don't make it transient you'll wonder why your node never works well!
BOLT 2 says:
We can abuse this requirement to implement a graceful shutdown procedure:
… `channel_reestablish` messages for any channels that have exactly zero outstanding HTLCs.

This PR has two objectives:
1. … `snub-idle-channels` dynamic config variable that, when set to `true`, makes lightningd:
   - … `channeld` subdaemons for channels that have no outstanding HTLCs;
   - … `channel_reestablish` messages for channels that have no outstanding HTLCs,
2. … `contrib/lightning-graceful-stop.sh` script that utilizes `snub-idle-channels` to implement the graceful shutdown procedure outlined above.

I have tested this graceful shutdown procedure on my own production node with great success. In under a minute my node dropped from over 30 outstanding HTLCs to 14, all of which were "stuck." The shutdown script reported that the next expiration was 140 blocks away, giving me plenty of time to power off my node and perform a hardware upgrade. If I had been willing to wait for all of my outstanding HTLCs to be resolved, then I could have stopped my node indefinitely with no danger of any forced unilateral closures. (Of course, my peers could still voluntarily choose to unilaterally close my channels with them if they grew tired of waiting for my node to reappear in the network, but that's not the concern that graceful shutdown is attempting to address.)
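For a sense of scale, a distance in blocks can be turned into an approximate wall-clock window. The sketch below assumes Bitcoin's target average of one block per 10 minutes; actual block times vary, so this is an estimate only, and `approx_time_until` is a hypothetical helper, not part of the script.

```python
from datetime import timedelta

# Assumption: Bitcoin's target of one block per 10 minutes on average.
TARGET_BLOCK_INTERVAL = timedelta(minutes=10)


def approx_time_until(expiry_height: int, current_height: int) -> timedelta:
    """Rough time remaining until expiry_height, never negative."""
    blocks_left = max(expiry_height - current_height, 0)
    return blocks_left * TARGET_BLOCK_INTERVAL
```

Under that assumption, an expiry 140 blocks away corresponds to roughly 23 hours and 20 minutes.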
Note that there is still one edge case that this graceful shutdown strategy doesn't solve. If a peer has transmitted a new commitment containing a new HTLC, but we never transmitted our own new commitment containing that same new HTLC (either because we never received the peer's new commitment or because we restarted before we could send our own new commitment), then we will not know about (or will have forgotten) the new HTLC, and we will believe that the channel is safe to snub even though the peer would retransmit their new commitment containing the new HTLC if we allowed them to reestablish the channel. I am not certain, but it may be possible to use the fields in the `channel_reestablish` message received from the peer to ascertain whether the peer has new HTLCs that they need to retransmit to us, and if they do, then we shouldn't snub the channel even if we are currently aware of no outstanding HTLCs in it.

Checklist
Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:
tools/lightning-downgrade