72 changes: 69 additions & 3 deletions lightning/src/chain/chainmonitor.rs
@@ -227,6 +227,11 @@ struct MonitorHolder<ChannelSigner: EcdsaChannelSigner> {
/// [`ChannelMonitorUpdate`] which was already applied. While this isn't an issue for the
/// LDK-provided update-based [`Persist`], it is somewhat surprising for users so we avoid it.
pending_monitor_updates: Mutex<Vec<u64>>,
/// Monitor update IDs for which the persister returned `Completed` but we overrode the return
/// to `InProgress` because the channel is post-close (`no_further_updates_allowed()` is true).
/// Resolved during the next monitor event release so that `ChannelManager` sees them complete
/// together with the force-close `MonitorEvent`s.
deferred_monitor_update_completions: Mutex<Vec<u64>>,
}

impl<ChannelSigner: EcdsaChannelSigner> MonitorHolder<ChannelSigner> {
@@ -1054,7 +1059,11 @@ where
if let Some(ref chain_source) = self.chain_source {
monitor.load_outputs_to_watch(chain_source, &self.logger);
}
entry.insert(MonitorHolder { monitor, pending_monitor_updates: Mutex::new(Vec::new()) });
entry.insert(MonitorHolder {
monitor,
pending_monitor_updates: Mutex::new(Vec::new()),
deferred_monitor_update_completions: Mutex::new(Vec::new()),
});

Ok(ChannelMonitorUpdateStatus::Completed)
}
@@ -1305,6 +1314,7 @@ where
entry.insert(MonitorHolder {
monitor,
pending_monitor_updates: Mutex::new(pending_monitor_updates),
deferred_monitor_update_completions: Mutex::new(Vec::new()),
});
Ok(persist_res)
}
@@ -1335,6 +1345,12 @@ where
let logger = WithChannelMonitor::from(&self.logger, &monitor, None);
log_trace!(logger, "Updating ChannelMonitor to id {}", update.update_id,);

// We hold `deferred_monitor_update_completions` through the entire
// update process so that `release_pending_monitor_events` either
// sees both the `MonitorEvent`s generated by `update_monitor` and
// the corresponding deferral entry, or neither.
let mut deferred =
monitor_state.deferred_monitor_update_completions.lock().unwrap();
// We hold a `pending_monitor_updates` lock through `update_monitor` to ensure we
// have well-ordered updates from the users' point of view. See the
// `pending_monitor_updates` docs for more.
@@ -1386,6 +1402,7 @@ where
ChannelMonitorUpdateStatus::UnrecoverableError => {
// Take the monitors lock for writing so that we poison it and any future
// operations going forward fail immediately.
core::mem::drop(deferred);
core::mem::drop(pending_monitor_updates);
core::mem::drop(monitors);
let _poison = self.monitors.write().unwrap();
@@ -1414,7 +1431,27 @@ where
}
}

if update_res.is_err() {
debug_assert!(
update_res.is_ok() || monitor.no_further_updates_allowed(),
"update_monitor returned Err but channel is not post-close",
);

if monitor.no_further_updates_allowed()
Collaborator:

Why did we change this from `update_res.is_err()`?

Contributor Author:

Because we still can't return `Completed` for a successful post-close update if we haven't processed the events. We'd hit the mode check.

Collaborator:

Right, but the existing code already returned `InProgress`? I don't see why this swap made a difference to the return value (this is different from `persist_res`).

Contributor Author:

The existing code didn't always return `InProgress`. We could have a closed channel monitor (that the chan mgr wasn't aware of yet) being fed first a new commitment (returning an error), followed by a preimage update (returning ok) that would, in the existing code, propagate `Completed` up to the chan mgr, triggering the mode check.

Collaborator:

Ah, sorry, yea, that makes sense. My concern, which maybe wasn't super clear, is that we might have some reason to return an `Err` from the update even though `no_further_updates_allowed` is false. `no_further_updates_allowed` is kinda "peeking past the curtain" here; instead I'd very much prefer to check `update_res.is_err()` and also whether there are items in `deferred_monitor_update_completions`.

Contributor Author:

> ...instead I'd very much prefer to check `update_res.is_err()` and also whether there are items in `deferred_monitor_update_completions`.

The problem is a re-entrancy issue during event processing:

1. Updates 3 (commitment_signed) and 4 (preimage) arrive. Both get deferred. `ChannelManager` records them as `InProgress`.
2. `get_and_clear_pending_events()` calls `release_pending_monitor_events()`. This returns a batch containing:
   - `CommitmentTxConfirmed` (from `get_and_clear_pending_monitor_events`)
   - `Completed` for updates 3 and 4 (from `deferred.drain(..)`)
3. `ChannelManager` iterates this batch in order. It processes `CommitmentTxConfirmed` first, which triggers a force close that calls `update_channel` with `ChannelForceClosed` (update 5).
4. That `update_channel` call re-enters `ChainMonitor`. At this point `update_res` is `Ok` (a post-close update type) and `deferred` is empty (it was drained in step 2), so we would return `Completed`.
5. `ChannelManager` panics because updates 3 and 4 are still `InProgress` from its perspective; it hasn't reached the `Completed` events in the batch yet.

With the old `no_further_updates_allowed()` check, update 5 was always deferred because `funding_spend_seen` is a permanent property of the monitor. The proposed condition `update_res.is_err() || !deferred.is_empty()` loses that stickiness, since `deferred` gets drained between step 2 and step 4.

Contributor Author:

I am not sure how we should deal with the residual `is_err` cases.

Collaborator:

Ah, mmm, okay, then maybe we should check both? We can also add a `debug_assert` that if it errors we also have `no_further_updates_allowed`?

Contributor Author:

Still not sure what to do for the `is_err` case if we check both. Added the `debug_assert`.

Collaborator:

Yea, not sure either honestly. ISTM we should at least try? Do what I described, but accept the race as possible if we hit the case (which we shouldn't, and we `debug_assert` it won't)?
&& persist_res == ChannelMonitorUpdateStatus::Completed
{
// The channel is post-close (funding spend seen, lockdown, or holder
// tx signed). Return InProgress so ChannelManager freezes the
// channel until the force-close MonitorEvents are processed. We
// track this update_id in deferred_monitor_update_completions so it
// gets resolved during release_pending_monitor_events, together with
// those MonitorEvents.
pending_monitor_updates.push(update_id);
deferred.push(update_id);
log_debug!(
logger,
"Deferring completion of ChannelMonitorUpdate id {:?} (channel is post-close)",
update_id,
);
ChannelMonitorUpdateStatus::InProgress
} else {
persist_res
@@ -1429,8 +1466,13 @@ where
for (channel_id, update_id) in self.persister.get_and_clear_completed_updates() {
let _ = self.channel_monitor_updated(channel_id, update_id);
}
let monitors = self.monitors.read().unwrap();
let mut pending_monitor_events = self.pending_monitor_events.lock().unwrap().split_off(0);
for monitor_state in self.monitors.read().unwrap().values() {
for monitor_state in monitors.values() {
// Hold the deferred lock across both `get_and_clear_pending_monitor_events`
// and the deferred resolution so that `update_monitor` cannot insert a
// `MonitorEvent` and deferral entry between the two steps.
let mut deferred = monitor_state.deferred_monitor_update_completions.lock().unwrap();
let monitor_events = monitor_state.monitor.get_and_clear_pending_monitor_events();
if monitor_events.len() > 0 {
let monitor_funding_txo = monitor_state.monitor.get_funding_txo();
Expand All @@ -1443,6 +1485,30 @@ where
counterparty_node_id,
));
}
// Resolve any deferred monitor update completions after collecting regular monitor
// events. The regular events include the force-close (CommitmentTxConfirmed), which
// ChannelManager processes first. The deferred completions come after, so that
// completion actions resolve once the ChannelForceClosed update (generated during
// force-close processing) also gets deferred and resolved in the next event cycle.
let mut pending = monitor_state.pending_monitor_updates.lock().unwrap();
for update_id in deferred.drain(..) {
pending.retain(|id| *id != update_id);
let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
if !monitor_is_pending_updates {
let funding_txo = monitor_state.monitor.get_funding_txo();
let channel_id = monitor_state.monitor.channel_id();
pending_monitor_events.push((
funding_txo,
channel_id,
vec![MonitorEvent::Completed {
funding_txo,
channel_id,
monitor_update_id: monitor_state.monitor.get_latest_update_id(),
}],
monitor_state.monitor.get_counterparty_node_id(),
));
}
Comment on lines +1494 to +1510

Collaborator:

Potential duplicate `MonitorEvent::Completed`: if `channel_monitor_updated` (the public API, or via `get_and_clear_completed_updates`) processes a deferred `update_id` before this drain runs, that update will already have been removed from `pending` and a `Completed` event will already have been emitted. This drain loop will then call `retain` (a no-op since the id is gone), see that `pending` is empty, and emit a second `Completed` for the same monitor.

In practice this requires a buggy `Persist` implementation reporting the same update both synchronously (returning `Completed`) and asynchronously (via `get_and_clear_completed_updates`), or external misuse of `force_channel_monitor_updated`. The impact is low since `ChannelManager::channel_monitor_updated` handles duplicate `Completed` events idempotently, but a defensive check would be cheap.

Suggested change (replacing the drain loop above):

for update_id in deferred.drain(..) {
	let prev_len = pending.len();
	pending.retain(|id| *id != update_id);
	if prev_len == pending.len() {
		// Already resolved by channel_monitor_updated; skip to avoid
		// duplicate MonitorEvent::Completed.
		continue;
	}
	let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
	if !monitor_is_pending_updates {
		let funding_txo = monitor_state.monitor.get_funding_txo();
		let channel_id = monitor_state.monitor.channel_id();
		pending_monitor_events.push((
			funding_txo,
			channel_id,
			vec![MonitorEvent::Completed {
				funding_txo,
				channel_id,
				monitor_update_id: monitor_state.monitor.get_latest_update_id(),
			}],
			monitor_state.monitor.get_counterparty_node_id(),
		));
	}
}
}
}
pending_monitor_events
}
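The defensive drain proposed in the review thread above can be exercised in isolation. This is a simplified sketch with plain `Vec<u64>` stand-ins for the monitor state (hypothetical helper, not LDK's actual types): a completion is only emitted when this drain actually removed the id from `pending`.

```rust
// Simplified sketch (not LDK code) of the prev_len/retain dedup: emit a
// completion only if draining this update_id actually shrank `pending`,
// and only once no pending updates remain.
fn drain_deferred(deferred: &mut Vec<u64>, pending: &mut Vec<u64>) -> Vec<u64> {
	let mut completed = Vec::new();
	for update_id in deferred.drain(..) {
		let prev_len = pending.len();
		pending.retain(|id| *id != update_id);
		if prev_len == pending.len() {
			// Already resolved elsewhere (e.g. via channel_monitor_updated);
			// skip to avoid a duplicate completion event.
			continue;
		}
		if pending.is_empty() {
			completed.push(update_id);
		}
	}
	completed
}

fn main() {
	// Update 4 was already resolved externally; only update 3 is still pending.
	let mut deferred = vec![3, 4];
	let mut pending = vec![3];
	let completed = drain_deferred(&mut deferred, &mut pending);
	// Draining 3 empties `pending` and emits one completion; 4 is skipped.
	assert_eq!(completed, vec![3]);
	assert!(pending.is_empty());
}
```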
1 change: 1 addition & 0 deletions lightning/src/chain/channelmonitor.rs
@@ -6819,6 +6819,7 @@ mod tests {
let legacy_cfg = test_legacy_channel_config();
let node_chanmgrs = create_node_chanmgrs(3, &node_cfgs, &[Some(legacy_cfg.clone()), Some(legacy_cfg.clone()), Some(legacy_cfg)]);
let nodes = create_network(3, &node_cfgs, &node_chanmgrs);
nodes[1].disable_monitor_completeness_assertion();
let channel = create_announced_chan_between_nodes(&nodes, 0, 1);
create_announced_chan_between_nodes(&nodes, 1, 2);

12 changes: 4 additions & 8 deletions lightning/src/chain/mod.rs
@@ -233,11 +233,10 @@ pub enum ChannelMonitorUpdateStatus {
/// This includes performing any `fsync()` calls required to ensure the update is guaranteed to
/// be available on restart even if the application crashes.
///
/// If you return this variant, you cannot later return [`InProgress`] from the same instance of
/// [`Persist`]/[`Watch`] without first restarting.
/// You cannot switch from [`InProgress`] to this variant for the same channel without first
/// restarting. However, switching from this variant to [`InProgress`] is always allowed.
///
/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
/// [`Persist`]: chainmonitor::Persist
Completed,
/// Indicates that the update will happen asynchronously in the background or that a transient
/// failure occurred which is being retried in the background and will eventually complete.
@@ -263,12 +262,7 @@
/// reliable, this feature is considered beta, and a handful of edge-cases remain. Until the
/// remaining cases are fixed, in rare cases, *using this feature may lead to funds loss*.
///
/// If you return this variant, you cannot later return [`Completed`] from the same instance of
/// [`Persist`]/[`Watch`] without first restarting.
///
/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
/// [`Completed`]: ChannelMonitorUpdateStatus::Completed
/// [`Persist`]: chainmonitor::Persist
InProgress,
/// Indicates that an update has failed and will not complete at any point in the future.
///
@@ -328,6 +322,8 @@ pub trait Watch<ChannelSigner: EcdsaChannelSigner> {
/// cannot be retried, the node should shut down immediately after returning
/// [`ChannelMonitorUpdateStatus::UnrecoverableError`], see its documentation for more info.
///
/// See [`ChannelMonitorUpdateStatus`] for requirements on when each variant may be returned.
///
/// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
fn update_channel(
&self, channel_id: ChannelId, update: &ChannelMonitorUpdate,
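The switching rule the reworded `ChannelMonitorUpdateStatus` docs describe (per channel, `Completed` may give way to `InProgress` at any time, but once `InProgress` has been returned the channel must not report `Completed` synchronously until restart) can be enforced mechanically on the persister side. A minimal sketch, with a hypothetical `ModeTracker` helper that is not part of LDK's API:

```rust
// Hypothetical helper (not an LDK type) enforcing the documented rule:
// once a channel has returned InProgress, clamp all later synchronous
// results for that channel to InProgress until restart.
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
	Completed,
	InProgress,
}

struct ModeTracker {
	// Channels that have gone async (returned InProgress) this run.
	async_channels: HashSet<u64>,
}

impl ModeTracker {
	fn record(&mut self, channel: u64, wanted: Status) -> Status {
		if self.async_channels.contains(&channel) {
			// InProgress -> Completed is forbidden without a restart,
			// so keep returning InProgress for this channel.
			return Status::InProgress;
		}
		if wanted == Status::InProgress {
			self.async_channels.insert(channel);
		}
		wanted
	}
}

fn main() {
	let mut t = ModeTracker { async_channels: HashSet::new() };
	// Synchronous persistence is fine while the channel has never gone async.
	assert_eq!(t.record(1, Status::Completed), Status::Completed);
	// Completed -> InProgress is always allowed.
	assert_eq!(t.record(1, Status::InProgress), Status::InProgress);
	// A would-be Completed after going async is clamped to InProgress.
	assert_eq!(t.record(1, Status::Completed), Status::InProgress);
	// Other channels are tracked independently.
	assert_eq!(t.record(2, Status::Completed), Status::Completed);
}
```

The async completion would then be surfaced later, e.g. via `channel_monitor_updated`, rather than by switching the synchronous return back to `Completed`.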