Fix flaky handshake + set bootargs explicitly for NAND install #63

Merged
widgetii merged 2 commits into master from fix/handshake-resilience on Apr 29, 2026

@widgetii widgetii commented Apr 29, 2026

Summary

Two real-hardware bugs fixed together (both verified on hi3516av200):

1. Flaky boot ROM handshake → reliable retry + better diagnostics

The transient "Failed to send DDR step" on first install (succeeds on retry with no code change) is fixed by three changes:

  • Distinct DDR-step error attribution. `_send_ddr_step` returns a phase-specific message instead of bool, so handshake-timeout vs PRESTEP0/DDRSTEP0/PRESTEP1 frame failures are distinguishable.
  • Drain-until-silent replaces fixed sleep + flush. New `Transport.drain_until_silent(quiet_period, max_wait)` keeps reading until the line has stayed quiet for `quiet_period`, giving up after `max_wait`. Because a powered-off chip can't transmit, silence is a deterministic ready signal. The previous `sleep(2.0) + flush_input` could miss stale bytes arriving during or after the flush.
  • Retry the handshake/DDR phase. `RecoverySession.run` wraps power-cycle + handshake + DDR-init in a retry loop (default 2 attempts) when programmatic power control is available. Past-DDR failures are not retried (slow, rarely transient).
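
The drain-until-silent idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the free function, the `read_available` callback, and the `clock`, `sleep`, and `poll_interval` parameters are assumptions added so the sketch runs without a serial port. Only the `quiet_period` and `max_wait` parameter names come from the description above.

```python
import time
from typing import Callable


def drain_until_silent(
    read_available: Callable[[], bytes],  # hypothetical: returns pending bytes, b"" if none
    quiet_period: float = 0.2,
    max_wait: float = 2.0,
    clock: Callable[[], float] = time.monotonic,  # injectable for tests (assumption)
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests (assumption)
    poll_interval: float = 0.01,
) -> int:
    """Discard input until the line has been quiet for quiet_period seconds.

    Returns the number of bytes discarded. Stops after max_wait seconds
    even if the line is still noisy.
    """
    start = clock()
    last_activity = start
    drained = 0
    while True:
        chunk = read_available()
        now = clock()
        if chunk:
            # Stale data arrived: count it and restart the quiet timer.
            drained += len(chunk)
            last_activity = now
        elif now - last_activity >= quiet_period:
            return drained
        if now - start >= max_wait:
            return drained
        sleep(poll_interval)
```

Unlike a fixed sleep followed by a single flush, every late byte restarts the quiet timer, so nothing that arrives during the wait is left in the buffer for the handshake to trip over.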

2. `defib install` set wrong bootargs → kernel panic on boot

Recent OpenIPC U-Boot defaults to `rootfstype=squashfs` even when `rootfs.ubi` from the same release contains UBIFS, causing a kernel panic ("Unable to mount root fs ... tried squashfs"). Fix: defib now runs `setenv bootargs` to match the rootfs format it just wrote (UBIFS for UBI images, squashfs+ubiblock otherwise) instead of trusting U-Boot's compiled-in default. The bootargs string is built in a small `_nand_bootargs` helper for unit-testability.
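
A helper of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual `_nand_bootargs`: the function names, the `extra` parameter, and the magic-byte check are hypothetical. The PR does not say how defib decides that a rootfs file is a UBI image; the sketch uses the `UBI#` erase-counter header magic as one plausible way. Only the `root=`, `rootfstype=`, and `ubi.block=` fragments are taken from the description and boot log quoted in this PR; the rest of the real command line is elided there and is left to the caller here.

```python
UBI_EC_HDR_MAGIC = b"UBI#"  # magic at the start of each UBI erase-counter header


def looks_like_ubi_image(head: bytes) -> bool:
    """Assumed detection: a UBI image starts with an erase-counter header."""
    return head.startswith(UBI_EC_HDR_MAGIC)


def nand_bootargs(rootfs_is_ubi: bool, extra: str = "") -> str:
    """Build a bootargs string matching the rootfs format that was written.

    `extra` is a hypothetical placeholder for the remaining arguments
    (console, mtdparts, init, ...), which this sketch does not guess.
    """
    if rootfs_is_ubi:
        # UBIFS volume mounted directly by name.
        core = "root=ubi0:rootfs rootfstype=ubifs"
    else:
        # squashfs exposed through a read-only ubiblock device.
        core = "root=/dev/ubiblock0_0 rootfstype=squashfs ubi.block=0,0"
    return f"{core} {extra}".strip()
```

Keeping the string construction in a pure function is what makes it unit-testable: the tests can assert on the returned string without a serial link or a U-Boot prompt.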

Test plan

  • Unit tests: 16 new regression tests (11 for handshake resilience + 5 for bootargs), 350 passed total, 2 skipped
  • Lint clean (ruff)
  • Type check clean (mypy)
  • Real-hardware verification on hi3516av200:
    • Install: `Drained 2 stale bytes from serial` (drain_until_silent works), DDR step succeeded first try, no retry needed
    • Boot: `Kernel command line: root=ubi0:rootfs rootfstype=ubifs ...` (defib-set, not U-Boot default)
    • UBIFS mounts: `good PEBs: 944, bad PEBs: 0, corrupted PEBs: 0`
    • Init runs through: `Starting dropbear: OK`, login services come up

🤖 Generated with Claude Code

widgetii and others added 2 commits April 29, 2026 12:01
Three resilience fixes for the transient handshake failure observed on
hi3516av200 (one in N flashes failed at "Failed to send DDR step" with
no code change between successes):

1. Distinct DDR-step error attribution.
   _send_ddr_step now returns a phase-specific error string instead of a
   bool, so callers can tell handshake-timeout from PRESTEP0/DDRSTEP0/
   PRESTEP1 frame failures.  Previously every failure surfaced as
   "Failed to send DDR step" — misleading when the actual failure was
   the sendFrameForStart handshake never latching.

2. Drain-until-silent replaces fixed sleep + flush.
   Transport gets a drain_until_silent(quiet_period, max_wait) helper
   that loops reading until the line stays quiet long enough.  Session
   uses it after power_cycle.  More robust than a fixed 2s sleep +
   tcflush — a powered-off chip can't transmit, so silence is a
   deterministic ready signal, and we don't lose late-arriving stale
   bytes that beat the flush.

3. Retry the handshake/DDR phase on transient failure.
   RecoverySession.run wraps power_cycle + handshake + DDR-init in a
   retry loop (default 2 attempts) when programmatic power control is
   available.  Past-DDR failures are not retried since they're slow
   and rarely transient.  Without power control, retries are pointless
   and disabled.
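
The retry policy in item 3, combined with the phase-specific error strings from item 1, can be sketched as below. This is a simplified stand-in rather than the real RecoverySession.run: the function name, the callback signatures, and the example error messages are hypothetical. Only the default of 2 attempts and the rule that retries are disabled without power control come from the text above.

```python
from typing import Callable, Optional


def run_handshake_phase(
    power_cycle: Optional[Callable[[], None]],  # None means no programmatic power control
    handshake_and_ddr: Callable[[], Optional[str]],  # None on success, else an error string
    max_attempts: int = 2,
) -> Optional[str]:
    """Run power-cycle + handshake + DDR-init, retrying transient failures.

    Returns None on success, otherwise the phase-specific error from the
    last attempt (for example "handshake timeout" or "PRESTEP0 frame failed";
    these messages are illustrative only).
    """
    # Without power control the chip cannot be reset between attempts,
    # so a retry would just repeat the same failure.
    attempts = max_attempts if power_cycle is not None else 1
    error: Optional[str] = None
    for _ in range(attempts):
        if power_cycle is not None:
            power_cycle()
        error = handshake_and_ddr()
        if error is None:
            return None
    return error
```

Anything after DDR-init would run outside this function and fail immediately, which matches the decision not to retry past-DDR failures.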

11 regression tests in test_handshake_resilience.py cover all three
fixes — DDR-step error attribution, drain_until_silent quiet-period /
max-wait / idle-line behavior, and session retry behavior across all
relevant cases (transient handshake, post-DDR failure, no power
control, retries exhausted).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

defib install previously set ``mtdparts`` and ``bootcmd`` but relied on
U-Boot's compiled-in default ``bootargs``.  Recent OpenIPC U-Boot builds
default to:

    root=/dev/ubiblock0_0 rootfstype=squashfs ubi.block=0,0 init=/init

even when the actual ``rootfs.ubi.hi3516av200`` from the same release
contains UBIFS.  Result: the kernel tries to mount the root filesystem
as squashfs and panics with "Unable to mount root fs ... tried squashfs".

Fix: defib now runs ``setenv bootargs`` to match the rootfs format it just
wrote — UBIFS bootargs when the rootfs file is a UBI image (extracted
to UBIFS), squashfs+ubiblock bootargs otherwise.  Bootargs string built
in a small ``_nand_bootargs`` helper for unit-testability.

Tested on hi3516av200:
- Kernel command line now: root=ubi0:rootfs rootfstype=ubifs ...
- UBIFS mounts, init runs, dropbear/syslog start, login prompt reached

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@widgetii widgetii changed the title from "Fix flaky boot ROM handshake: retry, drain-to-silence, distinct errors" to "Fix flaky handshake + set bootargs explicitly for NAND install" on Apr 29, 2026
@widgetii widgetii merged commit 1cea7ec into master Apr 29, 2026
13 checks passed
@widgetii widgetii deleted the fix/handshake-resilience branch April 29, 2026 10:00