
Fix hi3516av300 install: SPL boundary, FF padding, write timeout #65

Merged: widgetii merged 1 commit into master from fix/hi3516av300-spl-and-uboot-upload on May 4, 2026
@widgetii widgetii commented May 4, 2026

End-to-end install of OpenIPC firmware on hi3516av300 hung mid-transfer in two places. This PR fixes both, plus the supporting infrastructure that masked the failures as indefinite hangs.

Why

An end-to-end defib install -c hi3516av300 --firmware openipc.hi3516av300-nor-neo.tgz run failed silently: the process blocked indefinitely with no error.

The cause was two distinct hi3516cv500-family bootrom quirks, compounded by SerialTransport.write() blocking forever on stalled writes, which made every failure look like a hang.

SPL boundary detection (gzip + round-down)

OpenIPC's hi3516av300 universal U-Boot uses gzip (not LZMA) to compress the embedded U-Boot payload. The previous _detect_spl_size only scanned for LZMA, found nothing, and fell back to the profile default FILELEN[1]=0x6000 (24 KB). That overshoots the actual 21 KB SPL code by 3 KB into SRAM that the bootrom uses for its own working memory — the cv500-family bootrom hangs the moment we start overwriting it.

PR #55's max(detected, profile_max) was correct for HiTool reference SPLs (which fill the full window) and for SVB-enabled av200 (where detected > profile_max). It is wrong for OpenIPC builds that are more compact than HiTool's reference: we must trust the detected boundary even when it's smaller than profile_max.

Two changes:

  • Detect gzip (1f 8b 08) in addition to LZMA.
  • Round the boundary down to the nearest 1 KB so we never include any bytes of the compressed payload (the previous round-up included ~272 bytes of gzip data past the boundary).
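
The detection rule described above can be sketched as follows. This is a minimal illustration, not the project's `_detect_spl_size`: the function name `detect_spl_boundary`, the LZMA prefix `5d 00 00`, and the synthetic image are assumptions made for the example. Only the gzip bytes `1f 8b 08`, the 1 KB round-down, and the fallback to the profile default come from the description.

```python
GZIP_MAGIC = b"\x1f\x8b\x08"   # gzip ID1, ID2, CM=deflate (from the PR text)
LZMA_MAGIC = b"\x5d\x00\x00"   # assumption: common LZMA-alone header prefix


def detect_spl_boundary(image: bytes, profile_max: int, align: int = 1024) -> int:
    """Return the SPL size to send: the first compressed-payload marker,
    rounded DOWN to `align`, or `profile_max` when no marker is found."""
    hits = [pos for magic in (GZIP_MAGIC, LZMA_MAGIC)
            if (pos := image.find(magic)) > 0]
    if not hits:
        return profile_max          # no marker: fall back to the profile default
    boundary = (min(hits) // align) * align
    return boundary if boundary > 0 else profile_max


# Synthetic image shaped like the av300 case: code, 12 bytes of 0xFF
# padding at 0x52E4, then a gzip header at 0x52F0.
image = b"\xaa" * 0x52E4 + b"\xff" * 12 + GZIP_MAGIC + b"\x00" + b"\xaa" * 64

detected = detect_spl_boundary(image, profile_max=0x6000)
print(hex(detected))                              # 0x5000
print(0x5400 - image.find(GZIP_MAGIC))            # 272 bytes a round-up would leak
```

Rounding down trades a few trailing bytes of the window for the guarantee that no compressed payload is sent as SPL; in this synthetic case a round-up to 0x5400 would include the 272 bytes mentioned above.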

SPL TAIL non-fatal for av200/av300

When prestep_data is set (av200/av300/sendFrameForStart chips), the SPL detaches the bootrom protocol handler as soon as it receives the declared byte count, so the SPL TAIL frame is never ACKed. Treating the missing ACK as fatal stalled the SPL stage, so it is now non-fatal on these chips. This mirrors the existing best-effort TAIL handling for U-Boot on the same chips.

U-Boot upload: zero long 0xFF runs

After the SPL boundary fix, the U-Boot upload to DDR reproducibly hung at chunk 21 with the exact same byte content. Bisection narrowed the trigger to a 12-byte run of 0xFF padding between the end of the SPL code and the start of the gzip header (byte offsets 0x52E4..0x52EF of the av300 universal U-Boot).

Confirmed root cause:

  • 11 consecutive 0xFF bytes: PASS
  • 12 consecutive 0xFF bytes: HANG mid-DATA frame, no ACK ever
  • Patching those 12 bytes to 0x00: full 248 KB U-Boot uploads cleanly and the rest of the install completes.

This is almost certainly a UART RX-path quirk in the cv500-family bootrom (possibly a buffer-empty pattern detector). The 0xFF runs are inert padding, never executed by anything, so zeroing them in _send_uboot is safe. The threshold of 12 matches the empirically observed boundary.
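
A minimal sketch of the run-zeroing step, assuming a pure function over the image bytes. The name `zero_long_ff_runs` and its return shape are illustrative and may differ from the project's `_zero_long_ff_runs`; the threshold of 12 is the one reported above.

```python
FF_RUN_THRESHOLD = 12  # shortest run observed to hang the bootrom (from the PR)


def zero_long_ff_runs(data: bytes, threshold: int = FF_RUN_THRESHOLD):
    """Replace every run of >= `threshold` consecutive 0xFF bytes with 0x00.

    Returns the patched bytes and a list of (offset, length) for each run
    that was zeroed. Shorter runs are left untouched.
    """
    out = bytearray(data)
    patched = []
    i, n = 0, len(out)
    while i < n:
        if out[i] != 0xFF:
            i += 1
            continue
        start = i
        while i < n and out[i] == 0xFF:   # walk to the end of this run
            i += 1
        length = i - start
        if length >= threshold:
            out[start:i] = bytes(length)  # bytes(n) is n zero bytes
            patched.append((start, length))
    return bytes(out), patched


# An 11-byte run survives, a 12-byte run is zeroed.
blob = b"\x01" + b"\xff" * 11 + b"\x02" + b"\xff" * 12 + b"\x03"
fixed, runs = zero_long_ff_runs(blob)
print(runs)   # [(13, 12)]
```

The output has the same length as the input, so offsets and declared frame sizes are unaffected by the patch.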

Write timeout via pyserial write_timeout

Both bugs above presented as indefinite hangs, not errors, because SerialTransport.write had no timeout. pyserial's blocking write() blocked in pselect6 forever when the kernel TX buffer stopped draining (because the device stopped accepting bytes).

asyncio.wait_for on a run_in_executor future does not help, because cancelling the asyncio task can't interrupt a thread blocked in a syscall. The fix is to set port.write_timeout = 5.0 so pyserial itself gives up and raises SerialTimeoutException, which we map to TransportTimeout. The 5 s ceiling leaves ample headroom: a 1 KB write at 115200 baud takes ~89 ms.
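
The ~89 ms figure can be reproduced with a short calculation. The sketch assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte), which the description does not state explicitly; the helper name is illustrative.

```python
def uart_transfer_seconds(num_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Time to clock `num_bytes` out of a UART, assuming 8N1 framing by default."""
    return num_bytes * bits_per_byte / baud


write_ms = uart_transfer_seconds(1024, 115200) * 1000
headroom = 5.0 / uart_transfer_seconds(1024, 115200)

print(f"{write_ms:.1f} ms")   # 88.9 ms
print(f"{headroom:.0f}x")     # 56x
```

Under that assumption the 5 s timeout is roughly 56 times the expected duration of a 1 KB write, so it should only fire when the device has genuinely stopped draining bytes.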

Retry-loop catches write timeouts

Previously transport.write(frame_data) was called outside the retry loop's try/except TransportTimeout. A transient write failure would propagate up the stack and bypass retry. Moving the write inside the try block makes write timeouts symmetric with read timeouts — both are retried.
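
The effect of moving the write inside the `try` block can be illustrated with a synchronous sketch. Everything here is a stand-in: the real code is asynchronous, and `TransportTimeout`, `FlakyTransport`, `read_ack`, and the ACK byte are defined locally for the example rather than taken from the project.

```python
class TransportTimeout(Exception):
    """Stand-in for the project's timeout error (hypothetical definition)."""


ACK = b"\xaa"  # placeholder ACK byte for this sketch


class FlakyTransport:
    """Fake transport whose first `write_failures` writes time out."""

    def __init__(self, write_failures: int) -> None:
        self.write_failures = write_failures
        self.sent = []

    def write(self, data: bytes) -> None:
        if self.write_failures > 0:
            self.write_failures -= 1
            raise TransportTimeout("write stalled")
        self.sent.append(data)

    def read_ack(self) -> bytes:
        return ACK


def send_frame_with_retry(transport, frame: bytes, retries: int = 3) -> int:
    """Send `frame` until it is ACKed; return the attempt that succeeded."""
    for attempt in range(1, retries + 1):
        try:
            # The write sits INSIDE the try, so a write timeout is retried
            # exactly like a read timeout.
            transport.write(frame)
            if transport.read_ack() == ACK:
                return attempt
        except TransportTimeout:
            continue
    raise TransportTimeout(f"no ACK after {retries} attempts")


transport = FlakyTransport(write_failures=2)
print(send_frame_with_retry(transport, b"frame"))   # 3
```

With the write outside the `try`, the first stalled write in this example would have escaped the loop instead of being retried.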

install: --nor-size 32

The install subcommand only supported 8 MB and 16 MB NOR layouts. Added a 32 MB layout (256 KB boot, 64 KB env, 3 MB kernel, 24 MB rootfs, rest rootfs_data). OpenIPC U-Boot defines setnor8m and setnor16m env vars but not setnor32m, so for the 32 MB case we send the raw mtdparts string inline instead of run mtdpartsnor32m.
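
The 32 MB layout can be checked by computing partition offsets from the sizes listed above. In this sketch the helper names and the MTD device id `hi_sfc` are assumptions for illustration; the partition sizes and names come from the description, and the string follows the kernel's `mtdparts=<mtd-id>:<size>(<name>),...` command-line syntax, where `-` means "the rest of the device".

```python
KB = 1024
MB = 1024 * KB

# Fixed-size partitions of the 32 MB layout, in flash order (from the PR text).
NOR32_FIXED = [("boot", 256 * KB), ("env", 64 * KB),
               ("kernel", 3 * MB), ("rootfs", 24 * MB)]


def partition_table(flash_size, fixed, tail_name="rootfs_data"):
    """Return (name, offset, size) tuples; the tail takes the remaining space."""
    table, offset = [], 0
    for name, size in fixed:
        table.append((name, offset, size))
        offset += size
    if offset >= flash_size:
        raise ValueError("fixed partitions do not fit in flash")
    table.append((tail_name, offset, flash_size - offset))
    return table


def mtdparts(mtd_id, fixed, tail_name="rootfs_data"):
    """Build a kernel-style mtdparts string; mtd_id is an assumed device name."""
    parts = [f"{size // KB}k({name})" for name, size in fixed]
    parts.append(f"-({tail_name})")
    return f"mtdparts={mtd_id}:" + ",".join(parts)


for name, offset, size in partition_table(32 * MB, NOR32_FIXED):
    print(f"{name:12s} 0x{offset:07x} {size // KB:6d} KB")
print(mtdparts("hi_sfc", NOR32_FIXED))
```

The computed kernel and rootfs offsets (0x050000 and 0x350000) agree with the flash addresses in the verification log, and the remaining rootfs_data partition comes out at 4800 KB.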

Verification

End-to-end on a real hi3516av300 (Vstarcam, IMX415):

```
Phase 1: Burning U-Boot to RAM                        ✓ 32 s
  SPL boundary detected (gzip) at 0x5000 (20480 bytes);
  profile default was 0x6000 (24576 bytes)
  _zero_long_ff_runs: zeroed 12 0xFF bytes at offset 0x52E4
  DDR step / SPL / U-Boot complete                    ✓
Phase 2: Flash via TFTP
  U-Boot   → 0x000000 (248775 B)  Flash verified DF75ECC7
  kernel   → 0x050000 (1987111 B) Flash verified 49DA597E
  rootfs   → 0x350000 (7499776 B) Flash verified AF769A05
Setting boot environment                              ✓
Resetting device                                      ✓
Install complete! Device is rebooting into OpenIPC.
```

12 new regression tests in tests/test_protocol_standard.py:

  • TestDetectSplSize — gzip + LZMA detection, round-down, no-marker fallback, scan-window bounds
  • TestZeroLongFfRuns — threshold edge (11 vs 12), end-of-buffer, multiple runs, no-op short-circuit, exact av300 padding pattern
  • TestSplTailNonFatalForFrameBlast — succeeds on av300 without TAIL ACK; still strictly fails on chips without prestep_data
  • TestWriteTimeoutRetry: _send_frame_with_retry recovers from transient write timeouts

402 tests pass. ruff clean. mypy clean on changed files.

Test plan

  • uv run pytest tests/ -x --ignore=tests/fuzz (402 pass)
  • uv run defib install -c hi3516av300 --firmware <openipc.tgz> -p <port> --power-cycle --nor-size 32 against a real hi3516av300 NOR-32 board
  • No regression on hi3516av200 install (uses the same _detect_spl_size + _zero_long_ff_runs paths; av200 LZMA detection still trusted as before)
  • No regression on chips without prestep_data (SPL TAIL still treated as fatal — covered by test_spl_tail_no_ack_fails_for_chip_without_prestep)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@widgetii widgetii merged commit 9c9ea03 into master May 4, 2026
13 checks passed
@widgetii widgetii deleted the fix/hi3516av300-spl-and-uboot-upload branch May 4, 2026 19:19
widgetii added a commit that referenced this pull request May 5, 2026
## Summary

Third agent platform after ev300 (V4) and cv300 (V3): the
**cv500-family** generation (V5) — Cortex-A7 with the same memory map
shared by cv500, av300, and dv300.

Most of the wiring already existed. The Cortex-A7 startup path with MMU
and I/D cache from #67 covers cv500-family. PR #65 already taught
`hisilicon_standard.py` the cv500-family bootrom quirks (12-byte 0xFF
run zeroing, gzip SPL boundary, non-fatal SPL TAIL when prestep_data is
set). Two real fixes were needed to make av300 actually run.

## Fixes

### 1. Wrong UART base in `agent/Makefile` cv500 entry

The cv500 block had `UART_BASE = 0x12100000` copy-pasted from the
V3/hi3518ev200 layout. qemu-hisilicon's `hi3516cv500_soc` (which
models the whole cv500 family, av300 included) puts UART0 at
**`0x120A0000`**. QEMU surfaced it instantly: the agent ran silently
because writes went to unmapped I/O. Fix: `0x120A0000`.

### 2. SPL boundary detected from the wrong buffer

`_send_spl()` in the agent-flash flow takes both `firmware` (the
agent binary) and `spl_override` (OpenIPC u-boot used as the SPL
stage), but `_detect_spl_size` was scanning `firmware`. The agent
has no compressed payload, so the scan fell through to
`profile_max = 0x6000`, overshooting the real SPL code (which ends at
`0x5000` on av300) by 0x1000 B. Those extra bytes include the
**12-byte 0xFF padding at `0x52E4`** that PR #65 identified as the
cv500-family bootrom RX-hang trigger. Result: SPL DATA chunk #21 stalls
with no ACK and all 32 retries are exhausted.

Fix: detect the SPL boundary from `spl_override` when present. Plus
defense-in-depth: apply `_zero_long_ff_runs` to `spl_data` so any
≥12-byte 0xFF run that does slip through (e.g. a non-OpenIPC SPL build
with FF padding earlier in the binary) doesn't trip the same bug.

### 3. Aliasing in `chip_to_agent`

Match the existing `gk7205v300 → gk7205v200` pattern: one binary,
multiple chip names route to it. Add `hi3516av300 → hi3516cv500` and
`hi3516dv300 → hi3516cv500` to `get_agent_binary()`. No new Makefile
entry needed.

### 4. Cosmetic profile name fix

Both `hi3516av300.json` and `hi3516dv300.json` had
`"name": "hi3516cv500"` (a copy-paste artifact; the profile loader keys
off the filename, so this is harmless but inconsistent).

## Verification

QEMU smoke test: `qemu-system-arm -M hi3516cv500` runs the agent
cleanly, READY/DEFIB packet stream, no faults.

Real **hi3516av300** board (`/dev/uart-hi3516av300_imx415`, MikroTik
`ether8`):
```
jedec=c22019 (Macronix MX25L256, 32 MiB), ram=0x80000000, caps=0x7f, version=2
256 KiB @ 921600: 3.04 s = 84.3 KB/s
1 MiB sustained @ 921600: 11.74 s = 87.2 KB/s
flash bytes match installed u-boot byte-for-byte
```

Cross-platform throughput summary (1 MiB read, 921600 baud):

| | CPU | Generation | KB/s |
|---|---|---|---|
| ev300 | Cortex-A7 | V4 | 87.1 |
| cv300 | ARM926 | V3 | 89.0 |
| **av300** | **Cortex-A7** | **V5 (cv500-family)** | **87.2** |

All three within ~2 KB/s — UART baud is the bottleneck.
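
A quick calculation supports the claim that the UART is the bottleneck. The sketch assumes 8N1 framing (10 bits on the wire per byte) and reads the table's "KB/s" as KiB/s, which is the interpretation that reproduces the 87.2 figure from the measured 1 MiB in 11.74 s; both are assumptions, and the helper names are illustrative.

```python
def uart_ceiling_kib_s(baud: int, bits_per_byte: int = 10) -> float:
    """Theoretical payload ceiling of a UART in KiB/s (8N1 by default)."""
    return baud / bits_per_byte / 1024


def measured_kib_s(num_bytes: int, seconds: float) -> float:
    """Observed throughput in KiB/s."""
    return num_bytes / seconds / 1024


ceiling = uart_ceiling_kib_s(921600)
av300 = measured_kib_s(1024 * 1024, 11.74)   # 1 MiB read on av300

print(f"ceiling  {ceiling:.1f} KiB/s")        # ceiling  90.0 KiB/s
print(f"measured {av300:.1f} KiB/s")          # measured 87.2 KiB/s
print(f"{av300 / ceiling:.1%} of line rate")  # 96.9% of line rate
```

Under those assumptions all three platforms run at roughly 97 to 99 percent of what 921600 baud can carry, leaving little room for CPU or flash differences to show up.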

```
make -C agent test HOST_CC=gcc:    5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff check src/ tests/:             All checks passed
mypy src/defib/ --ignore-missing-imports: no issues found in 55 source files
```

dv300 has no test board attached — it inherits the same binary as a
silent alias. av300 hardware proves the same codepath cv500/dv300 will
take.

## Test plan

- [x] QEMU `-M hi3516cv500` boots agent cleanly (caught the wrong UART base)
- [x] Real av300 hardware: agent upload, info, flash read at 921600
baud, content verified
- [x] All test suites green (host C, pytest, ruff, mypy)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
widgetii added a commit that referenced this pull request May 5, 2026
## Summary

Fourth agent platform after **ev300 (V4)**, **cv300 (V3)**, and
**av300/dv300/cv500 (V5/cv500-family)**. V3A = `3519v101 + av200` —
Cortex-A7 with V3-era peripheral addresses (UART `0x12100000`, WDT
`0x12080000`) but DDR at `0x80000000` like cv500-family, per
qemu-hisilicon's `hi3519v101_soc`.

The bootrom-protocol quirks (`sendFrameForStart` handshake, `PRESTEP1`
DDR training step, non-fatal TAILs) were already landed for `defib
install` / `defib burn` in #47 + #48 + #65. This PR is just the agent
build wiring plus one real protocol fix the agent-flash path was
missing.

## The fix: don't pre-truncate `spl_override` at the call site

`defib agent upload` / `agent flash` were doing:

```python
spl_data = cached_fw.read_bytes()[:profile.spl_max_size]
```

before passing to `send_firmware()`. When `_send_spl()` then scans this
truncated buffer for the LZMA/gzip SPL boundary, it can't find anything
past `profile_max` — so for chips where the OpenIPC SPL is *larger* than
the HiTool reference (e.g. **av200's SVB-enabled SPL is `0x6800`, but
`profile_max` is `0x4F00`**), we send `0x1900` too few bytes. The SPL
never finishes its post-DDR-init code, the SPL TAIL completes with no
follow-through, and the agent HEAD frame for `0x81000000` gets `0x08`
rejection.

Fix: pass the full u-boot binary as `spl_override`. `_send_spl()`
already handles the slicing via its detected LZMA/gzip boundary.
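
The effect of pre-truncation can be reproduced on a synthetic buffer. This sketch uses a simplified stand-in for the boundary scan (LZMA only, with an assumed `5d 00 00` prefix) and a made-up image whose marker sits at `0x6800`; the `0x4F00` and `0x6800` values are the av200 figures quoted above.

```python
LZMA_MAGIC = b"\x5d\x00\x00"   # assumption: common LZMA-alone header prefix
PROFILE_MAX = 0x4F00           # av200 profile default quoted in the text


def detect_boundary(buf: bytes, profile_max: int, align: int = 1024) -> int:
    """Simplified scan: first LZMA marker rounded down, else profile_max."""
    pos = buf.find(LZMA_MAGIC)
    if pos <= 0:
        return profile_max
    return (pos // align) * align or profile_max


# Synthetic "u-boot": 0x6800 bytes of SPL code, then the compressed payload.
image = b"\xaa" * 0x6800 + LZMA_MAGIC + b"\xaa" * 0x100

truncated = detect_boundary(image[:PROFILE_MAX], PROFILE_MAX)  # old call site
full = detect_boundary(image, PROFILE_MAX)                     # after the fix

print(hex(truncated), hex(full), hex(full - truncated))  # 0x4f00 0x6800 0x1900
```

Slicing before the scan hides the marker, so the scan silently falls back to the profile default and the SPL comes up `0x1900` bytes short.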

## Verification

**QEMU** `qemu-system-arm -M hi3519v101 -kernel agent-hi3519v101.elf`:
agent boots cleanly, READY/DEFIB packet stream, no faults.

**Real hi3516av200** board (`/dev/ttyUSB1`, MikroTik `ether8`):

```
upload ok=True
agent ready: ram_base=0x80000000 caps=0x7f version=2
```

The board in our lab has SPI NAND and the agent's NOR-only flash driver
returns shifted JEDEC bytes / 0-byte reads on NAND; that's a separate
larger limitation noted at the bottom.

**hi3516cv300 regression** (`/dev/uart-IVGHP203Y-AF`, MikroTik
`ether3`):

```
agent ready: jedec=ef4018 flash=16384KiB ram=0x80000000 caps=0x7f
256 KiB @ 921600: 3.02 s = 84.9 KB/s
```

The spl_override-truncation fix changes cv300's SPL size from `0x4F00`
(clamped) to `0x5400` (LZMA-detected at offset `0x54c8`). Agent loads
identically, throughput unchanged — the previous undersized SPL was
working coincidentally because cv300's SPL pieces happened to fit in the
smaller window.

```
make -C agent test HOST_CC=gcc:    5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
```

All four agent SoCs (ev300, cv300, cv500, 3519v101) build clean.

## Aliasing

Following the existing `gk7205v300 → gk7205v200` shape: one binary
serves the family, multiple chip names route to it.

```python
"hi3519v101": "hi3519v101",
"hi3516av200": "hi3519v101",   # 3519v101 family, same memory map
```

## Known limitation: SPI NAND on av200 boards

The av200 board in our lab has **SPI NAND** flash. The agent's flash
driver (`agent/spi_flash.c`) supports SPI NOR only — uses the
memory-mapped read window at `0x14000000` and direct FMC register
commands. On a NAND board:

- Agent loads, runs, and emits READY ✓
- `agent info` returns shifted JEDEC bytes (e.g. `00c212` instead of
valid `c2 XX YY`)
- `agent read` returns 0 bytes
- Erase/write/scan won't work

This affects all V3-and-later HiSilicon chips that ship with SPI NAND
(some av200, some av300, some cv500). Adding SPI NAND support to the
agent is its own piece of work. This PR ships the platform wiring; a
follow-up can address NAND.

## Test plan

- [x] QEMU `-M hi3519v101`: agent boots cleanly
- [x] Real av200 hardware: agent uploads + runs + READY
- [x] cv300 regression: throughput and behavior unchanged
- [x] All test suites green

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>