Fix hi3516av300 install: SPL boundary, FF padding, write timeout #65
Merged
End-to-end install of OpenIPC firmware on hi3516av300 hung mid-transfer in two places. This PR fixes both, plus the supporting infrastructure that masked the failures as indefinite hangs.

## SPL boundary detection (gzip + round-down)

OpenIPC's hi3516av300 universal U-Boot uses **gzip** (not LZMA) to compress the embedded U-Boot payload. The previous `_detect_spl_size` only scanned for LZMA, found nothing, and fell back to the profile default `FILELEN[1]=0x6000` (24 KB). That overshoots the actual 21 KB SPL code by 3 KB into SRAM that the bootrom uses for its own working memory: the cv500-family bootrom hangs the moment we start overwriting it.

PR #55's `max(detected, profile_max)` was correct for HiTool reference SPLs (which fill the full window) and for SVB-enabled av200 (where detected > profile_max). It is wrong for OpenIPC builds that are more compact than HiTool's reference: we must trust the detected boundary even when it's smaller than profile_max.

Two changes:

- Detect gzip (`1f 8b 08`) in addition to LZMA.
- Round the boundary **down** to the nearest 1 KB so we never include any bytes of the compressed payload (the previous round-up included ~272 bytes of gzip data past the boundary).

## SPL TAIL non-fatal for av200/av300

When `prestep_data` is set (av200/av300/sendFrameForStart chips), the SPL detaches the bootrom protocol handler as soon as it receives the declared byte count, so the SPL TAIL frame is never ACKed. Treating that as fatal stalled the SPL stage. This change mirrors the existing best-effort TAIL handling for U-Boot on these same chips.

## U-Boot upload: zero long 0xFF runs

After the SPL boundary fix, the U-Boot upload to DDR reproducibly hung at chunk 21 with the *exact* same byte content. Bisection narrowed the trigger to a 12-byte run of `0xFF` padding between the end of the SPL code and the start of the gzip header (byte offsets `0x52E4..0x52EF` of the av300 universal U-Boot).
Confirmed root cause:

- 11 consecutive `0xFF` bytes: PASS
- 12 consecutive `0xFF` bytes: HANG mid-DATA frame, no ACK ever
- Patching those 12 bytes to `0x00`: full 248 KB U-Boot uploads cleanly and the rest of the install completes.

Almost certainly a UART RX-path quirk in the cv500-family bootrom (possibly a buffer-empty pattern detector). The 0xFF runs are inert padding, never executed by anything, so zeroing them in `_send_uboot` is safe. The threshold of 12 matches the empirically observed boundary.

## Write timeout via pyserial `write_timeout`

Both bugs above presented as **indefinite hangs**, not errors, because `SerialTransport.write` had no timeout. pyserial's blocking `write()` blocked in `pselect6` forever when the kernel TX buffer stopped draining (because the device stopped accepting bytes). `asyncio.wait_for` on a `run_in_executor` future does not help: cancelling the asyncio task can't interrupt a thread blocked in a syscall. The fix is to set `port.write_timeout = 5.0` so pyserial itself returns and raises `SerialTimeoutException`, which we map to `TransportTimeout`. The 5 s ceiling is generous: a 1 KB write at 115200 baud takes ~89 ms.

## Retry-loop catches write timeouts

Previously `transport.write(frame_data)` was called *outside* the retry loop's `try/except TransportTimeout`. A transient write failure would propagate up the stack and bypass retry. Moving the write inside the try block makes write timeouts symmetric with read timeouts: both are retried.

## install: `--nor-size 32`

The `install` subcommand only supported 8 MB and 16 MB NOR layouts. Added a 32 MB layout (256 KB boot, 64 KB env, 3 MB kernel, 24 MB rootfs, rest rootfs_data). OpenIPC U-Boot defines `setnor8m` and `setnor16m` env vars but not `setnor32m`, so for the 32 MB case we send the raw mtdparts string inline instead of `run mtdpartsnor32m`.
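The 32 MB layout above reduces to simple offset arithmetic. The following is a minimal sketch, not the code from this PR: the partition sizes and names come from the description above, while the `hi_sfc` device name, the function names, and the exact string format are assumptions made for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

# Fixed-size partitions from the PR description; the last partition takes the rest.
FIXED_PARTS = [
    ("boot", 256 * KIB),
    ("env", 64 * KIB),
    ("kernel", 3 * MIB),
    ("rootfs", 24 * MIB),
]


def nor_layout(flash_size: int) -> list[tuple[str, int, int]]:
    """Return (name, offset, size) for each partition of a NOR flash."""
    parts = []
    offset = 0
    for name, size in FIXED_PARTS:
        parts.append((name, offset, size))
        offset += size
    # rootfs_data fills whatever is left after the fixed partitions.
    parts.append(("rootfs_data", offset, flash_size - offset))
    return parts


def mtdparts_string(parts, device: str = "hi_sfc") -> str:
    """Build an inline mtdparts string; the device name is an assumption."""
    fields = [f"{size // KIB}k({name})" for name, _, size in parts[:-1]]
    fields.append(f"-({parts[-1][0]})")  # "-" means "remaining space"
    return f"mtdparts={device}:" + ",".join(fields)


layout = nor_layout(32 * MIB)
offsets = {name: offset for name, offset, _ in layout}
```

Under these assumptions the kernel lands at `0x050000` and the rootfs at `0x350000`, which matches the flash addresses reported in the verification log.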
## Verification

End-to-end on a real hi3516av300 (Vstarcam, IMX415):

```
Phase 1: Burning U-Boot to RAM ✓ 32 s
SPL boundary detected (gzip) at 0x5000 (20480 bytes);
profile default was 0x6000 (24576 bytes)
_zero_long_ff_runs: zeroed 12 0xFF bytes at offset 0x52E4
DDR step / SPL / U-Boot complete ✓
Phase 2: Flash via TFTP
U-Boot → 0x000000 (248775 B) Flash verified DF75ECC7
kernel → 0x050000 (1987111 B) Flash verified 49DA597E
rootfs → 0x350000 (7499776 B) Flash verified AF769A05
Setting boot environment ✓
Resetting device ✓
Install complete! Device is rebooting into OpenIPC.
```

12 new regression tests in `tests/test_protocol_standard.py`:

- `TestDetectSplSize`: gzip + LZMA detection, round-down, no-marker fallback, scan-window bounds
- `TestZeroLongFfRuns`: threshold edge (11 vs 12), end-of-buffer, multiple runs, no-op short-circuit, exact av300 padding pattern
- `TestSplTailNonFatalForFrameBlast`: succeeds on av300 without TAIL ACK; still strictly fails on chips without prestep_data
- `TestWriteTimeoutRetry`: `_send_frame_with_retry` recovers from transient write timeouts

402 tests pass. ruff clean. mypy clean on changed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
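The write-timeout mapping and the retry behaviour covered by `TestWriteTimeoutRetry` can be sketched as follows. This is a simplified, synchronous illustration rather than the project's async implementation: `PortWriteTimeout` stands in for pyserial's `SerialTimeoutException` so the block has no dependencies, and the `FlakyPort` test double, the ACK byte value, and the function signatures are all assumptions.

```python
class TransportTimeout(Exception):
    """Raised when a read or write does not complete in time (name from the PR)."""


class PortWriteTimeout(Exception):
    """Stand-in for pyserial's serial.SerialTimeoutException."""


ACK = b"\xaa"  # assumption: the ACK byte value is not stated in the PR text


class SerialTransport:
    def __init__(self, port):
        # With pyserial the port would be opened with write_timeout=5.0 so that
        # a stalled write raises instead of blocking forever.
        self._port = port

    def write(self, data: bytes) -> None:
        try:
            self._port.write(data)
        except PortWriteTimeout as exc:
            raise TransportTimeout("write timed out") from exc

    def read(self, n: int) -> bytes:
        data = self._port.read(n)
        if len(data) < n:
            raise TransportTimeout("read timed out")
        return data


def send_frame_with_retry(transport, frame: bytes, retries: int = 32) -> int:
    """Send one frame and wait for an ACK; return the attempt that succeeded."""
    for attempt in range(1, retries + 1):
        try:
            # The write sits inside the try block, so a write timeout is
            # retried exactly like a read timeout.
            transport.write(frame)
            if transport.read(1) == ACK:
                return attempt
        except TransportTimeout:
            continue
    raise TransportTimeout(f"no ACK after {retries} attempts")


class FlakyPort:
    """Hypothetical test double: the first `fail_writes` writes time out."""

    def __init__(self, fail_writes: int):
        self.fail_writes = fail_writes

    def write(self, data: bytes) -> int:
        if self.fail_writes > 0:
            self.fail_writes -= 1
            raise PortWriteTimeout()
        return len(data)

    def read(self, n: int) -> bytes:
        return ACK
```

With two transient write timeouts the third attempt succeeds. If the write had been left outside the `try`, the first timeout would have propagated to the caller.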
widgetii added a commit that referenced this pull request on May 5, 2026
## Summary

Third agent platform after ev300 (V4) and cv300 (V3): the **cv500-family** generation (V5), Cortex-A7 with the same memory map shared by cv500, av300, and dv300.

Most of the wiring already existed. The Cortex-A7 startup path with MMU + I/D cache from #67 covers cv500-family. PR #65 already taught `hisilicon_standard.py` the cv500-family bootrom quirks (12-byte 0xFF run zeroing, gzip SPL boundary, non-fatal SPL TAIL when prestep_data is set). Two real fixes were needed to make av300 actually run.

## Fixes

### 1. Wrong UART base in `agent/Makefile` cv500 entry

The cv500 block had `UART_BASE = 0x12100000` copy-pasted from the V3/hi3518ev200 layout. qemu-hisilicon's `hi3516cv500_soc` (which models the whole cv500 family, av300 included) puts UART0 at **`0x120A0000`**. QEMU surfaced it instantly: the agent ran silently because writes went to unmapped I/O. Fix: `0x120A0000`.

### 2. SPL boundary detected from the wrong buffer

`_send_spl()` in the agent-flash flow takes both `firmware` (the agent binary) and `spl_override` (OpenIPC u-boot used as the SPL stage), but `_detect_spl_size` was scanning `firmware`. The agent has no compressed payload, so the scan fell through to `profile_max = 0x6000`, overshooting the real SPL code (ends at `0x5000` on av300) by 0x1000 B. Those extra bytes include the **12-byte 0xFF padding at `0x52E4`** that PR #65 identified as the cv500-family bootrom RX-hang trigger. Result: SPL DATA chunk #21 stalls with no ACK, full 32-retry exhaustion.

Fix: detect the SPL boundary from `spl_override` when present. Plus defense-in-depth: apply `_zero_long_ff_runs` to `spl_data` so any ≥12-byte 0xFF run that does slip through (e.g. a non-OpenIPC SPL build with FF padding earlier in the binary) doesn't trip the same bug.

### 3. Aliasing in `chip_to_agent`

Match the existing `gk7205v300 → gk7205v200` pattern: one binary, multiple chip names route to it. Add `hi3516av300 → hi3516cv500` and `hi3516dv300 → hi3516cv500` to `get_agent_binary()`. No new Makefile entry needed.

### 4. Cosmetic profile name fix

Both `hi3516av300.json` and `hi3516dv300.json` had `"name": "hi3516cv500"` (a copy-paste artifact; the profile loader keys off the filename, so this is harmless but inconsistent).

## Verification

QEMU smoke test: `qemu-system-arm -M hi3516cv500` runs the agent cleanly, READY/DEFIB packet stream, no faults.

Real **hi3516av300** board (`/dev/uart-hi3516av300_imx415`, MikroTik `ether8`):

```
jedec=c22019 (Macronix MX25L256, 32 MiB), ram=0x80000000, caps=0x7f, version=2
256 KiB @ 921600: 3.04 s = 84.3 KB/s
1 MiB sustained @ 921600: 11.74 s = 87.2 KB/s
flash bytes match installed u-boot byte-for-byte
```

Cross-platform throughput summary (1 MiB read, 921600 baud):

| Platform | CPU | Generation | KB/s |
|---|---|---|---|
| ev300 | Cortex-A7 | V4 | 87.1 |
| cv300 | ARM926 | V3 | 89.0 |
| **av300** | **Cortex-A7** | **V5 (cv500-family)** | **87.2** |

All three are within ~2 KB/s: UART baud is the bottleneck.

```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff check src/ tests/: All checks passed
mypy src/defib/ --ignore-missing-imports: no issues found in 55 source files
```

dv300 has no test board attached; it inherits the same binary as a silent alias. av300 hardware proves the same codepath cv500/dv300 will take.

## Test plan

- [x] QEMU `-M hi3516cv500` boots agent cleanly (caught the wrong UART base)
- [x] Real av300 hardware: agent upload, info, flash read at 921600 baud, content verified
- [x] All test suites green (host C, pytest, ruff, mypy)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
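Fix 2 in the commit message above combines three steps: pick the right buffer, detect the boundary, and zero long 0xFF runs. The following is a minimal sketch under stated assumptions, not the code in `hisilicon_standard.py`. The gzip magic, the 1 KB round-down, and the 12-byte threshold come from PR #65. The LZMA marker bytes, the scan limit, and all function names and signatures here are hypothetical.

```python
import re
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b\x08"  # from the PR #65 description
LZMA_MAGIC = b"\x5d\x00\x00"  # assumption: a common LZMA header prefix
FF_RUN_THRESHOLD = 12         # from the PR #65 description


def detect_spl_size(image: bytes, profile_max: int, scan_limit: int = 0x10000) -> int:
    """Return the offset of the first compression marker, rounded down to 1 KB."""
    window = image[:scan_limit]
    hits = [i for i in (window.find(GZIP_MAGIC), window.find(LZMA_MAGIC)) if i > 0]
    if not hits:
        return profile_max  # no marker found: fall back to the profile default
    size = min(hits) & ~0x3FF  # round down so no payload bytes are included
    return size or profile_max


def zero_long_ff_runs(data: bytes, threshold: int = FF_RUN_THRESHOLD) -> bytes:
    """Replace every run of at least `threshold` 0xFF bytes with zeros."""
    pattern = re.compile(rb"\xff{%d,}" % threshold)
    return pattern.sub(lambda m: b"\x00" * len(m.group()), data)


def build_spl_payload(firmware: bytes, spl_override: Optional[bytes], profile_max: int) -> bytes:
    """Slice the SPL out of the buffer that actually contains it."""
    # Scan the override when present: the agent binary has no compressed payload.
    source = spl_override if spl_override is not None else firmware
    size = detect_spl_size(source, profile_max)
    return zero_long_ff_runs(source[:size])


# Synthetic av300-like image: code, 12 bytes of 0xFF padding at 0x52E4, gzip at 0x52F0.
uboot = b"\x01" * 0x52E4 + b"\xff" * 12 + GZIP_MAGIC + b"\x02" * 64
agent = b"\x03" * 0x8000
payload = build_spl_payload(agent, uboot, profile_max=0x6000)
```

With this synthetic image the payload is `0x5000` bytes long, so the padding at `0x52E4` is never sent. Scanning `agent` instead would fall back to `0x6000`. A stricter LZMA check than a three-byte prefix is likely needed in practice to avoid false matches inside code.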
widgetii added a commit that referenced this pull request on May 5, 2026
## Summary

Fourth agent platform after **ev300 (V4)**, **cv300 (V3)**, and **av300/dv300/cv500 (V5/cv500-family)**. V3A = `3519v101 + av200`: Cortex-A7 with V3-era peripheral addresses (UART `0x12100000`, WDT `0x12080000`) but DDR at `0x80000000` like cv500-family, per qemu-hisilicon's `hi3519v101_soc`.

The bootrom-protocol quirks (`sendFrameForStart` handshake, `PRESTEP1` DDR training step, non-fatal TAILs) were already landed for `defib install` / `defib burn` in #47 + #48 + #65. This PR is just the agent build wiring plus one real protocol fix the agent-flash path was missing.

## The fix: don't pre-truncate `spl_override` at the call site

`defib agent upload` / `agent flash` were doing:

```python
spl_data = cached_fw.read_bytes()[:profile.spl_max_size]
```

before passing to `send_firmware()`. When `_send_spl()` then scans this truncated buffer for the LZMA/gzip SPL boundary, it can't find anything past `profile_max`, so for chips where the OpenIPC SPL is *larger* than the HiTool reference (e.g. **av200's SVB-enabled SPL is `0x6800`, but `profile_max` is `0x4F00`**), we send `0x1900` too few bytes. The SPL never finishes its post-DDR-init code, the SPL TAIL completes with no follow-through, and the agent HEAD frame for `0x81000000` gets `0x08` rejection.

Fix: pass the full u-boot binary as `spl_override`. `_send_spl()` already handles the slicing via its detected LZMA/gzip boundary.

## Verification

**QEMU** `qemu-system-arm -M hi3519v101 -kernel agent-hi3519v101.elf`: agent boots cleanly, READY/DEFIB packet stream, no faults.

**Real hi3516av200** board (`/dev/ttyUSB1`, MikroTik `ether8`):

```
upload ok=True
agent ready: ram_base=0x80000000 caps=0x7f version=2
```

The board in our lab has SPI NAND and the agent's NOR-only flash driver returns shifted JEDEC bytes / 0-byte reads on NAND; that's a separate larger limitation noted at the bottom.

**hi3516cv300 regression** (`/dev/uart-IVGHP203Y-AF`, MikroTik `ether3`):

```
agent ready: jedec=ef4018 flash=16384KiB ram=0x80000000 caps=0x7f
256 KiB @ 921600: 3.02 s = 84.9 KB/s
```

The spl_override-truncation fix changes cv300's SPL size from `0x4F00` (clamped) to `0x5400` (LZMA-detected at offset `0x54c8`). Agent loads identically, throughput unchanged. The previous undersized SPL was working coincidentally because cv300's SPL pieces happened to fit in the smaller window.

```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
```

All four agent SoCs (ev300, cv300, cv500, 3519v101) build clean.

## Aliasing

Following the existing `gk7205v300 → gk7205v200` shape: one binary serves the family, multiple chip names route to it.

```python
"hi3519v101": "hi3519v101",
"hi3516av200": "hi3519v101",  # 3519v101 family, same memory map
```

## Known limitation: SPI NAND on av200 boards

The av200 board in our lab has **SPI NAND** flash. The agent's flash driver (`agent/spi_flash.c`) supports SPI NOR only: it uses the memory-mapped read window at `0x14000000` and direct FMC register commands. On a NAND board:

- Agent loads, runs, and emits READY ✓
- `agent info` returns shifted JEDEC bytes (e.g. `00c212` instead of valid `c2 XX YY`)
- `agent read` returns 0 bytes
- Erase/write/scan won't work

This affects all V3-and-later HiSilicon chips that ship with SPI NAND (some av200, some av300, some cv500). Adding SPI NAND support to the agent is its own piece of work. This PR ships the platform wiring; a follow-up can address NAND.

## Test plan

- [x] QEMU `-M hi3519v101`: agent boots cleanly
- [x] Real av200 hardware: agent uploads + runs + READY
- [x] cv300 regression: throughput and behavior unchanged
- [x] All test suites green

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
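The `0x1900` shortfall described in the commit message above follows directly from slicing the buffer before scanning it. The sketch below uses a synthetic image; the gzip magic stands in for whichever marker the real scan matches on av200, and `detect_boundary` is a hypothetical simplification of `_detect_spl_size`.

```python
MARKER = b"\x1f\x8b\x08"  # stand-in marker; av200 images use LZMA per the text
PROFILE_MAX = 0x4F00      # profile_max for av200, from the commit message


def detect_boundary(image: bytes, profile_max: int) -> int:
    """Return the marker offset rounded down to 1 KB, or profile_max if absent."""
    idx = image.find(MARKER)
    if idx <= 0:
        return profile_max
    return idx & ~0x3FF


# Synthetic av200-like image: the compressed payload starts just past 0x6800.
image = b"\x01" * 0x6810 + MARKER + b"\x02" * 0x100

# Old call site: the slice removes the marker, so the scan falls back.
truncated_size = detect_boundary(image[:PROFILE_MAX], PROFILE_MAX)

# Fixed call site: the full image is scanned and the real boundary is found.
full_size = detect_boundary(image, PROFILE_MAX)

shortfall = full_size - truncated_size
```

Here `truncated_size` is `0x4F00`, `full_size` is `0x6800`, and the difference is the `0x1900` bytes of SPL code that were never sent.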
## Why

End-to-end `defib install -c hi3516av300 --firmware openipc.hi3516av300-nor-neo.tgz` failed silently: the process blocked indefinitely with no error. The cause was two distinct hi3516cv500-family bootrom quirks, plus `SerialTransport.write()` blocking forever on stalled writes (which masked everything as a hang).
## Test plan

- `uv run pytest tests/ -x --ignore=tests/fuzz` (402 pass)
- `uv run defib install -c hi3516av300 --firmware <openipc.tgz> -p <port> --power-cycle --nor-size 32` against a real hi3516av300 NOR-32 board
- … `_detect_spl_size` + `_zero_long_ff_runs` paths; av200 LZMA detection still trusted as before)
- … `prestep_data` (SPL TAIL still treated as fatal, covered by `test_spl_tail_no_ack_fails_for_chip_without_prestep`)

🤖 Generated with Claude Code