Skip to content

fix: retry qq_official startup after gateway timeout#6940

Open
1zzxy1 wants to merge 2 commits intoAstrBotDevs:masterfrom
1zzxy1:fix/qq-official-startup-retry
Open

fix: retry qq_official startup after gateway timeout#6940
1zzxy1 wants to merge 2 commits intoAstrBotDevs:masterfrom
1zzxy1:fix/qq-official-startup-retry

Conversation

@1zzxy1
Copy link
Contributor

@1zzxy1 1zzxy1 commented Mar 25, 2026

Summary

  • retry qq_official startup when �otpy crashes after a transient /gateway/bot timeout
  • recreate the QQ official client between retry attempts and stop cleanly on shutdown
  • add regression tests for retryable startup failures, non-retryable errors, and CancelledError

Testing

  • uv run --group dev pytest tests/test_qqofficial_platform_adapter.py -q
  • uv run --group dev ruff check astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py tests/test_qqofficial_platform_adapter.py

Closes #6858

Summary by Sourcery

Add retry and graceful shutdown logic to the QQ official platform adapter startup flow to handle transient gateway failures.

Bug Fixes:

  • Retry QQ official client startup on transient timeout and connection-related failures instead of exiting.
  • Ensure the QQ official client is always closed and recreated cleanly during recovery and on adapter termination.
  • Propagate non-retryable errors and cancellation correctly from the QQ official adapter run loop.

Tests:

  • Add regression tests covering retryable startup failures, non-retryable startup errors, and CancelledError propagation for the QQ official adapter.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. labels Mar 25, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience of the QQ Official platform adapter by introducing an automatic retry mechanism for startup failures. It ensures that the adapter can recover from temporary network issues or service unavailability without manual intervention, improving the overall stability and reliability of the bot's connection to the QQ platform. The changes also refine the client's lifecycle management and provide robust shutdown procedures.

Highlights

  • Robust Startup Retry Mechanism: Implemented a retry mechanism for the QQ Official platform adapter to handle transient startup failures, such as gateway timeouts, connection errors, and specific TypeError instances.
  • Client Lifecycle Management: Refactored the client creation and closing logic into dedicated methods (_create_client, _close_client, _recreate_client) to ensure proper cleanup and reinitialization during retries or shutdown.
  • Graceful Shutdown: Introduced an asyncio.Event (_shutdown_event) to facilitate graceful termination of the adapter, allowing ongoing operations to complete and preventing new retries during shutdown.
  • Comprehensive Test Coverage: Added new regression tests to validate the retry logic for transient errors, confirm correct propagation of non-retryable errors, and ensure proper handling of asyncio.CancelledError.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The new _should_retry_startup_error relies on matching TypeError message substrings, which is brittle across Python versions and implementations; consider either checking for a more specific exception type at the source or passing a sentinel/flag from the failing code path so you can reliably distinguish the gateway timeout case.
  • Changing run from a sync method to async def may break existing callers that expect a blocking call; if there are non-async call sites, consider preserving a thin synchronous wrapper (e.g. def run(self): asyncio.run(self._run_impl())) or otherwise ensuring the interface remains consistent with the base Platform API.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `_should_retry_startup_error` relies on matching `TypeError` message substrings, which is brittle across Python versions and implementations; consider either checking for a more specific exception type at the source or passing a sentinel/flag from the failing code path so you can reliably distinguish the gateway timeout case.
- Changing `run` from a sync method to `async def` may break existing callers that expect a blocking call; if there are non-async call sites, consider preserving a thin synchronous wrapper (e.g. `def run(self): asyncio.run(self._run_impl())`) or otherwise ensuring the interface remains consistent with the base `Platform` API.

## Individual Comments

### Comment 1
<location path="astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py" line_range="145-152" />
<code_context>
-
-        self._session_last_message_id: dict[str, str] = {}
-        self._session_scene: dict[str, str] = {}
+    @staticmethod
+    def _should_retry_startup_error(error: Exception) -> bool:
+        if isinstance(error, (asyncio.TimeoutError, ConnectionError, OSError)):
+            return True
+        if isinstance(error, TypeError):
+            error_msg = str(error)
+            return "NoneType" in error_msg and "subscriptable" in error_msg
+        return False
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The TypeError string matching in `_should_retry_startup_error` is brittle and could misclassify unrelated errors.

Relying on the TypeError message containing both "NoneType" and "subscriptable" is fragile across Python versions/locales and may catch unrelated errors. If this is guarding against a specific upstream bug, consider tightening the check using more contextual signals (e.g., stack frame or call site), or avoid treating generic TypeErrors as retryable and instead handle the known case at its source to prevent retry loops on genuine programming errors.

```suggestion
    @staticmethod
    def _should_retry_startup_error(error: Exception) -> bool:
        """
        Determine whether a startup error is transient and should be retried.

        We only treat clearly transient/network-related errors as retryable to avoid
        masking genuine programming errors (e.g., generic TypeError).
        """
        return isinstance(error, (asyncio.TimeoutError, ConnectionError, OSError))
```
</issue_to_address>

### Comment 2
<location path="astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py" line_range="136" />
<code_context>
+
+        self.test_mode = os.environ.get("TEST_MODE", "off") == "on"
+
+    def _create_client(self) -> botClient:
+        client = botClient(
             intents=self.intents,
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the new client lifecycle and retry logic helpers to flatten control flow in `run()` and reduce indirection while preserving existing behavior.

You can keep all behavior while reducing indirection and nested control flow by:

### 1. Merge lifecycle helpers into a single restart helper

`_create_client`, `_close_client`, and `_recreate_client` are small and tightly coupled. You can keep `_create_client` (it’s used in `__init__`) and merge the other two into a single `_restart_client` to reduce the number of moving parts:

```python
def _create_client(self) -> botClient:
    client = botClient(
        intents=self.intents,
        bot_log=False,
        timeout=20,
    )
    client.set_platform(self)
    return client

async def _restart_client(self) -> None:
    try:
        await self.client.close()
    except asyncio.CancelledError:
        raise
    except Exception as e:
        logger.warning(
            "qq_official(%s): close client failed during recovery: %s",
            self.meta().id,
            e,
        )
    self.client = self._create_client()
```

This keeps the existing behavior but removes one helper and one extra level of indirection.

### 2. Extract a “sleep unless shutdown” helper

The `asyncio.wait_for` try/except used just to sleep with shutdown awareness makes the main loop harder to follow. Extracting it into a small helper keeps `run()` linear:

```python
async def _sleep_until_retry_or_shutdown(self) -> bool:
    """Return True if we should retry, False if shutdown was requested."""
    try:
        await asyncio.wait_for(
            self._shutdown_event.wait(),
            timeout=self.STARTUP_RETRY_DELAY_SECONDS,
        )
        return False  # shutdown event set
    except asyncio.TimeoutError:
        return True   # timeout -> retry
```

### 3. Flatten `run()` control flow

With the above helpers, `run()` can be written in a more linear way, with fewer nested `try` blocks and fewer scattered shutdown checks:

```python
async def run(self) -> None:
    try:
        while not self._shutdown_event.is_set():
            try:
                await self.client.start(appid=self.appid, secret=self.secret)
                if self._shutdown_event.is_set():
                    break
                logger.warning(
                    "qq_official(%s): client stopped unexpectedly, restarting in %ss",
                    self.meta().id,
                    self.STARTUP_RETRY_DELAY_SECONDS,
                )
            except asyncio.CancelledError:
                raise
            except Exception as e:
                if not self._should_retry_startup_error(e) or self._shutdown_event.is_set():
                    raise
                logger.warning(
                    "qq_official(%s): startup failed, retrying in %ss: %s",
                    self.meta().id,
                    self.STARTUP_RETRY_DELAY_SECONDS,
                    e,
                )

            await self._restart_client()

            should_retry = await self._sleep_until_retry_or_shutdown()
            if not should_retry:
                break
    finally:
        await self._restart_client()  # or a dedicated _close_client() if you prefer
```

This keeps:

- Automatic restart on unexpected stop or retryable errors.
- Respect for `_shutdown_event`.
- Retry delay with shutdown awareness.
- Client recreation between attempts.
- Final client close with logging.

But it reduces nested `try`/`except`, concentrates shutdown checks, and removes one lifecycle helper, making the restart/shutdown logic easier to scan and reason about.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly enhances the robustness of the QQ Official Platform Adapter by implementing a comprehensive retry mechanism for its startup process. It introduces a STARTUP_RETRY_DELAY_SECONDS constant, refactors client creation and management into dedicated methods (_create_client, _close_client, _recreate_client), and uses an asyncio.Event for graceful shutdown. The run method now includes a retry loop that handles various startup errors, and new tests validate this retry logic. The review suggests that the current method of checking for a specific TypeError message string for retries is fragile and recommends adding a comment to explain this workaround for future maintainers.

Comment on lines +149 to +151
if isinstance(error, TypeError):
error_msg = str(error)
return "NoneType" in error_msg and "subscriptable" in error_msg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Checking for a specific TypeError by inspecting its message string is fragile. If the botpy library changes this error message in a future version, the retry logic will break.

While this may be a necessary workaround, it would be beneficial to add a comment explaining why this specific error message is being checked. This will provide crucial context for future maintainers. For example:

# The botpy library may raise a TypeError("'NoneType' object is not subscriptable")
# on a transient gateway timeout. We catch this specific error to enable retries.
# See issue: [link to botpy issue if any]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

qq_official can crash on startup after GET /gateway/bot timeout: NoneType is not subscriptable

1 participant