fix: retry qq_official startup after gateway timeout#6940
fix: retry qq_official startup after gateway timeout#69401zzxy1 wants to merge 2 commits intoAstrBotDevs:masterfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the resilience of the QQ Official platform adapter by introducing an automatic retry mechanism for startup failures. It ensures that the adapter can recover from temporary network issues or service unavailability without manual intervention, improving the overall stability and reliability of the bot's connection to the QQ platform. The changes also refine the client's lifecycle management and provide robust shutdown procedures. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The new
_should_retry_startup_errorrelies on matchingTypeErrormessage substrings, which is brittle across Python versions and implementations; consider either checking for a more specific exception type at the source or passing a sentinel/flag from the failing code path so you can reliably distinguish the gateway timeout case. - Changing
runfrom a sync method toasync defmay break existing callers that expect a blocking call; if there are non-async call sites, consider preserving a thin synchronous wrapper (e.g.def run(self): asyncio.run(self._run_impl())) or otherwise ensuring the interface remains consistent with the basePlatformAPI.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `_should_retry_startup_error` relies on matching `TypeError` message substrings, which is brittle across Python versions and implementations; consider either checking for a more specific exception type at the source or passing a sentinel/flag from the failing code path so you can reliably distinguish the gateway timeout case.
- Changing `run` from a sync method to `async def` may break existing callers that expect a blocking call; if there are non-async call sites, consider preserving a thin synchronous wrapper (e.g. `def run(self): asyncio.run(self._run_impl())`) or otherwise ensuring the interface remains consistent with the base `Platform` API.
## Individual Comments
### Comment 1
<location path="astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py" line_range="145-152" />
<code_context>
-
- self._session_last_message_id: dict[str, str] = {}
- self._session_scene: dict[str, str] = {}
+ @staticmethod
+ def _should_retry_startup_error(error: Exception) -> bool:
+ if isinstance(error, (asyncio.TimeoutError, ConnectionError, OSError)):
+ return True
+ if isinstance(error, TypeError):
+ error_msg = str(error)
+ return "NoneType" in error_msg and "subscriptable" in error_msg
+ return False
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The TypeError string matching in `_should_retry_startup_error` is brittle and could misclassify unrelated errors.
Relying on the TypeError message containing both "NoneType" and "subscriptable" is fragile across Python versions/locales and may catch unrelated errors. If this is guarding against a specific upstream bug, consider tightening the check using more contextual signals (e.g., stack frame or call site), or avoid treating generic TypeErrors as retryable and instead handle the known case at its source to prevent retry loops on genuine programming errors.
```suggestion
@staticmethod
def _should_retry_startup_error(error: Exception) -> bool:
"""
Determine whether a startup error is transient and should be retried.
We only treat clearly transient/network-related errors as retryable to avoid
masking genuine programming errors (e.g., generic TypeError).
"""
return isinstance(error, (asyncio.TimeoutError, ConnectionError, OSError))
```
</issue_to_address>
### Comment 2
<location path="astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py" line_range="136" />
<code_context>
+
+ self.test_mode = os.environ.get("TEST_MODE", "off") == "on"
+
+ def _create_client(self) -> botClient:
+ client = botClient(
intents=self.intents,
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the new client lifecycle and retry logic helpers to flatten control flow in `run()` and reduce indirection while preserving existing behavior.
You can keep all behavior while reducing indirection and nested control flow by:
### 1. Merge lifecycle helpers into a single restart helper
`_create_client`, `_close_client`, and `_recreate_client` are small and tightly coupled. You can keep `_create_client` (it’s used in `__init__`) and merge the other two into a single `_restart_client` to reduce the number of moving parts:
```python
def _create_client(self) -> botClient:
client = botClient(
intents=self.intents,
bot_log=False,
timeout=20,
)
client.set_platform(self)
return client
async def _restart_client(self) -> None:
try:
await self.client.close()
except asyncio.CancelledError:
raise
except Exception as e:
logger.warning(
"qq_official(%s): close client failed during recovery: %s",
self.meta().id,
e,
)
self.client = self._create_client()
```
This keeps the existing behavior but removes one helper and one extra level of indirection.
### 2. Extract a “sleep unless shutdown” helper
The `asyncio.wait_for` try/except used just to sleep with shutdown awareness makes the main loop harder to follow. Extracting it into a small helper keeps `run()` linear:
```python
async def _sleep_until_retry_or_shutdown(self) -> bool:
"""Return True if we should retry, False if shutdown was requested."""
try:
await asyncio.wait_for(
self._shutdown_event.wait(),
timeout=self.STARTUP_RETRY_DELAY_SECONDS,
)
return False # shutdown event set
except asyncio.TimeoutError:
return True # timeout -> retry
```
### 3. Flatten `run()` control flow
With the above helpers, `run()` can be written in a more linear way, with fewer nested `try` blocks and fewer scattered shutdown checks:
```python
async def run(self) -> None:
try:
while not self._shutdown_event.is_set():
try:
await self.client.start(appid=self.appid, secret=self.secret)
if self._shutdown_event.is_set():
break
logger.warning(
"qq_official(%s): client stopped unexpectedly, restarting in %ss",
self.meta().id,
self.STARTUP_RETRY_DELAY_SECONDS,
)
except asyncio.CancelledError:
raise
except Exception as e:
if not self._should_retry_startup_error(e) or self._shutdown_event.is_set():
raise
logger.warning(
"qq_official(%s): startup failed, retrying in %ss: %s",
self.meta().id,
self.STARTUP_RETRY_DELAY_SECONDS,
e,
)
await self._restart_client()
should_retry = await self._sleep_until_retry_or_shutdown()
if not should_retry:
break
finally:
await self._restart_client() # or a dedicated _close_client() if you prefer
```
This keeps:
- Automatic restart on unexpected stop or retryable errors.
- Respect for `_shutdown_event`.
- Retry delay with shutdown awareness.
- Client recreation between attempts.
- Final client close with logging.
But it reduces nested `try`/`except`, concentrates shutdown checks, and removes one lifecycle helper, making the restart/shutdown logic easier to scan and reason about.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
astrbot/core/platform/sources/qqofficial/qqofficial_platform_adapter.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Code Review
This pull request significantly enhances the robustness of the QQ Official Platform Adapter by implementing a comprehensive retry mechanism for its startup process. It introduces a STARTUP_RETRY_DELAY_SECONDS constant, refactors client creation and management into dedicated methods (_create_client, _close_client, _recreate_client), and uses an asyncio.Event for graceful shutdown. The run method now includes a retry loop that handles various startup errors, and new tests validate this retry logic. The review suggests that the current method of checking for a specific TypeError message string for retries is fragile and recommends adding a comment to explain this workaround for future maintainers.
| if isinstance(error, TypeError): | ||
| error_msg = str(error) | ||
| return "NoneType" in error_msg and "subscriptable" in error_msg |
There was a problem hiding this comment.
Checking for a specific TypeError by inspecting its message string is fragile. If the botpy library changes this error message in a future version, the retry logic will break.
While this may be a necessary workaround, it would be beneficial to add a comment explaining why this specific error message is being checked. This will provide crucial context for future maintainers. For example:
# The botpy library may raise a TypeError("'NoneType' object is not subscriptable")
# on a transient gateway timeout. We catch this specific error to enable retries.
# See issue: [link to botpy issue if any]
Summary
Testing
Closes #6858
Summary by Sourcery
Add retry and graceful shutdown logic to the QQ official platform adapter startup flow to handle transient gateway failures.
Bug Fixes:
Tests: