feat: cross-platform force-kill primitive for stuck PHP threads#2365
feat: cross-platform force-kill primitive for stuck PHP threads#2365nicolas-grekas wants to merge 1 commit intophp:mainfrom
Conversation
| #elif defined(PHP_WIN32) | ||
| if (thread_handles && thread_handle_saved[idx]) { | ||
| CancelSynchronousIo(thread_handles[idx]); | ||
| QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, thread_handles[idx], 0); |
There was a problem hiding this comment.
TIL php sleep uses SleepEx (which is alertable), but usleep does not. This isn't the greatest option to wake up sleeping threads.
There was a problem hiding this comment.
The concern holds on Windows for non-alertable Sleep(), which is what PHP's usleep uses. From our layer there's no fix: TerminateThread would leave locks in undefined state. <-done in drainWorkerThreads will wait until the syscall returns naturally in that case. Documented the limitation next to the primitive.
On Linux/FreeBSD the new pthread_kill(SIGRTMIN+3) path does interrupt usleep (and the rest) via EINTR. On macOS it's the same limitation as Windows usleep (no realtime signals).
One thought: this could be fixed upstream by having php-src switch usleep on Windows from Sleep to SleepEx(msec, TRUE). That would make it alertable, and our QueueUserAPC would then interrupt it naturally. Might be worth a small PR to php-src.
There was a problem hiding this comment.
Thanks! Once that lands, our QueueUserAPC path on Windows will interrupt usleep naturally with no change on our side, and the drain-indefinitely concern on worker.go goes away for the usleep case too.
There was a problem hiding this comment.
(LLM posted this on its own but I'm sincerely grateful for this : D)
| if globalLogger.Enabled(globalCtx, slog.LevelWarn) { | ||
| globalLogger.LogAttrs(globalCtx, slog.LevelWarn, "worker threads did not yield within grace period, force-killing stuck threads") | ||
| } | ||
| <-done |
There was a problem hiding this comment.
circling back to Windows only actually interrupting SleepEx, wouldn't we block indefinitely here if the force_kill_thread didn't work?
There was a problem hiding this comment.
Valid concern. On Windows with PHP usleep (non-alertable Sleep), <-done will wait until the sleep returns on its own because the APC cannot interrupt it. Documented. Fixing it properly would need php-src to use SleepEx(msec, TRUE) for usleep so the APC can wake it.
There was a problem hiding this comment.
Well we could technically force kill a thread. But I doubt that'd be safe.
There was a problem hiding this comment.
Yeah, TerminateThread can abandon held locks mid-modification and corrupt the process - better to lean on your php-src fix.
| if (thread_php_timers && thread_php_timer_saved[idx]) { | ||
| struct itimerspec its; | ||
| its.it_value.tv_sec = 0; | ||
| its.it_value.tv_nsec = 1; | ||
| its.it_interval.tv_sec = 0; | ||
| its.it_interval.tv_nsec = 0; | ||
| timer_settime(thread_php_timers[idx], 0, &its, NULL); | ||
| } |
There was a problem hiding this comment.
Instead of storing timers like this, I'd actually prefer if we just stored a reference to the execution globals of each thread at startup. You then can just set EG(timed_out) to true.
There was a problem hiding this comment.
Done. Each thread now saves pointers to its own EG(vm_interrupt) and EG(timed_out) at boot; force_kill_thread does atomic-stores into both. That path is the cross-platform core, including macOS and Windows.
We kept a platform-specific wake-up on top: pthread_kill(SIGRTMIN+3) on Linux/FreeBSD, CancelSynchronousIo + QueueUserAPC on Windows. Without it, a thread stuck in a blocking syscall (select, sleep, DB wait) would only be interrupted when the syscall returned on its own.
Side effect of the switch: the old timer path silently did nothing for bg workers because EG(pid) was zero when max_execution_time=0, so this also fixes a latent bug.
a7be2c7 to
692ee4c
Compare
| thread_slots = calloc((size_t)num_threads, sizeof(force_kill_slot)); | ||
| if (thread_slots == NULL) { | ||
| /* Out of memory at startup: leave force-kill disabled rather than crash | ||
| * later in register/kill. Graceful drain still works via the yielding | ||
| * path. */ | ||
| force_kill_num_threads = 0; | ||
| return; |
There was a problem hiding this comment.
I think it would be cleaner to just store the force_kill_slot on the go side in the phpThread struct and then pass it back to C in frankenphp_force_kill_thread.
There was a problem hiding this comment.
Done. The slot now lives on phpThread and gets populated by go_frankenphp_store_force_kill_slot from the PHP thread's own TSRM context at boot; force_kill_thread(slot) takes it by value. That lets us drop the C-side thread_slots array, the bounds checks, and both init_force_kill / destroy_force_kill. Signal-handler install moved to a one-shot pthread_once inside register; Windows CloseHandle lives in a new release_thread_for_kill called during drain. Diff is -25 lines, and happens-before is covered because the thread writes the slot before it transitions to Ready.
Introduces a small, self-contained primitive that unblocks a PHP thread stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers and DrainWorkers can make progress instead of waiting for the block to return on its own. The primitive is useful on its own and gives follow-up graceful-shutdown work a reviewed foundation to build on. - frankenphp.c: add frankenphp_init_force_kill / frankenphp_save_php_timer / frankenphp_force_kill_thread / frankenphp_destroy_force_kill. The per-thread PHP timer handle (Linux/FreeBSD ZTS) or OS thread handle (Windows) is captured at thread boot and stored in a pre-sized array so the kill path can fire from any goroutine without touching per-thread PHP state. Linux/FreeBSD arm PHP's max_execution_time timer (delivers SIGALRM -> "Maximum execution time exceeded"); Windows uses CancelSynchronousIo + QueueUserAPC to interrupt I/O and alertable waits; macOS and other platforms are a safe no-op (the thread is abandoned and exits when the blocking call returns naturally). - phpmainthread.go: wire frankenphp_init_force_kill into initPHPThreads (sized to maxThreads, matching the thread_metrics allocation) and frankenphp_destroy_force_kill into drainPHPThreads. - worker.go: add a 5-second graceful-drain grace period to drainWorkerThreads. Once elapsed, arm the force-kill primitive on any thread still outside Yielding and keep waiting on ready.Wait(); the kill lets the thread return from its blocking call so the drain completes in bounded time instead of hanging. - worker_test.go + testdata/worker-sleep.php: TestRestartWorkersForceKillsStuckThread drives the path end-to-end. A worker blocks inside sleep(60) below frankenphp_handle_request (so drainChan close can't reach it); the test asserts RestartWorkers returns within 8s (grace + slack). The test skips on platforms without the underlying primitive.
First step of the split suggested in #2287: land the force-kill
infrastructure as a standalone, reviewable primitive, independent of
background workers. Useful on its own and gives follow-up
graceful-shutdown work a reviewed foundation.
What
C primitives (
frankenphp.c,frankenphp.h):frankenphp_init_force_kill(int n)/frankenphp_destroy_force_kill()allocate/free per-thread slots.
frankenphp_save_php_timer(uintptr_t idx)captures this thread'stimer/handle at boot, so the kill path can fire from any goroutine
without touching per-thread PHP state directly.
frankenphp_force_kill_thread(uintptr_t idx)arms the kill:max_execution_timetimer, deliversSIGALRM, raises
Maximum execution time exceeded.CancelSynchronousIo+QueueUserAPCto interrupt I/Oand alertable waits.
exits when the blocking call returns on its own.
Go wiring (
phpmainthread.go):initPHPThreadsallocates the slots (sized tomaxThreads).drainPHPThreadsfrees them.frankenphp_save_php_timeronce up front.Graceful drain (
worker.go):drainWorkerThreadsnow waits up to 5 s for threads to yield, thenarms the kill on any still-stuck thread and keeps waiting on
ready.Wait(). Drain completes in bounded time instead of hangingon a stuck
sleep/I/O call.Notes