Fix stuck InatImport when worker crashes#4122
Conversation
Add InatImportRecoveryJob, scheduled every 3 minutes, to find InatImport records stuck in Importing state (updated_at older than STUCK_THRESHOLD) and mark them Done with an error message. Addresses the case where a worker process is SIGKILL'd or OOM-killed and the job's ensure block never runs. Also bulletproof InatImportJob#perform: add a rescue Exception clause (re-raising shutdown signals) to log unexpected errors to response_errors, and wrap the ensure's done call in safe_done so a DB failure there doesn't suppress the original exception.
There was a problem hiding this comment.
Pull request overview
This PR adds automated recovery for iNat imports that get stuck in the Importing state (e.g., when a worker is SIGKILL’d/OOM-killed) and hardens InatImportJob cleanup/error handling so import records don’t remain permanently “in progress”.
Changes:
- Add
InatImport::STUCK_THRESHOLDandInatImport#stuck?to detect stale “Importing” records. - Add
InatImportRecoveryJoband schedule it (every 3 minutes in production) to mark stale importsDoneand append an error message. - Update
InatImportJobto handle unexpected exceptions and to wrap the finaldonecall insafe_done, plus add tests for these behaviors.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
app/models/inat_import.rb |
Adds STUCK_THRESHOLD and stuck? predicate for stale-import detection. |
app/jobs/inat_import_recovery_job.rb |
New recurring job that finds stale imports and marks them Done with an error message. |
app/jobs/inat_import_job.rb |
Broadens exception handling and wraps done in safe_done from ensure. |
config/recurring.yml |
Schedules InatImportRecoveryJob in production every 3 minutes. |
test/models/inat_import_test.rb |
Adds unit tests for InatImport#stuck?. |
test/jobs/inat_import_recovery_job_test.rb |
Adds tests for the recovery job behavior. |
test/jobs/inat_import_job_test.rb |
Adds tests for safe_done behavior. |
Extract shutdown_signal? predicate to reduce perform's AbcSize, replace $! with $ERROR_INFO (English stdlib), and correct the comment that claimed the ensure block had "already run" before re-raise — it runs during stack unwinding after re-raise, not before. Also fix the ensure behavior: skip safe_done on shutdown signals so the import record stays Importing and SolidQueue can requeue the job, rather than prematurely marking it Done on graceful shutdown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NoMemoryError, SystemStackError, and ScriptError should propagate rather than be caught, logged, and swallowed — the worker's state is undefined in those cases. Also renames shutdown_signal? to non_rescuable? to reflect the broader intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If done raises with no original exception in flight, re-raise so the job fails visibly and SolidQueue can retry, rather than silently leaving the import in Importing state and letting the recovery job show a misleading "may have crashed" message. Only swallow done failures when an exception is already being handled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the inline where chain in InatImportRecoveryJob with InatImport.stuck, eliminating duplication and fixing an inconsistency where the job used inclusive .. (<=) while stuck? used strict <. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Combine state, ended_at, and response_errors into a single update! call so the transition to Done and the crash message are written atomically and failures raise rather than silently returning false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stub Rails.logger.error in test_safe_done_reraises_when_done_fails_on_happy_path to avoid ERROR lines in test output that could be mistaken for real failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover the rescue Exception body (lines hit when a non-StandardError, non-fatal exception propagates through perform) and the NoMemoryError, SystemStackError, and ScriptError branches of non_rescuable?, which were unreachable via prior tests due to short-circuit evaluation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
||
| # An import stuck in Importing state longer than this is assumed to have | ||
| # crashed. Must match the schedule in config/recurring.yml. | ||
| STUCK_THRESHOLD = 3.minutes |
There was a problem hiding this comment.
Does this mean any job that runs for over 3 minutes gets reaped?
There was a problem hiding this comment.
CC:
No — the key is updated_at, not total runtime. The stuck? method's comment explains it:
updated_at is touched after every observation import, so no update in longer than STUCK_THRESHOLD indicates a crashed worker.
A healthy import refreshes updated_at after each observation, so an import processing thousands of observations over an hour would never be flagged. Only an import that has gone silent — no activity for 3 minutes — is considered stuck, which is the signal that the worker crashed.
mo-nathan
left a comment
There was a problem hiding this comment.
Approved pending confirmation that the time delay is for updates while the job is running vs. the total time the job has run so far. Even so 3.minutes seems a bit short.
Addresses the "The one failed job" case described in a comment to Issue 4117
Add InatImportRecoveryJob, scheduled every 3 minutes, to find InatImport records stuck in Importing state (updated_at older than STUCK_THRESHOLD) and mark them Done with an error message. Addresses the case where a worker process is SIGKILL'd or OOM-killed and the job's ensure block never runs.
Also bulletproof InatImportJob#perform: add a rescue Exception clause (re-raising shutdown signals) to log unexpected errors to response_errors, and wrap the ensure's done call in safe_done so a DB failure there doesn't suppress the original exception.
Resolves #4119