Fix stuck InatImport when worker crashes by JoeCohen · Pull Request #4122 · MushroomObserver/mushroom-observer

JoeCohen · 2026-04-13T17:33:49Z

Addresses the "The one failed job" case described in a comment to Issue 4117

Add InatImportRecoveryJob, scheduled every 3 minutes, to find InatImport records stuck in Importing state (updated_at older than STUCK_THRESHOLD) and mark them Done with an error message. Addresses the case where a worker process is SIGKILL'd or OOM-killed and the job's ensure block never runs.

Also bulletproof InatImportJob#perform: add a rescue Exception clause (re-raising shutdown signals) to log unexpected errors to response_errors, and wrap the ensure's done call in safe_done so a DB failure there doesn't suppress the original exception.

Resolves #4119

Add InatImportRecoveryJob, scheduled every 3 minutes, to find InatImport records stuck in Importing state (updated_at older than STUCK_THRESHOLD) and mark them Done with an error message. Addresses the case where a worker process is SIGKILL'd or OOM-killed and the job's ensure block never runs. Also bulletproof InatImportJob#perform: add a rescue Exception clause (re-raising shutdown signals) to log unexpected errors to response_errors, and wrap the ensure's done call in safe_done so a DB failure there doesn't suppress the original exception.

coveralls · 2026-04-13T17:54:35Z

coverage: 96.268% (+0.002%) from 96.266% — jdc-4119-import-hung-after-error into main

Copilot

Pull request overview

This PR adds automated recovery for iNat imports that get stuck in the Importing state (e.g., when a worker is SIGKILL’d/OOM-killed) and hardens InatImportJob cleanup/error handling so import records don’t remain permanently “in progress”.

Changes:

Add InatImport::STUCK_THRESHOLD and InatImport#stuck? to detect stale “Importing” records.
Add InatImportRecoveryJob and schedule it (every 3 minutes in production) to mark stale imports Done and append an error message.
Update InatImportJob to handle unexpected exceptions and to wrap the final done call in safe_done, plus add tests for these behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`app/models/inat_import.rb`	Adds `STUCK_THRESHOLD` and `stuck?` predicate for stale-import detection.
`app/jobs/inat_import_recovery_job.rb`	New recurring job that finds stale imports and marks them `Done` with an error message.
`app/jobs/inat_import_job.rb`	Broadens exception handling and wraps `done` in `safe_done` from `ensure`.
`config/recurring.yml`	Schedules `InatImportRecoveryJob` in production every 3 minutes.
`test/models/inat_import_test.rb`	Adds unit tests for `InatImport#stuck?`.
`test/jobs/inat_import_recovery_job_test.rb`	Adds tests for the recovery job behavior.
`test/jobs/inat_import_job_test.rb`	Adds tests for `safe_done` behavior.

Extract shutdown_signal? predicate to reduce perform's AbcSize, replace $! with $ERROR_INFO (English stdlib), and correct the comment that claimed the ensure block had "already run" before re-raise — it runs during stack unwinding after re-raise, not before. Also fix the ensure behavior: skip safe_done on shutdown signals so the import record stays Importing and SolidQueue can requeue the job, rather than prematurely marking it Done on graceful shutdown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

NoMemoryError, SystemStackError, and ScriptError should propagate rather than be caught, logged, and swallowed — the worker's state is undefined in those cases. Also renames shutdown_signal? to non_rescuable? to reflect the broader intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

If done raises with no original exception in flight, re-raise so the job fails visibly and SolidQueue can retry, rather than silently leaving the import in Importing state and letting the recovery job show a misleading "may have crashed" message. Only swallow done failures when an exception is already being handled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the inline where chain in InatImportRecoveryJob with InatImport.stuck, eliminating duplication and fixing an inconsistency where the job used inclusive .. (<=) while stuck? used strict <. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Combine state, ended_at, and response_errors into a single update! call so the transition to Done and the crash message are written atomically and failures raise rather than silently returning false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Stub Rails.logger.error in test_safe_done_reraises_when_done_fails_on_happy_path to avoid ERROR lines in test output that could be mistaken for real failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cover the rescue Exception body (lines hit when a non-StandardError, non-fatal exception propagates through perform) and the NoMemoryError, SystemStackError, and ScriptError branches of non_rescuable?, which were unreachable via prior tests due to short-circuit evaluation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mo-nathan · 2026-04-20T20:25:25Z


+  # An import stuck in Importing state longer than this is assumed to have
+  # crashed. Must match the schedule in config/recurring.yml.
+  STUCK_THRESHOLD = 3.minutes


Does this mean any job that runs for over 3 minutes gets reaped?

CC:
No — the key is updated_at, not total runtime. The stuck? method's comment explains it:

updated_at is touched after every observation import, so no update in longer than STUCK_THRESHOLD indicates a crashed worker.

A healthy import refreshes updated_at after each observation, so an import processing thousands of observations over an hour would never be flagged. Only an import that has gone silent — no activity for 3 minutes — is considered stuck, which is the signal that the worker crashed.

mo-nathan

Approved pending confirmation that the time delay is for updates while the job is running vs. the total time the job has run so far. Even so 3.minutes seems a bit short.

JoeCohen linked an issue Apr 13, 2026 that may be closed by this pull request

Import hung after Error #4119

Closed

JoeCohen self-assigned this Apr 13, 2026

JoeCohen added bug iNat interaction with iNat labels Apr 13, 2026

Fix test Error in prioir commit

8577ecd

JoeCohen marked this pull request as ready for review April 13, 2026 17:57

Copilot AI review requested due to automatic review settings April 13, 2026 17:57

Copilot started reviewing on behalf of JoeCohen April 13, 2026 17:58 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

Comment thread app/jobs/inat_import_job.rb

Comment thread app/jobs/inat_import_job.rb Outdated

Comment thread app/jobs/inat_import_job.rb

Comment thread app/jobs/inat_import_recovery_job.rb Outdated

Comment thread app/jobs/inat_import_recovery_job.rb Outdated

JoeCohen marked this pull request as draft April 13, 2026 18:04

JoeCohen and others added 7 commits April 13, 2026 11:47

Suppress expected Rails.logger.error in safe_done test

c3ff7f3

Stub Rails.logger.error in test_safe_done_reraises_when_done_fails_on_happy_path to avoid ERROR lines in test output that could be mistaken for real failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JoeCohen marked this pull request as ready for review April 14, 2026 22:48

mo-nathan reviewed Apr 20, 2026

View reviewed changes

mo-nathan approved these changes Apr 20, 2026

View reviewed changes

JoeCohen merged commit 5765e74 into main Apr 26, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stuck InatImport when worker crashes#4122

Fix stuck InatImport when worker crashes#4122
JoeCohen merged 9 commits intomainfrom
jdc-4119-import-hung-after-error

JoeCohen commented Apr 13, 2026 •

edited

Loading

Uh oh!

coveralls commented Apr 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mo-nathan Apr 20, 2026

Uh oh!

JoeCohen Apr 26, 2026

Uh oh!

mo-nathan left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

JoeCohen commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coveralls commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mo-nathan Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

JoeCohen Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

mo-nathan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JoeCohen commented Apr 13, 2026 •

edited

Loading

coveralls commented Apr 13, 2026 •

edited

Loading