Skip to content

Fix stuck InatImport when worker crashes#4122

Merged
JoeCohen merged 9 commits intomainfrom
jdc-4119-import-hung-after-error
Apr 26, 2026
Merged

Fix stuck InatImport when worker crashes#4122
JoeCohen merged 9 commits intomainfrom
jdc-4119-import-hung-after-error

Conversation

@JoeCohen
Copy link
Copy Markdown
Member

@JoeCohen JoeCohen commented Apr 13, 2026

Addresses the "The one failed job" case described in a comment to Issue 4117

Add InatImportRecoveryJob, scheduled every 3 minutes, to find InatImport records stuck in Importing state (updated_at older than STUCK_THRESHOLD) and mark them Done with an error message. Addresses the case where a worker process is SIGKILL'd or OOM-killed and the job's ensure block never runs.

Also bulletproof InatImportJob#perform: add a rescue Exception clause (re-raising shutdown signals) to log unexpected errors to response_errors, and wrap the ensure's done call in safe_done so a DB failure there doesn't suppress the original exception.

Resolves #4119

Add InatImportRecoveryJob, scheduled every 3 minutes, to find
InatImport records stuck in Importing state (updated_at older than
STUCK_THRESHOLD) and mark them Done with an error message. Addresses
the case where a worker process is SIGKILL'd or OOM-killed and the
job's ensure block never runs.

Also bulletproof InatImportJob#perform: add a rescue Exception clause
(re-raising shutdown signals) to log unexpected errors to response_errors,
and wrap the ensure's done call in safe_done so a DB failure there
doesn't suppress the original exception.
@JoeCohen JoeCohen linked an issue Apr 13, 2026 that may be closed by this pull request
@JoeCohen JoeCohen self-assigned this Apr 13, 2026
@JoeCohen JoeCohen added bug iNat interaction with iNat labels Apr 13, 2026
@coveralls
Copy link
Copy Markdown
Collaborator

coveralls commented Apr 13, 2026

Coverage Status

coverage: 96.268% (+0.002%) from 96.266% — jdc-4119-import-hung-after-error into main

@JoeCohen JoeCohen marked this pull request as ready for review April 13, 2026 17:57
Copilot AI review requested due to automatic review settings April 13, 2026 17:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds automated recovery for iNat imports that get stuck in the Importing state (e.g., when a worker is SIGKILL’d/OOM-killed) and hardens InatImportJob cleanup/error handling so import records don’t remain permanently “in progress”.

Changes:

  • Add InatImport::STUCK_THRESHOLD and InatImport#stuck? to detect stale “Importing” records.
  • Add InatImportRecoveryJob and schedule it (every 3 minutes in production) to mark stale imports Done and append an error message.
  • Update InatImportJob to handle unexpected exceptions and to wrap the final done call in safe_done, plus add tests for these behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
app/models/inat_import.rb Adds STUCK_THRESHOLD and stuck? predicate for stale-import detection.
app/jobs/inat_import_recovery_job.rb New recurring job that finds stale imports and marks them Done with an error message.
app/jobs/inat_import_job.rb Broadens exception handling and wraps done in safe_done from ensure.
config/recurring.yml Schedules InatImportRecoveryJob in production every 3 minutes.
test/models/inat_import_test.rb Adds unit tests for InatImport#stuck?.
test/jobs/inat_import_recovery_job_test.rb Adds tests for the recovery job behavior.
test/jobs/inat_import_job_test.rb Adds tests for safe_done behavior.

Comment thread app/jobs/inat_import_job.rb
Comment thread app/jobs/inat_import_job.rb Outdated
Comment thread app/jobs/inat_import_job.rb
Comment thread app/jobs/inat_import_recovery_job.rb Outdated
Comment thread app/jobs/inat_import_recovery_job.rb Outdated
@JoeCohen JoeCohen marked this pull request as draft April 13, 2026 18:04
JoeCohen and others added 7 commits April 13, 2026 11:47
Extract shutdown_signal? predicate to reduce perform's AbcSize,
replace $! with $ERROR_INFO (English stdlib), and correct the comment
that claimed the ensure block had "already run" before re-raise —
it runs during stack unwinding after re-raise, not before.

Also fix the ensure behavior: skip safe_done on shutdown signals so
the import record stays Importing and SolidQueue can requeue the job,
rather than prematurely marking it Done on graceful shutdown.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NoMemoryError, SystemStackError, and ScriptError should propagate
rather than be caught, logged, and swallowed — the worker's state is
undefined in those cases. Also renames shutdown_signal? to
non_rescuable? to reflect the broader intent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If done raises with no original exception in flight, re-raise so the
job fails visibly and SolidQueue can retry, rather than silently
leaving the import in Importing state and letting the recovery job
show a misleading "may have crashed" message. Only swallow done
failures when an exception is already being handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the inline where chain in InatImportRecoveryJob with
InatImport.stuck, eliminating duplication and fixing an inconsistency
where the job used inclusive .. (<=) while stuck? used strict <.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Combine state, ended_at, and response_errors into a single update!
call so the transition to Done and the crash message are written
atomically and failures raise rather than silently returning false.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stub Rails.logger.error in test_safe_done_reraises_when_done_fails_on_happy_path
to avoid ERROR lines in test output that could be mistaken for real failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover the rescue Exception body (lines hit when a non-StandardError,
non-fatal exception propagates through perform) and the NoMemoryError,
SystemStackError, and ScriptError branches of non_rescuable?, which
were unreachable via prior tests due to short-circuit evaluation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JoeCohen JoeCohen marked this pull request as ready for review April 14, 2026 22:48
Comment thread app/models/inat_import.rb

# An import stuck in Importing state longer than this is assumed to have
# crashed. Must match the schedule in config/recurring.yml.
STUCK_THRESHOLD = 3.minutes
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean any job that runs for over 3 minutes gets reaped?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC:
No — the key is updated_at, not total runtime. The stuck? method's comment explains it:

updated_at is touched after every observation import, so no update in longer than STUCK_THRESHOLD indicates a crashed worker.

A healthy import refreshes updated_at after each observation, so an import processing thousands of observations over an hour would never be flagged. Only an import that has gone silent — no activity for 3 minutes — is considered stuck, which is the signal that the worker crashed.

Copy link
Copy Markdown
Member

@mo-nathan mo-nathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending confirmation that the time delay is for updates while the job is running vs. the total time the job has run so far. Even so 3.minutes seems a bit short.

@JoeCohen JoeCohen merged commit 5765e74 into main Apr 26, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug iNat interaction with iNat

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import hung after Error

4 participants