
Migrate HPC Backend to svc_compose and /projects/CRBM#116

Merged
jcschaff merged 2 commits into main from migrate-hpc-svc-compose
May 18, 2026


@jcschaff

Summary

Migrate compose-api off the legacy crbmapi HPC account and the cfs09:/home/FCAM/crbmapi NFS share onto the new svc_compose service account and the cfs15:/projects/CRBM share. Aligns the dev/local SLURM partition with prod (vcell/vcell-services). Bundles two pre-existing bug fixes that are unrelated to the migration itself but were prerequisites for validating it against the live HPC.

Commits

  1. Fix Circular Import and SLURM Sentinel Crash in Job Monitor

    • Break simulation_service ↔ dependencies circular import (broke PyCharm's unittest loader)
    • Normalize SLURM sacct "Unknown" / "N/A" sentinel to None in SlurmJob.from_sacct_formatted_output — previously crashed the job_monitor poll loop on every PENDING job with ValueError: Invalid isoformat string: 'Unknown'
  2. Migrate HPC Backend to svc_compose and /projects/CRBM Storage

    • Container mountPath, PV nfs.path, and SLURM submit user updated across rke and local overlays
    • PV nfs.path narrowed from /projects to /projects/CRBM to match container mountPath (avoids path-doubling)
    • Local overlay SLURM partition aligned with prod (general/general -> vcell/vcell-services); single-node mantis-039 pin removed
    • shared/ghcr/ssh sealed secrets re-encrypted for the new user/keypair
    • hpc_utils.format_experiment_path and config.internal_mount_dir default refreshed; stale cfs07:/ifs/vcell PVC comment removed
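The sentinel normalization from commit 1 can be sketched roughly as follows. This is a minimal illustration, not the code in SlurmJob.from_sacct_formatted_output: the helper names `normalize_sacct_time` and `to_datetime` are hypothetical, and only the three sentinel strings and the use of `datetime.fromisoformat` come from the PR text.

```python
from datetime import datetime
from typing import Optional

# Placeholder strings that sacct may print instead of a timestamp
# (the three values named in this PR).
SACCT_TIME_SENTINELS = frozenset({"", "Unknown", "N/A"})


def normalize_sacct_time(raw: str) -> Optional[str]:
    """Hypothetical helper: map sacct placeholder values to None at the parser boundary."""
    value = raw.strip()
    return None if value in SACCT_TIME_SENTINELS else value


def to_datetime(value: Optional[str]) -> Optional[datetime]:
    """Hypothetical downstream step: parse only when a real timestamp is present."""
    return datetime.fromisoformat(value) if value is not None else None


# A PENDING job reports "Unknown" for its start time; a finished job reports a timestamp.
pending_start = to_datetime(normalize_sacct_time("Unknown"))
finished_start = to_datetime(normalize_sacct_time("2026-05-18T15:50:00"))
```

Normalizing once in the parser means downstream code only needs an `is not None` check and never sees the raw placeholder strings.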

Verification

  • ssh svc_compose@hamantis with the new key works; svc_compose has rwx on /projects/CRBM (via crbm group) and /projects/CRBM/compose_api (via domain users group)
  • Manifests are kubectl-deployable (no schema changes; sealed secrets validated by re-sealing with kubeseal against the cluster's pubkey)
  • Test suite shakedown found no rename-related regressions. Remaining failures observed are pre-existing/out-of-scope: --nodelist=<19 nodes> without --nodes=1 in simulation_service.py:97 causes PartitionNodeLimit when SLURM_NODE_LIST is multi-node, and singularity_build_* jobs fail in 00:00:00 (needs separate investigation).

Test plan

  • Re-seal review: confirm kubectl get sealedsecret -n compose-api-rke decrypts the new payloads
  • Deploy compose-api-rke overlay to a staging cluster; verify pod can read/write /projects/CRBM/compose_api/prod/...
  • Submit a real simulation end-to-end as svc_compose; confirm sbatch lands in /projects/CRBM/compose_api/prod/slurm_sbatch/ and htclogs appear in htclogs/
  • job_monitor poll loop no longer logs Invalid isoformat string: 'Unknown' for PENDING jobs
  • Pre-existing files in /projects/CRBM/compose_api (still owned by crbmapi) remain readable; new files written by svc_compose are owned correctly

Follow-ups (not in this PR)

  • Cancel ~15 stale PENDING jobs accumulated under svc_compose during validation: scancel -u svc_compose
  • Fix simulation_service.py:97 to emit --nodes=1 alongside --nodelist= (so a multi-node list means "any one of these," not "all of them")
  • Investigate why singularity_build_* jobs fail instantly under svc_compose
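The second follow-up could be sketched as below. This only illustrates the argument assembly the follow-up proposes; the function name `node_constraint_args` is hypothetical, the actual code at simulation_service.py:97 is not shown in this PR, and how the scheduler interprets the flag combination is not verified here.

```python
from typing import List, Sequence


def node_constraint_args(node_list: Sequence[str]) -> List[str]:
    """Hypothetical helper: build sbatch node-selection flags.

    Emits --nodes=1 together with --nodelist=..., as the follow-up proposes.
    Returns no flags when the node list is empty.
    """
    if not node_list:
        return []
    return ["--nodes=1", "--nodelist=" + ",".join(node_list)]


# Node names below are made up for illustration.
args = node_constraint_args(["mantis-001", "mantis-002", "mantis-003"])
```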

jcschaff and others added 2 commits May 18, 2026 15:50
- simulation_service.py imported get_required_database_service from
  dependencies at module top; dependencies imports SimulationService
  back from simulation_service. Pytest tolerated the cycle by import
  order, but PyCharm's unittest loader failed with a partially-
  initialized-module ImportError. Made the dependencies import lazy
  inside the two call sites that use it (build_container,
  download_container).

- SLURM sacct emits the literal string "Unknown" for start/end times
  on PENDING jobs. SlurmJob.from_sacct_formatted_output passed it
  through, and hpc_db.update_hpcrun_status called
  datetime.fromisoformat("Unknown"), crashing the job_monitor poll
  loop once per pending job. Normalize "", "Unknown", "N/A" to None
  at the parser boundary; existing downstream `is not None` guards
  handle it cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
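The lazy-import fix described in the first commit can be illustrated with two throwaway modules. This is a sketch under assumptions: the module names `demo_simulation_service` and `demo_dependencies` are hypothetical stand-ins written to a temporary directory, and only the shape of the cycle (the service module needing a function from a module that imports the service class back) is taken from the commit message.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for dependencies.py: imports the service class back.
DEPENDENCIES_SRC = textwrap.dedent(
    """
    from demo_simulation_service import SimulationService

    def get_required_database_service():
        return "database-service"
    """
)

# Hypothetical stand-in for simulation_service.py, with the import made lazy.
SERVICE_SRC = textwrap.dedent(
    """
    class SimulationService:
        def build_container(self):
            # Lazy import: runs at call time, after this module is fully initialized.
            from demo_dependencies import get_required_database_service
            return get_required_database_service()
    """
)


def run_demo() -> str:
    """Write both modules to a temp dir, import the service module first, and call it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "demo_dependencies.py").write_text(DEPENDENCIES_SRC)
        (root / "demo_simulation_service.py").write_text(SERVICE_SRC)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            service_module = importlib.import_module("demo_simulation_service")
            return service_module.SimulationService().build_container()
        finally:
            # Clean up so the demo leaves no trace in the interpreter state.
            sys.path.remove(tmp)
            sys.modules.pop("demo_simulation_service", None)
            sys.modules.pop("demo_dependencies", None)


result = run_demo()
```

Because the import runs inside the method, loading the service module no longer triggers the other module, so the result does not depend on which of the two is imported first.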
Move compose-api off the legacy crbmapi HPC account and
cfs09:/home/FCAM/crbmapi NFS share onto the svc_compose service
account and cfs15:/projects/CRBM share.

Manifests:
- api.yaml container mountPath /home/FCAM/crbmapi -> /projects/CRBM
- rke and local PV nfs.path narrowed to /projects/CRBM on cfs15
  (matches mountPath to avoid path-doubling)
- rke and local env files: SLURM_SUBMIT_USER, INTERNAL_MOUNT_DIR,
  SIMULATION_STORE_BASE_PATH, *_BASE_PATH updated
- local overlay SLURM partition aligned with prod
  (general/general -> vcell/vcell-services), single-node
  mantis-039 pin removed
- shared/ghcr/ssh sealed secrets re-encrypted for new user/keypair
- stale cfs07:/ifs/vcell PVC comments removed

Code:
- hpc_utils.format_experiment_path base /home/FCAM/crbmapi ->
  /projects/CRBM
- config.Settings.internal_mount_dir default /mnt/crpbmapi (typo)
  -> /mnt/projects/CRBM/compose_api, comment refreshed
- .dev_env_TEMPLATE: INTERNAL_MOUNT_DIR and
  SIMULATION_STORE_BASE_PATH updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
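One reading of the path-doubling note is sketched below: if the PV exports /projects but the container mounts it at /projects/CRBM, a file's container path gains a second CRBM segment. The helper `container_path` is hypothetical and only models the path arithmetic; it is not taken from hpc_utils or the manifests.

```python
from pathlib import PurePosixPath


def container_path(hpc_path: str, nfs_export: str, mount_path: str) -> str:
    """Hypothetical helper: where a file on the NFS share appears inside the container.

    The part of hpc_path below the exported directory is re-rooted at mount_path.
    """
    relative = PurePosixPath(hpc_path).relative_to(nfs_export)
    return str(PurePosixPath(mount_path) / relative)


hpc_file = "/projects/CRBM/compose_api"

# Export wider than the mount point: the CRBM segment appears twice.
doubled = container_path(hpc_file, nfs_export="/projects", mount_path="/projects/CRBM")

# Export narrowed to match the mount point: paths are identical on both sides.
aligned = container_path(hpc_file, nfs_export="/projects/CRBM", mount_path="/projects/CRBM")
```

With the export and mountPath aligned, a path written into an sbatch script inside the container resolves to the same location on the HPC nodes.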
jcschaff merged commit 83e9e26 into main on May 18, 2026
11 checks passed