Migrate HPC Backend to svc_compose and /projects/CRBM #116
Merged
- simulation_service.py imported get_required_database_service from
dependencies at module top; dependencies imports SimulationService
back from simulation_service. Pytest tolerated the cycle by import
order, but PyCharm's unittest loader failed with a partially-
initialized-module ImportError. Made the dependencies import lazy
inside the two call sites that use it (build_container,
download_container).
- SLURM sacct emits the literal string "Unknown" for start/end times
on PENDING jobs. SlurmJob.from_sacct_formatted_output passed it
through, and hpc_db.update_hpcrun_status called
datetime.fromisoformat("Unknown"), crashing the job_monitor poll
loop once per pending job. Normalize "", "Unknown", "N/A" to None
at the parser boundary; existing downstream `is not None` guards
handle it cleanly.
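A minimal sketch of the parser-boundary normalization described in the bullet above. The helper names `normalize_sacct_time` and `parse_sacct_time` are hypothetical and are not taken from the PR; only the sentinel values and the use of `datetime.fromisoformat` come from the commit message.

```python
from datetime import datetime
from typing import Optional

# Placeholder values named in the commit message; sacct reports "Unknown"
# for start/end times on PENDING jobs.
_SACCT_MISSING = {"", "Unknown", "N/A"}


def normalize_sacct_time(raw: Optional[str]) -> Optional[str]:
    """Map sacct placeholder strings to None; pass real timestamps through."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in _SACCT_MISSING else value


def parse_sacct_time(raw: Optional[str]) -> Optional[datetime]:
    """Parse a sacct timestamp, returning None instead of raising on placeholders."""
    value = normalize_sacct_time(raw)
    return datetime.fromisoformat(value) if value is not None else None
```

Normalizing once at the parser keeps every downstream consumer on the existing `is not None` path, so no caller needs to know about the placeholder strings.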
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
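The circular-import fix in the first bullet of this commit can be illustrated with a self-contained sketch. The module names `demo_dependencies` and `demo_simulation_service` are hypothetical stand-ins written to a temporary directory; the real modules do more, and this only shows why a function-level import makes the import order irrelevant.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for dependencies.py: imports back into the cycle.
DEPENDENCIES_SRC = textwrap.dedent(
    '''
    from demo_simulation_service import SimulationService

    def get_required_database_service():
        return "database-service"
    '''
)

# Hypothetical stand-in for simulation_service.py: no module-level import
# of the dependencies module, only a lazy import inside the call site.
SIMULATION_SERVICE_SRC = textwrap.dedent(
    '''
    class SimulationService:
        def build_container(self):
            # Resolved at call time, after both modules are fully initialized.
            from demo_dependencies import get_required_database_service
            return get_required_database_service()
    '''
)


def import_in_order(first: str, second: str) -> str:
    """Import the two demo modules in the given order and call build_container."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "demo_dependencies.py").write_text(DEPENDENCIES_SRC)
        (root / "demo_simulation_service.py").write_text(SIMULATION_SERVICE_SRC)
        # Drop cached copies so each call exercises a fresh import order.
        for name in (first, second):
            sys.modules.pop(name, None)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(first)
            importlib.import_module(second)
            service = sys.modules["demo_simulation_service"].SimulationService()
            return service.build_container()
        finally:
            sys.path.remove(tmp)
```

With the import deferred into the method body, both import orders succeed, which is the property a test loader with a different discovery order depends on.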
Move compose-api off the legacy crbmapi HPC account and
cfs09:/home/FCAM/crbmapi NFS share onto the svc_compose service
account and cfs15:/projects/CRBM share.
Manifests:
- api.yaml container mountPath /home/FCAM/crbmapi -> /projects/CRBM
- rke and local PV nfs.path narrowed to /projects/CRBM on cfs15
(matches mountPath to avoid path-doubling)
- rke and local env files: SLURM_SUBMIT_USER, INTERNAL_MOUNT_DIR,
SIMULATION_STORE_BASE_PATH, *_BASE_PATH updated
- local overlay SLURM partition aligned with prod (general/general ->
vcell/vcell-services), single-node mantis-039 pin removed
- shared/ghcr/ssh sealed secrets re-encrypted for new user/keypair
- stale cfs07:/ifs/vcell PVC comments removed
Code:
- hpc_utils.format_experiment_path base /home/FCAM/crbmapi ->
/projects/CRBM
- config.Settings.internal_mount_dir default /mnt/crpbmapi (typo) ->
/mnt/projects/CRBM/compose_api, comment refreshed
- .dev_env_TEMPLATE: INTERNAL_MOUNT_DIR and SIMULATION_STORE_BASE_PATH
updated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
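The path-doubling that the narrowed `nfs.path` avoids can be sketched as a small path calculation. The function `container_view` is a hypothetical helper for illustration only; it models how a file under the exported NFS directory appears inside the container for a given `nfs.path` and `mountPath`.

```python
from pathlib import PurePosixPath


def container_view(host_path: str, nfs_path: str, mount_path: str) -> str:
    """Return where a file on the NFS share appears inside the container.

    The PV exports `nfs_path`; the container sees that directory at
    `mount_path`, so anything below `nfs_path` is re-rooted under `mount_path`.
    """
    relative = PurePosixPath(host_path).relative_to(nfs_path)
    return str(PurePosixPath(mount_path) / relative)


# Exporting the broader /projects directory but mounting it at
# /projects/CRBM doubles the CRBM path segment inside the container.
doubled = container_view("/projects/CRBM/compose_api", "/projects", "/projects/CRBM")

# Exporting /projects/CRBM itself keeps host and container paths identical.
aligned = container_view("/projects/CRBM/compose_api", "/projects/CRBM", "/projects/CRBM")
```

Keeping the two views identical matters here because paths written into SLURM job scripts have to resolve both inside the pod and on the HPC nodes.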
Summary
Migrate compose-api off the legacy `crbmapi` HPC account + `cfs09:/home/FCAM/crbmapi` NFS share onto the new `svc_compose` service account + `cfs15:/projects/CRBM` share. Aligns dev/local SLURM partition with prod (`vcell/vcell-services`). Bundles two unrelated pre-existing bug fixes that were prerequisites for validating the migration with the live HPC.
Commits
Fix Circular Import and SLURM Sentinel Crash in Job Monitor
- Fix `simulation_service ↔ dependencies` circular import (broke PyCharm's unittest loader)
- Normalize `"Unknown"`/`"N/A"` sentinel to `None` in `SlurmJob.from_sacct_formatted_output`; previously crashed the `job_monitor` poll loop on every PENDING job with `ValueError: Invalid isoformat string: 'Unknown'`
Migrate HPC Backend to svc_compose and /projects/CRBM Storage
- `mountPath`, PV `nfs.path`, and SLURM submit user updated across `rke` and `local` overlays
- `nfs.path` narrowed from `/projects` to `/projects/CRBM` to match container mountPath (avoids path-doubling)
- local overlay SLURM partition aligned with prod (`general` → `vcell`); single-node `mantis-039` pin removed
- `shared/ghcr/ssh` sealed secrets re-encrypted for the new user/keypair
- `hpc_utils.format_experiment_path` and `config.internal_mount_dir` default refreshed; stale `cfs07:/ifs/vcell` PVC comment removed
Verification
- `ssh svc_compose@hamantis` with the new key works; svc_compose has rwx on `/projects/CRBM` (via `crbm` group) and `/projects/CRBM/compose_api` (via `domain users` group)
- `kubectl`-deployable manifests (no schema changes; sealed secrets validated by re-sealing with `kubeseal` against the cluster's pubkey)
- `--nodelist=<19 nodes>` without `--nodes=1` in `simulation_service.py:97` causes `PartitionNodeLimit` when `SLURM_NODE_LIST` is multi-node, and `singularity_build_*` jobs fail in 00:00:00 (needs separate investigation).
Test plan
- `kubectl get sealedsecret -n compose-api-rke` decrypts the new payloads
- Deploy `compose-api-rke` overlay to a staging cluster; verify pod can read/write `/projects/CRBM/compose_api/prod/...`
- `/projects/CRBM/compose_api/prod/slurm_sbatch/` and htclogs appear in `htclogs/`
- `job_monitor` poll loop no longer logs `Invalid isoformat string: 'Unknown'` for PENDING jobs
- Existing files under `/projects/CRBM/compose_api` (still owned by `crbmapi`) remain readable; new files written by svc_compose are owned correctly
Follow-ups (not in this PR)
- `scancel -u svc_compose`
- Update `simulation_service.py:97` to emit `--nodes=1` alongside `--nodelist=` (so a multi-node list means "any one of these," not "all of them")
- Investigate why `singularity_build_*` jobs fail instantly under svc_compose
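The `--nodes=1` follow-up could look roughly like the sketch below. The function `nodelist_args` and the node names are hypothetical, and the code at `simulation_service.py:97` is not shown in this PR. This only builds the argument strings the follow-up describes; how Slurm schedules a job given both flags should be confirmed on the target cluster.

```python
from typing import List


def nodelist_args(node_list: List[str]) -> List[str]:
    """Build sbatch arguments for a node list, pairing it with --nodes=1.

    Returns an empty list when no nodes are configured, so the partition
    default applies.
    """
    if not node_list:
        return []
    return [f"--nodelist={','.join(node_list)}", "--nodes=1"]


# Hypothetical node names, for illustration only.
args = nodelist_args(["node-a", "node-b"])
```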