[Bug] The operator continuously reconciles the failed job without detecting the permanent failure #232

@yalzhang

Description

When an ApprovedImage is applied with an image that causes compute-pcrs to fail permanently (e.g., a corrupted PE header), the operator keeps reconciling the failed job without detecting that the failure is permanent. This wastes resources and leaves a misleading status on the ApprovedImage.

To reproduce:

  1. Apply an ApprovedImage CR that causes the compute-pcrs job to fail
  2. Watch the status of the job, and check the logs of the operator and the compute-pcrs pods

Current behavior:

  1. Exactly 7 compute-pcrs-* pods exist (1 initial + 6 retries with backoffLimit: 6). The newest pod is 10m old and no new pods are being created, so Kubernetes has correctly stopped retrying ✓
$ oc get jobs 
NAME                                                              STATUS   COMPLETIONS   DURATION   AGE
compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo   Failed   0/1           67m        67m
$ oc get job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo -o yaml
......
status:
  conditions:
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  ready: 0
  startTime: "2026-04-02T05:46:57Z"
  terminating: 0
  uncountedTerminatedPods: {}
  2. The operator continues to reconcile the failed job every 300s, which is unexpected
$ oc get approvedimage  image-latest -o yaml
......
status:
  conditions:
  - lastTransitionTime: "2026-04-02T05:46:58Z"
    message: Computation is ongoing. Check jobs for progress.
    observedGeneration: 1
    reason: Computing  <------------------------------------------- still shows Computing
    status: "False"
    type: Committed

$ oc logs -f confidential-cluster-operator-6c7f547f8-km8p5 --timestamps
...
2026-04-02T06:58:34.627721122Z [INFO  kube_runtime::controller] reconciling object; object.ref=Job.v1.batch/compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo.confidential-clusters object.reason=reconciler requested retry
2026-04-02T06:58:34.627721122Z [INFO  operator::reference_values] Job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo changed, but had not completed
2026-04-02T06:58:34.627803038Z [INFO  operator] reconciled (ObjectRef { dyntype: (), name: "compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo", namespace: Some("confidential-clusters"), extra: Extra { resource_version: Some("25925947"), uid: Some("494ef462-9837-4002-b3d4-60d783447c25") } }, Action { requeue_after: Some(300s) })

Expected result:

  • Detect Permanent Failures
  • After backoffLimit (6) retries, detect that the job has permanently failed
  • Update ApprovedImage status to reason: Failed
  • Stop retrying the computation
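
The expected behavior above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the operator's actual code: `JobCondition` is a plain struct standing in for the `type` and `status` fields of a Job condition, and `JobOutcome`, `classify`, `committed_reason`, and `requeue_after` are hypothetical names. The 300s interval and the `Computing` and `Failed` reasons are taken from this report; the reason used on success is not shown here, so the sketch leaves it to the existing success path.

```rust
use std::time::Duration;

/// Stand-in for the `type` and `status` fields of a Job condition
/// (hypothetical struct, used so the sketch has no dependencies).
#[derive(Debug, Clone)]
pub struct JobCondition {
    pub type_: String,
    pub status: String,
}

/// Coarse outcome of a compute-pcrs Job, derived only from its conditions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobOutcome {
    Succeeded,
    Failed,
    Running,
}

/// Classify a Job by its terminal conditions: a `Failed` or `Complete`
/// condition with status "True". Anything else is treated as still running.
pub fn classify(conditions: &[JobCondition]) -> JobOutcome {
    let is_true = |wanted: &str| {
        conditions
            .iter()
            .any(|c| c.type_ == wanted && c.status == "True")
    };
    if is_true("Failed") {
        JobOutcome::Failed
    } else if is_true("Complete") {
        JobOutcome::Succeeded
    } else {
        JobOutcome::Running
    }
}

/// Hypothetical mapping to the `reason` of the ApprovedImage `Committed`
/// condition. `None` means "handled by the existing success path".
pub fn committed_reason(outcome: JobOutcome) -> Option<&'static str> {
    match outcome {
        JobOutcome::Running => Some("Computing"),
        JobOutcome::Failed => Some("Failed"),
        JobOutcome::Succeeded => None,
    }
}

/// Requeue only while the Job can still make progress. A terminal outcome
/// returns `None`, i.e. the reconciler should stop scheduling retries.
pub fn requeue_after(outcome: JobOutcome) -> Option<Duration> {
    match outcome {
        JobOutcome::Running => Some(Duration::from_secs(300)),
        JobOutcome::Succeeded | JobOutcome::Failed => None,
    }
}
```

With the conditions from the Job status in this report (`FailureTarget` and `Failed`, both with status "True"), `classify` returns `JobOutcome::Failed`, so the status reason would become `Failed` and no further requeue would be requested.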
