[BUG]: Long-running jobs lose authentication and fail at finalization #5520

@serhii-kuzniechykov

What happened?

We run long GPU workloads (4+ days) on a self-hosted Azure Pipelines agent. The job itself keeps executing successfully (the Docker container keeps running), but after ~24 hours the pipeline loses authentication and eventually fails with:

We stopped hearing from agent

Observed Behavior

  • The agent service remains running for the entire duration
  • The VM is healthy (no reboot, no resource exhaustion)
  • The Docker container continues running successfully
  • After ~24 hours:
      • System.AccessToken becomes invalid/expired
      • Azure DevOps API calls return:
          • 401 Unauthorized
          • or 203 + login page (Anonymous)
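
For anyone trying to reproduce this, the two failure signatures listed above can be told apart with a small check. This is only a sketch: the function name `classify_devops_response` and its return labels are hypothetical, and it assumes the "login page" arrives with an HTML `Content-Type`, which is how we recognized it but is not something the agent documents.

```python
from typing import Optional


def classify_devops_response(status: int, content_type: Optional[str] = None) -> str:
    """Classify an Azure DevOps API response by the symptoms seen in this issue.

    Hypothetical helper, not part of the agent. Returns one of:
      "unauthorized" - HTTP 401
      "anonymous"    - HTTP 203 with an HTML body (assumed to be the login page)
      "ok"           - any other 2xx response
      "other"        - everything else
    """
    ctype = (content_type or "").lower()
    if status == 401:
        return "unauthorized"
    # Assumption: the sign-in page is served as text/html with status 203.
    if status == 203 and "text/html" in ctype:
        return "anonymous"
    if 200 <= status < 300:
        return "ok"
    return "other"
```

A periodic probe inside the job could pass the status code and `Content-Type` header of any request made with System.AccessToken to this function and record the first time the result is no longer "ok", which would pin down when the token stops being accepted.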

At job finalization:

  • log upload fails
  • timeline updates fail
  • job fails with agent communication error

Versions

Agent version 4.270.0

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

No response

Version control system

No response

Relevant log output

Authentication failed with status code 401
POST .../_apis/distributedtask/.../logs is not authorized. VS30063

PATCH .../_apis/distributedtask/.../timelines/... is not authorized. VS30063

Fail to update timeline records with output variables.
Throw exception to fail the job since output variables are critical to downstream jobs

Pipeline
##[error]We stopped hearing from agent ******.
Verify the agent machine is running and has a healthy network connection.
