[BUG]: Long-running jobs lose authentication and fail at finalization #5520
Open
Description
What happened?
We are running long GPU workloads (~4+ days) on a self-hosted Azure Pipelines agent. The job itself continues executing successfully (Docker container keeps running), but after ~24 hours the pipeline loses authentication and eventually fails with:
We stopped hearing from agent
Observed Behavior
- The agent service remains running for the entire duration
- The VM is healthy (no reboot, no resource exhaustion)
- The Docker container continues running successfully
- After ~24 hours:
  - System.AccessToken becomes invalid/expired
  - Azure DevOps API calls return:
    - 401 Unauthorized
    - or 203 + login page (Anonymous)

At job finalization:
- log upload fails
- timeline updates fail
- job fails with agent communication error
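
To pin down when the token stops working, a small probe can be run periodically from inside the job. The following is a minimal sketch, not part of the agent: `classify_auth` and `probe_token` are hypothetical helper names, the choice of the Projects list endpoint is an assumption (any authenticated REST call should behave similarly), and it assumes `System.AccessToken` has been mapped into the step's environment as `SYSTEM_ACCESSTOKEN`.

```python
import os
import urllib.error
import urllib.request


def classify_auth(status: int) -> str:
    """Map an HTTP status code to the symptoms described above."""
    if status == 401:
        return "expired"    # 401 Unauthorized
    if status == 203:
        return "anonymous"  # 203 + login page, request treated as anonymous
    if 200 <= status < 300:
        return "ok"
    return "unknown"


def probe_token(collection_uri: str, token: str, timeout: float = 10.0) -> str:
    """Call one authenticated endpoint and classify the result.

    The Projects list endpoint is an assumed choice for the probe.
    """
    url = collection_uri.rstrip("/") + "/_apis/projects?api-version=7.0"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_auth(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is on the exception.
        return classify_auth(err.code)


if __name__ == "__main__":
    # SYSTEM_COLLECTIONURI is set by the agent; SYSTEM_ACCESSTOKEN is only
    # present if the pipeline step maps $(System.AccessToken) into env.
    print(probe_token(os.environ["SYSTEM_COLLECTIONURI"],
                      os.environ["SYSTEM_ACCESSTOKEN"]))
```

Logging the result with a timestamp every few minutes would show whether the switch from `ok` to `expired` or `anonymous` happens at a fixed offset from job start.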
Versions
Agent version 4.270.0
Environment type (Please select at least one environment where you face this issue)
- Self-Hosted
- Microsoft Hosted
- VMSS Pool
- Container
Azure DevOps Server type
dev.azure.com (formerly visualstudio.com)
Azure DevOps Server Version (if applicable)
No response
Operating system
No response
Version control system
No response
Relevant log output
Authentication failed with status code 401
POST .../_apis/distributedtask/.../logs is not authorized. VS30063
PATCH .../_apis/distributedtask/.../timelines/... is not authorized. VS30063
Fail to update timeline records with output variables.
Throw exception to fail the job since output variables are critical to downstream jobs
Pipeline
##[error]We stopped hearing from agent ******.
Verify the agent machine is running and has a healthy network connection.