bytestream write can leak goroutines if disk.Put doesn't drain the io.Reader #473

@BoyangTian-Robinhood

Description

We run bazel-remote cache as a gRPC server and have observed it leaks memory.

 
go_goroutines increases in an unbounded fashion in proportion to the number of requests the cache serves. At our current request scale, the cache crashes with an OOM every 2 days.
 
Configuration
 
Bazel Remote Version: 2.1.1
Disk: 1 TB
Memory: 300 GB
 
Observed OOM Crash
 
[Aug18 20:41] bazel-remote invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ +0.003323] oom_kill_process+0x223/0x410
[ +0.006224] Out of memory: Kill process 48578 (bazel-remote) score 963 or sacrifice child
[ +0.012113] Killed process 48578 (bazel-remote) total-vm:391175744kB, anon-rss:388574312kB, file-rss:0kB, shmem-rss:0kB
 
Reproduction Steps
 

  1. Start a local gRPC bazel-remote cache instance.
  2. Clean and issue gRPC calls to the cache to fetch artifacts:

bazel clean
bazel build //<same target>/... --remote_cache=grpc://localhost:9092 --remote_upload_local_results

  3. curl localhost:8080/metrics | grep gorou
  4. Repeat steps 2-3 and observe that the number of goroutines grows in an unbounded fashion.

This issue does not reproduce when artifacts are fetched from the remote cache server over HTTP.

Goroutine Profile

Writes appear to be blocked in individual goroutines.
     
goroutine 152 [select, 7 minutes]:
io.(*pipe).Write(0xc00009c720, 0xc00084a000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
GOROOT/src/io/pipe.go:94 +0x1c5
io.(*PipeWriter).Write(...)
GOROOT/src/io/pipe.go:163
github.com/buchgr/bazel-remote/server.(*grpcServer).Write.func2(0x1a7df90, 0xc0007a0980, 0xc0001141ba, 0xc0000a9e90, 0xc00009c7e0, 0xc0004ca120, 0xc00009c840, 0xc00009c780, 0xc0003080a8, 0xc0003080b0)
server/grpc_bytestream.go:474 +0x36b
created by github.com/buchgr/bazel-remote/server.(*grpcServer).Write
server/grpc_bytestream.go:384 +0x2ea
 
goroutine 250 [select, 42 minutes]:
io.(*pipe).Write(0xc000484180, 0xc000230000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
GOROOT/src/io/pipe.go:94 +0x1c5
io.(*PipeWriter).Write(...)
GOROOT/src/io/pipe.go:163
github.com/buchgr/bazel-remote/server.(*grpcServer).Write.func2(0x1a7df90, 0xc00040c740, 0xc0002b843a, 0xc000529530, 0xc000484240, 0xc000218990, 0xc0004842a0, 0xc0004841e0, 0xc00000e298, 0xc00000e2a8)
server/grpc_bytestream.go:474 +0x36b
created by github.com/buchgr/bazel-remote/server.(*grpcServer).Write
server/grpc_bytestream.go:384 +0x2ea
 

 
One Potential Culprit
 
https://github.com/buchgr/bazel-remote/blob/master/server/grpc_bytestream.go#L448
 

go func() {
	err := s.cache.Put(cache.CAS, hash, size, rc)
	putResult <- err
}()

 
The send on putResult happens in a nested goroutine and may never complete. On subsequent iterations the code tries to push to recvResult, but no one is reading from it, which can leave the writing goroutine blocked forever.
