We run bazel-remote cache as a gRPC server and have observed that it leaks memory.
go_goroutines increases without bound, in proportion to the number of requests the cache serves. At our current request rate, the cache crashes with an OOM roughly every 2 days.
Configuration
Bazel Remote Version: 2.1.1
Disk: 1 TB
Memory: 300 GB
Observed OOM Crash
[Aug18 20:41] bazel-remote invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ +0.003323] oom_kill_process+0x223/0x410
[ +0.006224] Out of memory: Kill process 48578 (bazel-remote) score 963 or sacrifice child
[ +0.012113] Killed process 48578 (bazel-remote) total-vm:391175744kB, anon-rss:388574312kB, file-rss:0kB, shmem-rss:0kB
Reproduction Steps
1. Start a local gRPC bazel-remote cache instance.
2. Clean, then rebuild through the cache so that artifacts are fetched over gRPC:
   bazel clean
   bazel build //<same target>/... --remote_cache=grpc://localhost:9092 --remote_upload_local_results
3. Check the goroutine count:
   curl localhost:8080/metrics | grep gorou
4. Repeat steps 2-3 and observe that the number of goroutines grows without bound.
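The goroutine count can also be tracked programmatically instead of eyeballing curl output. This is a hypothetical helper, not part of bazel-remote: it parses the Prometheus text format that the /metrics endpoint above returns and extracts the go_goroutines gauge (in practice the input would come from an HTTP GET against localhost:8080/metrics).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGoroutines extracts the go_goroutines gauge value from
// Prometheus text-format metrics output.
func parseGoroutines(metrics string) (float64, error) {
	for _, line := range strings.Split(metrics, "\n") {
		if strings.HasPrefix(line, "go_goroutines ") {
			return strconv.ParseFloat(strings.Fields(line)[1], 64)
		}
	}
	return 0, fmt.Errorf("go_goroutines not found in metrics output")
}

func main() {
	// In practice this string would come from
	// http.Get("http://localhost:8080/metrics").
	sample := "# HELP go_goroutines Number of goroutines that currently exist.\n" +
		"# TYPE go_goroutines gauge\n" +
		"go_goroutines 4242\n"
	n, err := parseGoroutines(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 4242
}
```

Polling this value between builds makes the monotonic growth easy to graph.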
This issue does not reproduce when artifacts are fetched from the remote cache server over HTTP.
Go Routine Profile
Individual goroutines appear to be blocked on writes.
goroutine 152 [select, 7 minutes]:
io.(*pipe).Write(0xc00009c720, 0xc00084a000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
GOROOT/src/io/pipe.go:94 +0x1c5
io.(*PipeWriter).Write(...)
GOROOT/src/io/pipe.go:163
github.com/buchgr/bazel-remote/server.(*grpcServer).Write.func2(0x1a7df90, 0xc0007a0980, 0xc0001141ba, 0xc0000a9e90, 0xc00009c7e0, 0xc0004ca120, 0xc00009c840, 0xc00009c780, 0xc0003080a8, 0xc0003080b0)
server/grpc_bytestream.go:474 +0x36b
created by github.com/buchgr/bazel-remote/server.(*grpcServer).Write
server/grpc_bytestream.go:384 +0x2ea
goroutine 250 [select, 42 minutes]:
io.(*pipe).Write(0xc000484180, 0xc000230000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
GOROOT/src/io/pipe.go:94 +0x1c5
io.(*PipeWriter).Write(...)
GOROOT/src/io/pipe.go:163
github.com/buchgr/bazel-remote/server.(*grpcServer).Write.func2(0x1a7df90, 0xc00040c740, 0xc0002b843a, 0xc000529530, 0xc000484240, 0xc000218990, 0xc0004842a0, 0xc0004841e0, 0xc00000e298, 0xc00000e2a8)
server/grpc_bytestream.go:474 +0x36b
created by github.com/buchgr/bazel-remote/server.(*grpcServer).Write
server/grpc_bytestream.go:384 +0x2ea
One Potential Culprit
https://github.com/buchgr/bazel-remote/blob/master/server/grpc_bytestream.go#L448
go func() {
	err := s.cache.Put(cache.CAS, hash, size, rc)
	putResult <- err
}()
The nested goroutine sending to putResult never returns. On subsequent iterations the code also tries to push to recvResult, but no one is reading from it, which may leave writers blocked indefinitely.