
commands.connect to a resumed sandbox stops delivering stdout after a large response #1352

@nseniak

Description


Hi E2B team — thanks for the platform, we really enjoy building on it! We run long-lived stdio MCP servers inside sandboxes and lean heavily on on_timeout=pause / auto_resume, and we think we've hit a streaming-reconnect bug. We put together a tiny repro to make it easy to poke at. 🙏

TL;DR: after reattaching to a backgrounded process with commands.connect(pid) on a resumed sandbox, on_stdout stops firing for output that follows a large (~8KB) response — even though the process keeps writing and send_stdin keeps working. A stream that only emits small responses is totally fine; a single ~8KB response is enough to wedge it.

Environment

  • e2b Python SDK 2.20.2 (also present on 2.23.1)
  • Python 3.12, public base template, region eu-west-1
  • Sandboxes created with lifecycle={"on_timeout": "pause", "auto_resume": True}
  • Background command via commands.run(cmd, background=True, stdin=True, timeout=0), reattached with commands.connect(pid, on_stdout=…, timeout=0)

Repro (SDK-only, public base template)

```python
import asyncio
from e2b import AsyncSandbox

# Reads a line; "BIG" -> ~8KB response, anything else -> echo it back.
PROC = ("python3 -u -c \"import sys\n"
        "for line in sys.stdin:\n"
        "    sys.stdout.write(('X'*8000 if line.startswith('BIG') else line.strip())+'\\n')\n"
        "    sys.stdout.flush()\"")

CYCLES = 12

async def main():
    sbx = await AsyncSandbox.create(
        template="base", timeout=20,
        lifecycle={"on_timeout": "pause", "auto_resume": True},
    )
    got: list[str] = []  # one entry per on_stdout callback
    stalled = 0
    try:
        proc = await sbx.commands.run(PROC, background=True, stdin=True, timeout=0,
                                      on_stdout=lambda s: got.append(s))
        pid = proc.pid
        handle = proc
        for c in range(1, CYCLES + 1):
            await handle.disconnect()
            await sbx.pause()                 # force a pause
            await asyncio.sleep(1)
            # Reattach to the still-running process; auto_resume wakes the sandbox.
            handle = await sbx.commands.connect(pid, timeout=0,
                                                on_stdout=lambda s: got.append(s))
            before = len(got)
            await sbx.commands.send_stdin(pid, "BIG\n")            # large response
            for i in range(3):
                await sbx.commands.send_stdin(pid, f"s{c}-{i}\n")  # small responses
            await asyncio.sleep(3)
            delivered = len(got) - before
            if delivered < 4:
                stalled += 1
            print(f"cycle {c}: sent 4 (1 big + 3 small), got {delivered}"
                  f"{'  <-- STALL' if delivered < 4 else ''}")
        print(f"DONE: {stalled}/{CYCLES} stalled")
    finally:
        await sbx.kill()  # clean up the sandbox even if a cycle raises

asyncio.run(main())
```
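One caveat when reading the counts: as far as we can tell, the SDK does not promise one on_stdout call per line, so the ~8KB line may arrive split across several callbacks and several small lines may arrive coalesced into one. That makes len(got) an approximate measure. A minimal sketch of a chunk-to-line reassembler that could be passed as on_stdout=counter.feed (LineCounter is our own hypothetical helper, not part of the e2b SDK):

```python
class LineCounter:
    """Hypothetical helper (not part of e2b): rebuilds complete lines from
    arbitrarily split on_stdout chunks so stalls can be counted per line."""

    def __init__(self) -> None:
        self._partial = ""          # text received after the last "\n"
        self.lines: list[str] = []  # every complete line seen so far

    def feed(self, chunk: str) -> None:
        # Join the leftover partial line with the new chunk, then keep the
        # final (possibly empty) piece as the new partial line.
        *complete, self._partial = (self._partial + chunk).split("\n")
        self.lines.extend(complete)

    def __len__(self) -> int:
        return len(self.lines)


# Simulated delivery: one 8000-char line split in two, plus two small lines
# that arrive coalesced into a single chunk.
counter = LineCounter()
for chunk in ["X" * 4000, "X" * 4000 + "\ns1", "-0\ns1-1\n"]:
    counter.feed(chunk)
print(len(counter))  # 3
```

In the repro, before = len(counter) and delivered = len(counter) - before would then count lines rather than callbacks.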

Observed (SDK 2.20.2, base)

```
cycle 1: sent 4 (1 big + 3 small), got 4
cycle 2: sent 4 (1 big + 3 small), got 4
cycle 3: sent 4 (1 big + 3 small), got 4
cycle 4: sent 4 (1 big + 3 small), got 1   <-- STALL  (only the big came back)
cycle 5: sent 4 (1 big + 3 small), got 0   <-- STALL  (stream stays wedged)
...
DONE: 9/12 stalled
```

The all-small control (replacing "BIG\n" with another small line) gives 0/12 stalled, which is why we suspect the large response is the trigger.
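The only difference between the failing run and the all-small control is the first stdin line of each cycle. A small sketch that builds both input sets and predicts what the PROC script should print for them (mirroring its BIG/echo rule), so received lines can be compared by content rather than by count. The function names and the "s{c}-ctl" control line are hypothetical, chosen for illustration:

```python
def cycle_inputs(c: int, big: bool) -> list[str]:
    """stdin lines for cycle c: the trigger line followed by 3 small lines."""
    first = "BIG\n" if big else f"s{c}-ctl\n"  # hypothetical control line
    return [first] + [f"s{c}-{i}\n" for i in range(3)]


def expected_output(inputs: list[str]) -> list[str]:
    """Lines the PROC echo script should write, without trailing newlines."""
    return ["X" * 8000 if line.startswith("BIG") else line.strip() for line in inputs]


def missing(expected: list[str], received: list[str]) -> list[str]:
    """Expected lines that never showed up on stdout (order-insensitive)."""
    remaining = list(received)
    out = []
    for line in expected:
        if line in remaining:
            remaining.remove(line)
        else:
            out.append(line)
    return out


big_run = expected_output(cycle_inputs(4, big=True))
print(len(big_run[0]), big_run[1:])  # 8000 ['s4-0', 's4-1', 's4-2']
# Mirrors the stalled cycle 4 above: only the big line came back.
print(missing(big_run, ["X" * 8000]))  # ['s4-0', 's4-1', 's4-2']
```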

What we'd expect

After commands.connect to a running pid on a resumed sandbox, on_stdout should keep delivering all subsequent process output for the life of the connection, regardless of an earlier large response.

A couple of related things we noticed

  • Sometimes commands.connect itself hangs ~60s and then fails with TimeoutException: The sandbox is running but port is not open; request_timeout doesn't seem to bound it (feels related to #1128).
  • A fresh commands.run (new process) on the same resumed sandbox always streams fine — so it looks specific to reconnecting to the existing pid.

Our workaround (for context)

We bound connection establishment with an external asyncio.wait_for, and when a reattached session goes quiet we drop it and start a fresh process with commands.run. It works, but we lose the warm process on every flaky reattach.

Possibly related: #1128, #857, #1031. Happy to share raw event traces or test a patch — thanks so much for taking a look!
