Void model - pass 1 & 2 (CORE-38) #13403
Conversation
I've removed the rp dependency for computation of optical flow.
📝 Walkthrough: Adds CogVideoX support.
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_void.py`:
- Around line 155-162: The quadmask resizing currently uses
comfy.utils.common_upscale with "bilinear", which creates interpolated values
and destroys the discrete 4-class mask; change the upscale call in the quadmask
flow (variable qm) to use nearest-neighbor sampling ("nearest-exact") or, if you
must keep bilinear, immediately re-quantize qm back to the original 4 discrete
classes before it is used to build masked_video or encoded into the mask latent;
update the comfy.utils.common_upscale invocation that moves dims on qm (and any
subsequent qm processing) to apply the nearest-exact mode or a quantization step
to restore class indices.
- Around line 295-296: The current use of warped_tensor.repeat(batch_size, 1, 1,
1, 1) duplicates the identical warped-noise across the batch (in function/method
where warped_tensor and batch_size are used), producing identical outputs per
item; change this by either (A) generating distinct per-batch warped
tensors/noise before repeating so each batch index has unique noise (replace the
single warped_tensor with a batch-sized tensor of independent warps), or (B) if
per-batch noise is not implemented yet, cap the effective batch replication to 1
(e.g., use effective_batch = min(batch_size, 1) and only repeat when
effective_batch > 1) and remove or disable the batch_size input until unique
per-batch noise is implemented; update any references to warped_tensor.repeat
accordingly.
- Around line 21-38: The helper _valid_void_length currently maps any input <5
up to 5, breaking the “round down” contract; change it to explicitly reject
unsupported very-short clips by validating the input against a minimum allowed
frame count (e.g. 5) using the same constants TEMPORAL_COMPRESSION and
PATCH_SIZE_T, and raise a clear exception (ValueError) when length < minimum;
also update the other affected spots (the similar logic around the blocks
referenced at lines 119-123 and 228-230) or the input schemas so they enforce
the same minimum rather than allowing silent expansion.
- Around line 219-239: VOIDWarpedNoise currently produces non-reproducible noise
because no seed is threaded through; update the node to accept a seed input in
define_schema and propagate that seed into execute so it is passed to
get_noise_from_video (in comfy_extras/void_noise_warp.py) instead of using
hard-coded seed=0, and modify Noise_FromLatent usage to initialize/clone with
the provided seed rather than 0; ensure all other occurrences mentioned (around
lines 270-277 and 302-309) similarly accept and forward the seed parameter to
maintain deterministic sampling across Pass 2.
In `@comfy_extras/void_noise_warp.py`:
- Around line 355-358: The code currently caches a RAFT model instance already
moved to device (see raft_large and self.model in _get_raft_model), which pins
CUDA memory; instead ensure the cached object is CPU-resident or cache only a
CPU state (e.g., state_dict) and perform the .to(device) only when
returning/using the model. Update _get_raft_model so the cached model is kept on
CPU (or store its state_dict) and any call that needs the model moves it to the
requested device (and moves it back or discards the device copy after use) to
avoid permanently pinning VRAM; apply the same change to the other cache site
around lines 390-394 that likewise sets model.to(device).
- Around line 452-456: The code currently moves and converts the entire
video_frames tensor to device and permutes it into frames, causing O(T) VRAM;
instead, keep video_frames on CPU and only transfer/convert the two frames
needed for each RAFT pass (prev and curr). Change logic around the variable
frames/video_frames so that you index two frames at a time from video_frames on
CPU, permute/to(torch.float32)/div255 and move each selected frame to device
before resizing with F.interpolate, use max(1, ...) for new_h/new_w as before,
and release or detach the device tensors after the RAFT step; apply the same
per-frame transfer/resizing pattern for the other block referenced (lines around
489-495).
- Around line 363-365: Validate input height/width before the resize/downscale
path: ensure h and w (and the computed new_h = (h // 8) * 8 and new_w = (w // 8)
* 8) are not zero and clamp them to a safe minimum (e.g., at least 1, or to 8 if
you require multiples of 8) before calling _torch_resize_chw; if clamping would
change aspect or is invalid, raise a clear ValueError describing the tiny input
so callers can handle it. Apply the same guard/clamp logic in the other
resize/downscale blocks (the regions around the downscale() usage and the blocks
mentioned at lines 461-478 and 480-485) so RAFT preprocessing that floors a
dimension to 0 cannot lead to a zero-sized tensor being passed into
_torch_resize_chw or downscale().
- Line 365: After upsampling the flow field spatially with _torch_resize_chw
(the same call used for image at "image = _torch_resize_chw(image, (new_h,
new_w), ...)" ), rescale the flow vectors so their magnitudes match the original
resolution by multiplying the x (horizontal) channel by (orig_width / new_w) and
the y (vertical) channel by (orig_height / new_h); apply this correction right
after the resize that returns flow to source size (and repeat the same fix in
the other block around lines 381-384) while preserving tensor dtype/device.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa0363b5-b5b4-4c26-bb4a-57fbe9c35a7c
📒 Files selected for processing (7)
comfy/latent_formats.py, comfy/sd.py, comfy/supported_models.py, comfy/text_encoders/cogvideo.py, comfy_extras/nodes_void.py, comfy_extras/void_noise_warp.py, nodes.py
```python
def _valid_void_length(length: int) -> int:
    """Round ``length`` down to a value that produces an even latent_t.

    VOID / CogVideoX-Fun-V1.5 uses patch_size_t=2, so the VAE-encoded latent
    must have an even temporal dimension. If latent_t is odd, the transformer
    pad_to_patch_size circular-wraps an extra latent frame onto the end; after
    the post-transformer crop the last real latent frame has been influenced
    by the wrapped phantom frame, producing visible jitter and "disappearing"
    subjects near the end of the decoded video. Rounding down fixes this.
    """
    latent_t = ((length - 1) // TEMPORAL_COMPRESSION) + 1
    if latent_t % PATCH_SIZE_T == 0:
        return length
    # Round latent_t down to the nearest multiple of PATCH_SIZE_T, then invert
    # the ((length - 1) // TEMPORAL_COMPRESSION) + 1 formula. Floor at 1 frame
    # so we never return a non-positive length.
    target_latent_t = max(PATCH_SIZE_T, (latent_t // PATCH_SIZE_T) * PATCH_SIZE_T)
    return (target_latent_t - 1) * TEMPORAL_COMPRESSION + 1
```
Lengths 1..4 are silently expanded to five frames.
_valid_void_length() returns 5 for any input below 5, so the UI contract to “round down” is false for the lowest allowed values. The downstream padding/resampling paths then synthesize extra timesteps from repeated last-frame data instead of honoring the requested clip length. Please either reject <5 in both schemas or make the helper fail explicitly for unsupported short clips.
Also applies to: 119-123, 228-230
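The rounding logic is easy to exercise standalone. Here is a sketch assuming the usual CogVideoX constants (TEMPORAL_COMPRESSION = 4, PATCH_SIZE_T = 2 — the real values live in nodes_void.py); it reproduces both the documented 49 → 45 case and the silent 1..4 → 5 expansion flagged above:

```python
TEMPORAL_COMPRESSION = 4  # assumed CogVideoX temporal VAE compression
PATCH_SIZE_T = 2          # transformer temporal patch size

def valid_void_length(length: int) -> int:
    """Mirror of _valid_void_length: round down so latent_t is even."""
    latent_t = ((length - 1) // TEMPORAL_COMPRESSION) + 1
    if latent_t % PATCH_SIZE_T == 0:
        return length
    target_latent_t = max(PATCH_SIZE_T, (latent_t // PATCH_SIZE_T) * PATCH_SIZE_T)
    return (target_latent_t - 1) * TEMPORAL_COMPRESSION + 1

print(valid_void_length(49))  # 45: rounds down as documented
print(valid_void_length(3))   # 5: "round down" actually expands short clips
```

The `max(PATCH_SIZE_T, ...)` floor is exactly where "round down" becomes "round up" for lengths 1..4, since their latent_t of 1 gets lifted to 2.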
```python
qm = quadmask[:length]
if qm.ndim == 3:
    qm = qm.unsqueeze(-1)
qm = comfy.utils.common_upscale(
    qm.movedim(-1, 1), width, height, "bilinear", "center"
).movedim(1, -1)
if qm.ndim == 4 and qm.shape[-1] == 1:
    qm = qm.squeeze(-1)
```
Preserve the quadmask’s discrete classes when resizing.
This is a 4-level semantic mask, but resizing it with "bilinear" invents intermediate values before you build masked_video and encode the mask latent. That blurs the class boundaries the preprocess node just quantized. Use "nearest-exact" here, or re-quantize after the resize.
Suggested fix:

```diff
 qm = comfy.utils.common_upscale(
-    qm.movedim(-1, 1), width, height, "bilinear", "center"
+    qm.movedim(-1, 1), width, height, "nearest-exact", "center"
 ).movedim(1, -1)
```
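To see why the interpolation mode matters for a label mask, here is a self-contained 1-D sketch (plain Python standing in for common_upscale's behavior, which is an assumption, not its actual implementation): linear blending invents values between classes, while nearest sampling only copies existing ones.

```python
def upscale_linear(row, new_len):
    # Linear interpolation: blends neighboring class indices into new values.
    out = []
    for i in range(new_len):
        pos = i * (len(row) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(row) - 1)
        frac = pos - lo
        out.append(row[lo] * (1 - frac) + row[hi] * frac)
    return out

def upscale_nearest(row, new_len):
    # Nearest sampling: every output value is copied from an input class.
    return [row[round(i * (len(row) - 1) / (new_len - 1))] for i in range(new_len)]

mask = [0, 1, 2, 3]  # four discrete mask classes
print(upscale_linear(mask, 7))   # fractional values like 0.5 appear
print(upscale_nearest(mask, 7))  # only values from {0, 1, 2, 3}
```

The fractional values are exactly the "invented intermediate values" the review warns about: downstream code that branches on class index sees labels that belong to no class.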
```python
@classmethod
def define_schema(cls):
    return io.Schema(
        node_id="VOIDWarpedNoise",
        category="latent/video",
        inputs=[
            io.Image.Input("video", tooltip="Pass 1 output video frames [T, H, W, 3]"),
            io.Int.Input("width", default=672, min=16, max=nodes.MAX_RESOLUTION, step=8),
            io.Int.Input("height", default=384, min=16, max=nodes.MAX_RESOLUTION, step=8),
            io.Int.Input("length", default=45, min=1, max=nodes.MAX_RESOLUTION, step=1,
                         tooltip="Number of pixel frames. Rounded down to make latent_t "
                                 "even (patch_size_t=2 requirement), e.g. 49 → 45."),
            io.Int.Input("batch_size", default=1, min=1, max=64),
        ],
        outputs=[
            io.Latent.Output(display_name="warped_noise"),
        ],
    )

@classmethod
def execute(cls, video, width, height, length, batch_size) -> io.NodeOutput:
```
Thread a seed through warped-noise generation.
VOIDWarpedNoise currently bakes stochastic noise without any seed input: comfy_extras/void_noise_warp.py:401-425 shows that get_noise_from_video(...) has no seed parameter, and Noise_FromLatent hard-codes seed = 0 while just cloning the precomputed tensor. That means a fixed sampler seed cannot reproduce Pass 2 results once this node is in the graph.
Also applies to: 270-277, 302-309
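The requested change is the usual seed-threading pattern: accept a seed at the node boundary and route it into every stochastic call instead of constructing zero-seeded generators internally. A plain-Python sketch (warp_noise is a hypothetical stand-in, not the actual get_noise_from_video signature):

```python
import random

def warp_noise(num_frames: int, seed: int) -> list:
    # Hypothetical stand-in for get_noise_from_video: all randomness
    # flows through a generator built from the caller's seed.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_frames)]

# Same seed -> bit-identical noise, so Pass 2 is reproducible.
assert warp_noise(8, seed=42) == warp_noise(8, seed=42)
# Different seeds -> different trajectories.
assert warp_noise(8, seed=42) != warp_noise(8, seed=43)
```

In the real node the seed would be an `io.Int.Input` forwarded through execute, so that ComfyUI's sampler-seed workflow controls are the single source of randomness.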
```python
if batch_size > 1:
    warped_tensor = warped_tensor.repeat(batch_size, 1, 1, 1, 1)
```
batch_size > 1 currently duplicates the same sample.
Repeating the exact same warped-noise tensor across the batch makes every item follow the same deterministic trajectory, so this node’s batch_size knob produces duplicate videos rather than multiple candidates. If distinct per-batch noise is not implemented yet, it would be safer to cap this at 1 or remove the input for now.
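A toy sampler (plain Python, with lists standing in for batched tensors) makes the difference concrete: repeating one draw yields identical batch items, while seeding per batch index yields distinct candidates.

```python
import random

def repeated_batch(batch_size, seed):
    rng = random.Random(seed)
    one = [rng.random() for _ in range(4)]
    return [list(one) for _ in range(batch_size)]  # same noise repeated

def independent_batch(batch_size, seed):
    out = []
    for b in range(batch_size):
        rng = random.Random(seed + b)  # distinct stream per batch index
        out.append([rng.random() for _ in range(4)])
    return out

rep = repeated_batch(3, seed=0)
ind = independent_batch(3, seed=0)
print(rep[0] == rep[1] == rep[2])  # True: every item is the same candidate
print(ind[0] == ind[1])            # False: items explore different noise
```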
```python
new_h = (h // 8) * 8
new_w = (w // 8) * 8
image = _torch_resize_chw(image, (new_h, new_w), "bilinear", copy=False)
```
Validate tiny inputs before the resize/downscale path.
Small or aggressively downscaled clips can still crash here: the RAFT preprocess can floor a dimension to 0, and the output buffer uses floor division while downscale() clamps to at least 1x1. A clear upfront validation would avoid shape mismatches and interpolate errors.
Also applies to: 461-478, 480-485
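The suggested guard can be sketched in isolation; the only assumption is that preprocessing floors both sides to a multiple of 8, so any dimension under 8 collapses to zero without a check.

```python
def preprocess_dims(h: int, w: int) -> tuple:
    """RAFT-style preprocessing: floor both sides to a multiple of 8,
    rejecting inputs that would floor a dimension to 0."""
    if h < 8 or w < 8:
        raise ValueError(f"input too small for 8x preprocessing, got {h}x{w}")
    return (h // 8) * 8, (w // 8) * 8

print(preprocess_dims(384, 672))  # (384, 672): already multiples of 8
try:
    preprocess_dims(5, 640)       # would have floored h to 0
except ValueError as e:
    print(e)
```

Failing loudly here is preferable to letting a zero-sized tensor reach F.interpolate, where the error message no longer points at the tiny input.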
```python
frames = video_frames.to(device).permute(0, 3, 1, 2).to(torch.float32) / 255.0
if resize_frames != 1.0:
    new_h = max(1, int(frames.shape[2] * resize_frames))
    new_w = max(1, int(frames.shape[3] * resize_frames))
    frames = F.interpolate(frames, size=(new_h, new_w), mode="area")
```
Don’t upload the full clip to the execution device up front.
This converts and stores every frame on device, but the loop only ever needs prev and curr. On longer videos that turns a two-frame RAFT pass into O(T) VRAM usage and makes VOID much easier to OOM next to the diffusion model.
Also applies to: 489-495
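The suggested change is a streaming pattern: keep the clip in host memory and hold only the current frame pair on the device. A plain-Python sketch where upload() stands in for .to(device) (names here are illustrative, not the module's API):

```python
def upload(frame):
    # Stand-in for frame.to(device): tag the frame as device-resident.
    return ("device", frame)

def pairwise_flow(video, flow_fn):
    """Run flow_fn over consecutive frame pairs, holding at most two
    'uploaded' frames at a time instead of transferring the whole clip."""
    flows = []
    prev = upload(video[0])
    for i in range(1, len(video)):
        curr = upload(video[i])
        flows.append(flow_fn(prev, curr))
        prev = curr  # old prev becomes unreferenced and can be freed
    return flows

video = list(range(5))  # 5 dummy frames
flows = pairwise_flow(video, lambda a, b: b[1] - a[1])
print(flows)  # [1, 1, 1, 1]
```

Device memory is then O(1) in clip length rather than O(T), which is what keeps RAFT preprocessing from competing with the diffusion model's VRAM budget.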
Actionable comments posted: 1
♻️ Duplicate comments (4)
comfy_extras/void_noise_warp.py (4)
367-374: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate tiny clips before RAFT preprocessing / downscale allocation.

Lines 370-372 can collapse either dimension to 0, and lines 471-474 can also allocate a zero-sized output when the internal noise resolution is smaller than downscale_factor. In both cases the first F.interpolate or assignment will fail instead of raising a clear error.

Suggested direction:

```diff
 def _preprocess(self, image_chw):
     image = image_chw.to(self.device, torch.float32)
     _, h, w = image.shape
+    if h < 8 or w < 8:
+        raise ValueError(
+            f"RaftOpticalFlow: input too small for 8x preprocessing, got {h}x{w}"
+        )
     new_h = (h // 8) * 8
     new_w = (w // 8) * 8
```

Also applies to: 471-482
388-391: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scale the flow vectors after resizing them back to source size.

Line 390 only upsamples the flow field spatially; the x/y displacements still need to be multiplied by the resize ratios. Without that correction, the warp is too weak whenever RAFT output is resized back to (h, w).

Suggested fix:

```diff
 list_of_flows = self.model(img1, img2)
 flow = list_of_flows[-1][0]  # (2, new_h, new_w)
 if flow.shape[-2:] != (h, w):
+    src_h, src_w = flow.shape[-2:]
     flow = _torch_resize_chw(flow, (h, w), "bilinear", copy=False)
+    flow[0] *= w / src_w
+    flow[1] *= h / src_h
 return flow
```
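The correction itself is just a change of coordinates: a displacement of dx pixels on a src_w-wide grid corresponds to dx * (dst_w / src_w) pixels after resizing to dst_w, and likewise for dy. A minimal, RAFT-independent sketch:

```python
def rescale_flow(flow, src_hw, dst_hw):
    """flow: list of (dx, dy) displacements in src-resolution pixels;
    returns the same motion expressed in dst-resolution pixels."""
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    return [(dx * dst_w / src_w, dy * dst_h / src_h) for dx, dy in flow]

# A (4, 2)-pixel shift predicted at 320x576 becomes a larger shift at 384x672.
print(rescale_flow([(4.0, 2.0)], src_hw=(320, 576), dst_hw=(384, 672)))
```

Spatial resizing alone changes where a vector lives but not how long it is, which is exactly the bug: without the multiply, every warp is understated by the resize ratio.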
357-365: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the RAFT weights off the execution device until inference.

Line 362 moves the cached model onto device during construction, so every RaftOpticalFlow instance pins its weights in VRAM for the rest of the process. That defeats ComfyUI's offload behavior and raises memory pressure for later nodes.
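The caching scheme asked for here keeps one canonical CPU-side copy and materializes a device copy only for the duration of a call. A toy sketch (a dict stands in for a state_dict, and the class name is hypothetical; no torch assumed):

```python
class FlowModelCache:
    """Cache weights host-side; materialize a per-call working copy."""
    _cpu_state = None

    @classmethod
    def _load_cpu(cls):
        # Load once, keep resident in host memory only.
        if cls._cpu_state is None:
            cls._cpu_state = {"w": [0.1, 0.2]}  # stand-in for a state_dict
        return cls._cpu_state

    @classmethod
    def run(cls, device: str, frames):
        state = dict(cls._load_cpu())          # "device copy" exists only here
        result = sum(frames) * state["w"][0]   # stand-in for inference
        del state                              # device copy released after use
        return result

print(FlowModelCache.run("cuda:0", [1, 2, 3]))
```

With real weights the `dict(...)` line becomes `model.to(device)` (or loading the cached state_dict into a device model), and the `del` becomes moving the model back to CPU or dropping the device copy so VRAM is reclaimed between invocations.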
450-492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't materialize the whole video on the GPU.

Line 451 transfers all frames to device up front even though the loop only needs two frames at a time. On longer clips that turns VOID into an O(T) VRAM spike and makes it much easier to OOM next to the diffusion model.

Suggested direction:

```diff
-frames = video_frames.to(device).permute(0, 3, 1, 2).to(torch.float32) / 255.0
+frames = video_frames.permute(0, 3, 1, 2)
 ...
-prev = frames[0]
+prev = frames[0].to(device, torch.float32) / 255.0
 for i in range(1, T):
-    curr = frames[i]
+    curr = frames[i].to(device, torch.float32) / 255.0
     flow = raft(prev, curr).to(device)
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/latent_formats.py`:
- Around line 805-809: The docstring in comfy/latent_formats.py incorrectly
states that auto-selection of the CogVideoX models (referenced as
supported_models.CogVideoX_T2V) is based on "transformer hidden dim"; update
that wording to reflect the actual selection condition used in
comfy/supported_models.py (the selector uses num_attention_heads). Edit the
docstring to say the model is auto-selected based on num_attention_heads (or
mention that supported_models.CogVideoX_T2V selects by num_attention_heads) so
the documentation matches the implementation.
```python
    Covers THUDM/CogVideoX-5b, THUDM/CogVideoX-1.5-5B, and the CogVideoX-Fun
    V1.5-5b family (including VOID inpainting). All of these have
    scaling_factor=0.7 in their vae/config.json. Auto-selected in
    supported_models.CogVideoX_T2V based on transformer hidden dim.
    """
```
Align selector wording with actual condition
Line 808 says auto-selection is based on transformer hidden dim, but the code path selects by num_attention_heads (see comfy/supported_models.py, Line 1860). Please align the docstring to avoid model-selection confusion.
This currently adds Pass 1 of the VOID model: https://github.com/Netflix/void-model
Workflow:
Pass 1: void_inpainting.json
Pass 2: void_inpainting_pass2.json
Both together: void_inpainting_full.json
Model weights:
https://huggingface.co/netflix/void-model/resolve/main/void_pass1.safetensors
https://huggingface.co/netflix/void-model/resolve/main/void_pass2.safetensors
Conversion script to comfyUI format:
convert_void.py
To make it work with the given workflows, the converted models should go into `diffusion_models` and be named `void_pass1_comfy.safetensors` and `void_pass2_comfy.safetensors`.

Here are sample inputs:
https://github.com/Netflix/void-model/tree/main/sample
I was testing with the lime video. Visual outputs are in Linear tickets CORE-48 and CORE-49.