
Void model - pass 1 & 2 (CORE-38)#13403

Merged
comfyanonymous merged 13 commits into master from void-model_CORE-38 on May 6, 2026

Conversation


@Talmaj Talmaj commented Apr 14, 2026

This currently adds Pass 1 of the VOID model: https://github.com/Netflix/void-model

Workflows:
  • Pass 1: void_inpainting.json
  • Pass 2: void_inpainting_pass2.json
  • Both together: void_inpainting_full.json


Model weights:
https://huggingface.co/netflix/void-model/resolve/main/void_pass1.safetensors
https://huggingface.co/netflix/void-model/resolve/main/void_pass2.safetensors

Conversion script to ComfyUI format:
convert_void.py

To make it work with the given workflows, the converted models should go into diffusion_models and be named void_pass1_comfy.safetensors and void_pass2_comfy.safetensors.

Here are sample inputs:
https://github.com/Netflix/void-model/tree/main/sample

I tested with the lime video; visual outputs are in Linear tickets CORE-48 and CORE-49.

@Talmaj Talmaj force-pushed the void-model_CORE-38 branch from 58cc799 to a0a69c9 on April 14, 2026 15:05
@Talmaj Talmaj changed the title Void model - pass 1 (CORE-38) Void model - pass 1 (CORE-48) Apr 14, 2026
Comment thread comfy_extras/nodes_void.py
@Talmaj Talmaj changed the title Void model - pass 1 (CORE-48) Void model - pass 1 & 2 (CORE-38) Apr 16, 2026
@Talmaj Talmaj force-pushed the void-model_CORE-38 branch from b68042b to 12809a0 on April 23, 2026 11:29

Talmaj commented Apr 27, 2026

I've removed the rp dependency for computation of optical flow.

Base automatically changed from cogvideox to master April 29, 2026 23:30
coderabbitai Bot commented Apr 29, 2026

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)
  • models/optical_flow/put_optical_flow_models_here
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 734bab60-cdb8-4b57-9e57-16d8c81e722c

📥 Commits

Reviewing files that changed from the base of the PR and between 0fca6d7 and 5e287b9.

📒 Files selected for processing (1)
  • models/optical_flow/put_optical_flow_models_here

📝 Walkthrough

Walkthrough

Adds CogVideoX support: expands CogVideoX docs and adds CogVideoX1_5 (sets scale_factor = 0.7). Introduces CLIPType.COGVIDEOX, CogVideoX tokenizer/encoder wrappers and cogvideo_te factory. CogVideoX_T2V now may auto-select CogVideoX1_5 for 5B-class U‑Net configs and a new CogVideoX_Inpaint model variant is added. Implements a VOID inpainting extension (comfy_extras/nodes_void.py) with RAFT optical-flow, warped-noise generation (void_noise_warp.py), a VOID DDIM sampler, adds an optical_flow folder mapping, and a cogvideox CLIPLoader preset.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 29.63%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title 'Void model - pass 1 & 2 (CORE-38)' directly references the Netflix VOID model implementation being added, which is the primary focus of this changeset.
  • Description check (✅ Passed): The description provides relevant context about the VOID model integration, including workflow files, model weights, conversion scripts, and testing results that directly relate to the changes being made.
  • Linked Issues check (✅ Passed): Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_void.py`:
- Around line 155-162: The quadmask resizing currently uses
comfy.utils.common_upscale with "bilinear", which creates interpolated values
and destroys the discrete 4-class mask; change the upscale call in the quadmask
flow (variable qm) to use nearest-neighbor sampling ("nearest-exact") or, if you
must keep bilinear, immediately re-quantize qm back to the original 4 discrete
classes before it is used to build masked_video or encoded into the mask latent;
update the comfy.utils.common_upscale invocation that moves dims on qm (and any
subsequent qm processing) to apply the nearest-exact mode or a quantization step
to restore class indices.
- Around line 295-296: The current use of warped_tensor.repeat(batch_size, 1, 1,
1, 1) duplicates the identical warped-noise across the batch (in function/method
where warped_tensor and batch_size are used), producing identical outputs per
item; change this by either (A) generating distinct per-batch warped
tensors/noise before repeating so each batch index has unique noise (replace the
single warped_tensor with a batch-sized tensor of independent warps), or (B) if
per-batch noise is not implemented yet, cap the effective batch replication to 1
(e.g., use effective_batch = min(batch_size, 1) and only repeat when
effective_batch > 1) and remove or disable the batch_size input until unique
per-batch noise is implemented; update any references to warped_tensor.repeat
accordingly.
- Around line 21-38: The helper _valid_void_length currently maps any input <5
up to 5, breaking the “round down” contract; change it to explicitly reject
unsupported very-short clips by validating the input against a minimum allowed
frame count (e.g. 5) using the same constants TEMPORAL_COMPRESSION and
PATCH_SIZE_T, and raise a clear exception (ValueError) when length < minimum;
also update the other affected spots (the similar logic around the blocks
referenced at lines 119-123 and 228-230) or the input schemas so they enforce
the same minimum rather than allowing silent expansion.
- Around line 219-239: VOIDWarpedNoise currently produces non-reproducible noise
because no seed is threaded through; update the node to accept a seed input in
define_schema and propagate that seed into execute so it is passed to
get_noise_from_video (in comfy_extras/void_noise_warp.py) instead of using
hard-coded seed=0, and modify Noise_FromLatent usage to initialize/clone with
the provided seed rather than 0; ensure all other occurrences mentioned (around
lines 270-277 and 302-309) similarly accept and forward the seed parameter to
maintain deterministic sampling across Pass 2.

In `@comfy_extras/void_noise_warp.py`:
- Around line 355-358: The code currently caches a RAFT model instance already
moved to device (see raft_large and self.model in _get_raft_model), which pins
CUDA memory; instead ensure the cached object is CPU-resident or cache only a
CPU state (e.g., state_dict) and perform the .to(device) only when
returning/using the model. Update _get_raft_model so the cached model is kept on
CPU (or store its state_dict) and any call that needs the model moves it to the
requested device (and moves it back or discards the device copy after use) to
avoid permanently pinning VRAM; apply the same change to the other cache site
around lines 390-394 that likewise sets model.to(device).
- Around line 452-456: The code currently moves and converts the entire
video_frames tensor to device and permutes it into frames, causing O(T) VRAM;
instead, keep video_frames on CPU and only transfer/convert the two frames
needed for each RAFT pass (prev and curr). Change logic around the variable
frames/video_frames so that you index two frames at a time from video_frames on
CPU, permute/to(torch.float32)/div255 and move each selected frame to device
before resizing with F.interpolate, use max(1, ...) for new_h/new_w as before,
and release or detach the device tensors after the RAFT step; apply the same
per-frame transfer/resizing pattern for the other block referenced (lines around
489-495).
- Around line 363-365: Validate input height/width before the resize/downscale
path: ensure h and w (and the computed new_h = (h // 8) * 8 and new_w = (w // 8)
* 8) are not zero and clamp them to a safe minimum (e.g., at least 1, or to 8 if
you require multiples of 8) before calling _torch_resize_chw; if clamping would
change aspect or is invalid, raise a clear ValueError describing the tiny input
so callers can handle it. Apply the same guard/clamp logic in the other
resize/downscale blocks (the regions around the downscale() usage and the blocks
mentioned at lines 461-478 and 480-485) so RAFT preprocessing that floors a
dimension to 0 cannot lead to a zero-sized tensor being passed into
_torch_resize_chw or downscale().
- Line 365: After upsampling the flow field spatially with _torch_resize_chw
(the same call used for image at "image = _torch_resize_chw(image, (new_h,
new_w), ...)" ), rescale the flow vectors so their magnitudes match the original
resolution by multiplying the x (horizontal) channel by (orig_width / new_w) and
the y (vertical) channel by (orig_height / new_h); apply this correction right
after the resize that returns flow to source size (and repeat the same fix in
the other block around lines 381-384) while preserving tensor dtype/device.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa0363b5-b5b4-4c26-bb4a-57fbe9c35a7c

📥 Commits

Reviewing files that changed from the base of the PR and between a164c82 and d71133f.

📒 Files selected for processing (7)
  • comfy/latent_formats.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
  • comfy_extras/nodes_void.py
  • comfy_extras/void_noise_warp.py
  • nodes.py

Comment on lines +21 to +38
def _valid_void_length(length: int) -> int:
"""Round ``length`` down to a value that produces an even latent_t.

VOID / CogVideoX-Fun-V1.5 uses patch_size_t=2, so the VAE-encoded latent
must have an even temporal dimension. If latent_t is odd, the transformer
pad_to_patch_size circular-wraps an extra latent frame onto the end; after
the post-transformer crop the last real latent frame has been influenced
by the wrapped phantom frame, producing visible jitter and "disappearing"
subjects near the end of the decoded video. Rounding down fixes this.
"""
latent_t = ((length - 1) // TEMPORAL_COMPRESSION) + 1
if latent_t % PATCH_SIZE_T == 0:
return length
# Round latent_t down to the nearest multiple of PATCH_SIZE_T, then invert
# the ((length - 1) // TEMPORAL_COMPRESSION) + 1 formula. Floor at 1 frame
# so we never return a non-positive length.
target_latent_t = max(PATCH_SIZE_T, (latent_t // PATCH_SIZE_T) * PATCH_SIZE_T)
return (target_latent_t - 1) * TEMPORAL_COMPRESSION + 1
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Lengths 1..4 are silently expanded to five frames.

_valid_void_length() returns 5 for any input below 5, so the UI contract to “round down” is false for the lowest allowed values. The downstream padding/resampling paths then synthesize extra timesteps from repeated last-frame data instead of honoring the requested clip length. Please either reject <5 in both schemas or make the helper fail explicitly for unsupported short clips.

Also applies to: 119-123, 228-230

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_void.py` around lines 21 - 38, The helper
_valid_void_length currently maps any input <5 up to 5, breaking the “round
down” contract; change it to explicitly reject unsupported very-short clips by
validating the input against a minimum allowed frame count (e.g. 5) using the
same constants TEMPORAL_COMPRESSION and PATCH_SIZE_T, and raise a clear
exception (ValueError) when length < minimum; also update the other affected
spots (the similar logic around the blocks referenced at lines 119-123 and
228-230) or the input schemas so they enforce the same minimum rather than
allowing silent expansion.
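A minimal sketch of the validating variant this comment asks for. The constant names come from the snippet above; the values (TEMPORAL_COMPRESSION = 4, PATCH_SIZE_T = 2) are assumptions inferred from the "49 → 45" example in the node tooltip, not taken from the source:

```python
TEMPORAL_COMPRESSION = 4  # VAE temporal downscale factor (assumed value)
PATCH_SIZE_T = 2          # transformer temporal patch size (assumed value)

# Smallest pixel length that yields an even latent_t (= 5 with these constants).
MIN_LENGTH = (PATCH_SIZE_T - 1) * TEMPORAL_COMPRESSION + 1

def valid_void_length(length: int) -> int:
    """Round length down so latent_t is even; reject clips that are too short."""
    if length < MIN_LENGTH:
        raise ValueError(
            f"length={length} is below the minimum of {MIN_LENGTH} frames "
            f"needed for an even latent temporal dimension"
        )
    latent_t = ((length - 1) // TEMPORAL_COMPRESSION) + 1
    if latent_t % PATCH_SIZE_T == 0:
        return length
    # Round latent_t down to a multiple of PATCH_SIZE_T, then invert the formula.
    target_latent_t = (latent_t // PATCH_SIZE_T) * PATCH_SIZE_T
    return (target_latent_t - 1) * TEMPORAL_COMPRESSION + 1
```

With these assumed constants, 49 rounds down to 45 as in the tooltip, while 1..4 now fail loudly instead of being silently expanded to 5.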

Comment on lines +155 to +162
qm = quadmask[:length]
if qm.ndim == 3:
qm = qm.unsqueeze(-1)
qm = comfy.utils.common_upscale(
qm.movedim(-1, 1), width, height, "bilinear", "center"
).movedim(1, -1)
if qm.ndim == 4 and qm.shape[-1] == 1:
qm = qm.squeeze(-1)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the quadmask’s discrete classes when resizing.

This is a 4-level semantic mask, but resizing it with "bilinear" invents intermediate values before you build masked_video and encode the mask latent. That blurs the class boundaries the preprocess node just quantized. Use "nearest-exact" here, or re-quantize after the resize.

Suggested fix
         qm = comfy.utils.common_upscale(
-            qm.movedim(-1, 1), width, height, "bilinear", "center"
+            qm.movedim(-1, 1), width, height, "nearest-exact", "center"
         ).movedim(1, -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_void.py` around lines 155 - 162, The quadmask resizing
currently uses comfy.utils.common_upscale with "bilinear", which creates
interpolated values and destroys the discrete 4-class mask; change the upscale
call in the quadmask flow (variable qm) to use nearest-neighbor sampling
("nearest-exact") or, if you must keep bilinear, immediately re-quantize qm back
to the original 4 discrete classes before it is used to build masked_video or
encoded into the mask latent; update the comfy.utils.common_upscale invocation
that moves dims on qm (and any subsequent qm processing) to apply the
nearest-exact mode or a quantization step to restore class indices.
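If bilinear resizing must be kept, the re-quantization fallback amounts to snapping each interpolated value back to the nearest class level. A toy per-value sketch; the four levels 0, 1/3, 2/3, 1 are an assumption, so substitute whatever encoding the preprocess node actually emits:

```python
CLASS_LEVELS = (0.0, 1 / 3, 2 / 3, 1.0)  # assumed encoding of the 4 mask classes

def requantize(value: float) -> float:
    # Snap a bilinearly interpolated mask value back to the nearest class level,
    # restoring the discrete boundaries that interpolation blurred.
    return min(CLASS_LEVELS, key=lambda c: abs(c - value))
```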

Comment thread comfy_extras/nodes_void.py Outdated
Comment on lines +219 to +239
@classmethod
def define_schema(cls):
return io.Schema(
node_id="VOIDWarpedNoise",
category="latent/video",
inputs=[
io.Image.Input("video", tooltip="Pass 1 output video frames [T, H, W, 3]"),
io.Int.Input("width", default=672, min=16, max=nodes.MAX_RESOLUTION, step=8),
io.Int.Input("height", default=384, min=16, max=nodes.MAX_RESOLUTION, step=8),
io.Int.Input("length", default=45, min=1, max=nodes.MAX_RESOLUTION, step=1,
tooltip="Number of pixel frames. Rounded down to make latent_t "
"even (patch_size_t=2 requirement), e.g. 49 → 45."),
io.Int.Input("batch_size", default=1, min=1, max=64),
],
outputs=[
io.Latent.Output(display_name="warped_noise"),
],
)

@classmethod
def execute(cls, video, width, height, length, batch_size) -> io.NodeOutput:

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Thread a seed through warped-noise generation.

VOIDWarpedNoise currently bakes stochastic noise without any seed input, comfy_extras/void_noise_warp.py:401-425 shows get_noise_from_video(...) has no seed parameter, and Noise_FromLatent hard-codes seed = 0 while just cloning the precomputed tensor. That means a fixed sampler seed cannot reproduce Pass 2 results once this node is in the graph.

Also applies to: 270-277, 302-309

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_void.py` around lines 219 - 239, VOIDWarpedNoise currently
produces non-reproducible noise because no seed is threaded through; update the
node to accept a seed input in define_schema and propagate that seed into
execute so it is passed to get_noise_from_video (in
comfy_extras/void_noise_warp.py) instead of using hard-coded seed=0, and modify
Noise_FromLatent usage to initialize/clone with the provided seed rather than 0;
ensure all other occurrences mentioned (around lines 270-277 and 302-309)
similarly accept and forward the seed parameter to maintain deterministic
sampling across Pass 2.
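The fix amounts to making the seed an explicit parameter everywhere noise is drawn. A hypothetical stand-in to show the threading only; the real get_noise_from_video operates on tensors and this signature is illustrative:

```python
import random

def get_noise_from_video(num_frames: int, h: int, w: int, seed: int = 0):
    # Hypothetical signature: an explicit seed replaces the hard-coded 0 so a
    # fixed sampler seed reproduces the same warped-noise base in Pass 2.
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(w)]
             for _ in range(h)]
            for _ in range(num_frames)]
```

The same seed then flows from define_schema through execute into this call, so two runs with identical inputs produce identical warped noise.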

Comment on lines +295 to +296
if batch_size > 1:
warped_tensor = warped_tensor.repeat(batch_size, 1, 1, 1, 1)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

batch_size > 1 currently duplicates the same sample.

Repeating the exact same warped-noise tensor across the batch makes every item follow the same deterministic trajectory, so this node’s batch_size knob produces duplicate videos rather than multiple candidates. If distinct per-batch noise is not implemented yet, it would be safer to cap this at 1 or remove the input for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_void.py` around lines 295 - 296, The current use of
warped_tensor.repeat(batch_size, 1, 1, 1, 1) duplicates the identical
warped-noise across the batch (in function/method where warped_tensor and
batch_size are used), producing identical outputs per item; change this by
either (A) generating distinct per-batch warped tensors/noise before repeating
so each batch index has unique noise (replace the single warped_tensor with a
batch-sized tensor of independent warps), or (B) if per-batch noise is not
implemented yet, cap the effective batch replication to 1 (e.g., use
effective_batch = min(batch_size, 1) and only repeat when effective_batch > 1)
and remove or disable the batch_size input until unique per-batch noise is
implemented; update any references to warped_tensor.repeat accordingly.
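Option (A) can be sketched by deriving one independent stream per batch index instead of repeating a single tensor. Pure-Python stand-in; offsetting the seed per index is one simple derivation scheme, not necessarily the one the node should adopt:

```python
import random

def batched_noise(base_seed: int, batch_size: int, numel: int):
    # Each batch index gets its own RNG stream, so the batch holds distinct
    # candidates rather than copies of one deterministic trajectory.
    return [
        [random.Random(base_seed + i).gauss(0.0, 1.0) for _ in range(numel)]
        for i in range(batch_size)
    ]
```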

Comment thread comfy_extras/void_noise_warp.py Outdated
Comment on lines +363 to +365
new_h = (h // 8) * 8
new_w = (w // 8) * 8
image = _torch_resize_chw(image, (new_h, new_w), "bilinear", copy=False)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate tiny inputs before the resize/downscale path.

Small or aggressively downscaled clips can still crash here: the RAFT preprocess can floor a dimension to 0, and the output buffer uses floor division while downscale() clamps to at least 1x1. A clear upfront validation would avoid shape mismatches and interpolate errors.

Also applies to: 461-478, 480-485

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 363 - 365, Validate input
height/width before the resize/downscale path: ensure h and w (and the computed
new_h = (h // 8) * 8 and new_w = (w // 8) * 8) are not zero and clamp them to a
safe minimum (e.g., at least 1, or to 8 if you require multiples of 8) before
calling _torch_resize_chw; if clamping would change aspect or is invalid, raise
a clear ValueError describing the tiny input so callers can handle it. Apply the
same guard/clamp logic in the other resize/downscale blocks (the regions around
the downscale() usage and the blocks mentioned at lines 461-478 and 480-485) so
RAFT preprocessing that floors a dimension to 0 cannot lead to a zero-sized
tensor being passed into _torch_resize_chw or downscale().
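The guard itself is small; the point is to fail with a clear message before any resize runs. A sketch under the review's assumptions (function name and the multiple-of-8 requirement are illustrative):

```python
def validated_raft_dims(h: int, w: int, multiple: int = 8):
    # Floor both dimensions to a multiple of 8 for RAFT preprocessing, but
    # raise early instead of letting a dimension silently collapse to zero.
    new_h = (h // multiple) * multiple
    new_w = (w // multiple) * multiple
    if new_h == 0 or new_w == 0:
        raise ValueError(
            f"input {h}x{w} is too small for RAFT preprocessing; "
            f"both sides must be at least {multiple}"
        )
    return new_h, new_w
```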

Comment thread comfy_extras/void_noise_warp.py
Comment on lines +452 to +456
frames = video_frames.to(device).permute(0, 3, 1, 2).to(torch.float32) / 255.0
if resize_frames != 1.0:
new_h = max(1, int(frames.shape[2] * resize_frames))
new_w = max(1, int(frames.shape[3] * resize_frames))
frames = F.interpolate(frames, size=(new_h, new_w), mode="area")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t upload the full clip to the execution device up front.

This converts and stores every frame on device, but the loop only ever needs prev and curr. On longer videos that turns a two-frame RAFT pass into O(T) VRAM usage and makes VOID much easier to OOM next to the diffusion model.

Also applies to: 489-495

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 452 - 456, The code currently
moves and converts the entire video_frames tensor to device and permutes it into
frames, causing O(T) VRAM; instead, keep video_frames on CPU and only
transfer/convert the two frames needed for each RAFT pass (prev and curr).
Change logic around the variable frames/video_frames so that you index two
frames at a time from video_frames on CPU, permute/to(torch.float32)/div255 and
move each selected frame to device before resizing with F.interpolate, use
max(1, ...) for new_h/new_w as before, and release or detach the device tensors
after the RAFT step; apply the same per-frame transfer/resizing pattern for the
other block referenced (lines around 489-495).
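The per-pair transfer pattern can be sketched as a generator. Here to_device stands in for the .to(device) plus float conversion step, and the real loop would also resize each frame before the RAFT call:

```python
def frame_pairs(video_frames, to_device):
    # Yield (prev, curr) so only two frames are ever device-resident,
    # keeping RAFT's memory footprint O(1) in clip length instead of O(T).
    prev = to_device(video_frames[0])
    for i in range(1, len(video_frames)):
        curr = to_device(video_frames[i])
        yield prev, curr
        prev = curr
```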

Comment thread comfy_extras/void_noise_warp.py Outdated
@Talmaj Talmaj force-pushed the void-model_CORE-38 branch 2 times, most recently from 04181ef to d56887a on May 1, 2026 19:51
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
comfy_extras/void_noise_warp.py (4)

367-374: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate tiny clips before RAFT preprocessing / downscale allocation.

Lines 370-372 can collapse either dimension to 0, and lines 471-474 can also allocate a zero-sized output when the internal noise resolution is smaller than downscale_factor. In both cases the first F.interpolate or assignment will fail instead of raising a clear error.

Suggested direction
     def _preprocess(self, image_chw):
         image = image_chw.to(self.device, torch.float32)
         _, h, w = image.shape
+        if h < 8 or w < 8:
+            raise ValueError(
+                f"RaftOpticalFlow: input too small for 8x preprocessing, got {h}x{w}"
+            )
         new_h = (h // 8) * 8
         new_w = (w // 8) * 8

Also applies to: 471-482

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 367 - 374, The _preprocess
function can compute new_h or new_w == 0 (collapsing dimensions) and later
RAFT/downscale allocation (where downscale_factor and F.interpolate are used)
can produce zero-sized tensors; add guards to both: in _preprocess (and the
RAFT/noise resolution allocation code path) check that computed new_h and new_w
are > 0 (and that image height/width are >= downscale_factor) and raise a clear
ValueError if not, before calling _torch_resize_chw or F.interpolate/allocating
outputs; include the failing image size, computed new_h/new_w and
downscale_factor in the error message to aid debugging.

388-391: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scale the flow vectors after resizing them back to source size.

Line 390 only upsamples the flow field spatially; the x/y displacements still need to be multiplied by the resize ratios. Without that correction, the warp is too weak whenever RAFT output is resized back to (h, w).

Suggested fix
             list_of_flows = self.model(img1, img2)
             flow = list_of_flows[-1][0]  # (2, new_h, new_w)
             if flow.shape[-2:] != (h, w):
+                src_h, src_w = flow.shape[-2:]
                 flow = _torch_resize_chw(flow, (h, w), "bilinear", copy=False)
+                flow[0] *= w / src_w
+                flow[1] *= h / src_h
         return flow
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 388 - 391, When resizing the
flow back to (h, w) you must also scale the x/y displacement channels by the
spatial resize ratios: before calling _torch_resize_chw on flow (the tensor from
list_of_flows[-1][0] with shape (2, orig_h, orig_w)), save orig_h, orig_w =
flow.shape[-2:], compute scale_x = w / orig_w and scale_y = h / orig_h, then
after the resize multiply flow[0,...] *= scale_x and flow[1,...] *= scale_y (or
the equivalent channel indices in your flow convention) so the displacement
magnitudes match the source resolution.
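The same correction on nested lists, to make the scaling concrete. Here flow is [x_plane, y_plane] already spatially resized to (dst_h, dst_w), while its values still measure pixels at the smaller (src_h, src_w) resolution:

```python
def rescale_flow(flow, src_h, src_w, dst_h, dst_w):
    # Multiply x displacements by the width ratio and y displacements by the
    # height ratio so vector magnitudes match the destination resolution.
    sx = dst_w / src_w
    sy = dst_h / src_h
    x_plane = [[v * sx for v in row] for row in flow[0]]
    y_plane = [[v * sy for v in row] for row in flow[1]]
    return [x_plane, y_plane]
```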

357-365: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the RAFT weights off the execution device until inference.

Line 362 moves the cached model onto device during construction, so every RaftOpticalFlow instance pins its weights in VRAM for the rest of the process. That defeats ComfyUI’s offload behavior and raises memory pressure for later nodes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 357 - 365, The constructor for
RaftOpticalFlow currently moves the model onto device (model = model.to(device))
which pins weights in VRAM; remove that call so the model stays on whatever
device it was cached (do not call .to(device) in __init__), keep model.eval()
and store self.model = model and self.device = device, and instead move the
model to the execution device right before/inside the inference routine (e.g.,
in the method that computes flow such as compute_flow/forward/run_inference —
call model.to(self.device) there, run inference, then optionally move it back or
keep offloaded) so RAFT weights aren’t pinned in VRAM for the lifetime of the
object.

450-492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t materialize the whole video on the GPU.

Line 451 transfers all frames to device up front even though the loop only needs two frames at a time. On longer clips that turns VOID into an O(T) VRAM spike and makes it much easier to OOM next to the diffusion model.

Suggested direction
-    frames = video_frames.to(device).permute(0, 3, 1, 2).to(torch.float32) / 255.0
+    frames = video_frames.permute(0, 3, 1, 2)
...
-        prev = frames[0]
+        prev = frames[0].to(device, torch.float32) / 255.0
         for i in range(1, T):
-            curr = frames[i]
+            curr = frames[i].to(device, torch.float32) / 255.0
             flow = raft(prev, curr).to(device)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/void_noise_warp.py` around lines 450 - 492, The code currently
moves the entire tensor video_frames to GPU at once (frames =
video_frames.to(device)...), which OOMs for long clips; instead keep
video_frames on CPU and only transfer/convert the two frames needed per
iteration: remove the upfront .to(device) and global permute, compute H,W (and
any resizing) using the first frame on CPU (apply F.interpolate per-frame if
resize_frames != 1.0), then implement a small helper like process_frame(idx)
that takes video_frames[idx], applies CPU-side resize and normalization
(permute, float /255.0) and then .to(device) before using raft/warper; use
process_frame(0) to create prev, and inside the loop call process_frame(i) for
curr, so only two frames live on GPU at a time while leaving warper, raft usage
and downscale logic (warper, NoiseWarper, downscale, raft, output assignment)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/latent_formats.py`:
- Around line 805-809: The docstring in comfy/latent_formats.py incorrectly
states that auto-selection of the CogVideoX models (referenced as
supported_models.CogVideoX_T2V) is based on "transformer hidden dim"; update
that wording to reflect the actual selection condition used in
comfy/supported_models.py (the selector uses num_attention_heads). Edit the
docstring to say the model is auto-selected based on num_attention_heads (or
mention that supported_models.CogVideoX_T2V selects by num_attention_heads) so
the documentation matches the implementation.

---

Duplicate comments:
In `@comfy_extras/void_noise_warp.py`:
- Around line 367-374: The _preprocess function can compute new_h or new_w == 0
(collapsing dimensions) and later RAFT/downscale allocation (where
downscale_factor and F.interpolate are used) can produce zero-sized tensors; add
guards to both: in _preprocess (and the RAFT/noise resolution allocation code
path) check that computed new_h and new_w are > 0 (and that image height/width
are >= downscale_factor) and raise a clear ValueError if not, before calling
_torch_resize_chw or F.interpolate/allocating outputs; include the failing image
size, computed new_h/new_w and downscale_factor in the error message to aid
debugging.
- Around line 388-391: When resizing the flow back to (h, w) you must also scale
the x/y displacement channels by the spatial resize ratios: before calling
_torch_resize_chw on flow (the tensor from list_of_flows[-1][0] with shape (2,
orig_h, orig_w)), save orig_h, orig_w = flow.shape[-2:], compute scale_x = w /
orig_w and scale_y = h / orig_h, then after the resize multiply flow[0,...] *=
scale_x and flow[1,...] *= scale_y (or the equivalent channel indices in your
flow convention) so the displacement magnitudes match the source resolution.
- Around line 357-365: The constructor for RaftOpticalFlow currently moves the
model onto device (model = model.to(device)) which pins weights in VRAM; remove
that call so the model stays on whatever device it was cached (do not call
.to(device) in __init__), keep model.eval() and store self.model = model and
self.device = device, and instead move the model to the execution device right
before/inside the inference routine (e.g., in the method that computes flow such
as compute_flow/forward/run_inference — call model.to(self.device) there, run
inference, then optionally move it back or keep offloaded) so RAFT weights
aren’t pinned in VRAM for the lifetime of the object.
- Around line 450-492: The code currently moves the entire tensor video_frames
to GPU at once (frames = video_frames.to(device)...), which OOMs for long clips;
instead keep video_frames on CPU and only transfer/convert the two frames needed
per iteration: remove the upfront .to(device) and global permute, compute H,W
(and any resizing) using the first frame on CPU (apply F.interpolate per-frame
if resize_frames != 1.0), then implement a small helper like process_frame(idx)
that takes video_frames[idx], applies CPU-side resize and normalization
(permute, float /255.0) and then .to(device) before using raft/warper; use
process_frame(0) to create prev, and inside the loop call process_frame(i) for
curr, so only two frames live on GPU at a time while leaving warper, raft usage
and downscale logic (warper, NoiseWarper, downscale, raft, output assignment)
unchanged.
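The dimension guard described in the first duplicate comment can be sketched as a small standalone helper. The function name `_check_resize_dims` and its exact signature are illustrative assumptions, not taken from the actual file:

```python
def _check_resize_dims(h, w, new_h, new_w, downscale_factor):
    """Fail loudly before a resize/interpolate would produce a zero-sized tensor."""
    if new_h <= 0 or new_w <= 0 or h < downscale_factor or w < downscale_factor:
        raise ValueError(
            f"Cannot resize {h}x{w} image: computed target {new_h}x{new_w} "
            f"with downscale_factor={downscale_factor} would collapse a dimension."
        )

# A 4x4 frame with downscale_factor=8 would collapse to 0x0:
try:
    _check_resize_dims(4, 4, 4 // 8, 4 // 8, 8)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Including the failing sizes in the message, as the comment suggests, turns a cryptic zero-dimension tensor error deep inside `F.interpolate` into an actionable one at the call site.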
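The lazy device placement from the third duplicate comment could look roughly like the sketch below. A toy `ToyFlowNet` stands in for the real RAFT model, and the class name plus the offload-back-to-CPU step are illustrative assumptions:

```python
import torch


class ToyFlowNet(torch.nn.Module):
    """Stand-in for RAFT: the 'flow' is just the per-pixel frame difference."""
    def forward(self, img1, img2):
        return img2 - img1


class LazyDeviceFlow:
    """Keep model weights off the execution device until inference time."""
    def __init__(self, model, device):
        self.model = model.eval()  # stays on whatever device it was cached on
        self.device = device

    @torch.no_grad()
    def __call__(self, img1, img2):
        self.model.to(self.device)  # move weights only for this call
        flow = self.model(img1.to(self.device), img2.to(self.device))
        self.model.to("cpu")  # offload again so VRAM is free between calls
        return flow


flow_fn = LazyDeviceFlow(ToyFlowNet(), "cpu")
a = torch.zeros(1, 3, 8, 8)
b = torch.ones(1, 3, 8, 8)
print(flow_fn(a, b).mean().item())  # 1.0
```

The point is that `__init__` never pins the weights on the GPU; they occupy VRAM only for the duration of a single flow computation.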

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ee651dd4-4241-4126-af7d-f780c883120c

📥 Commits

Reviewing files that changed from the base of the PR and between 04181ef and d56887a.

📒 Files selected for processing (8)
  • comfy/latent_formats.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
  • comfy_extras/nodes_void.py
  • comfy_extras/void_noise_warp.py
  • folder_paths.py
  • nodes.py
✅ Files skipped from review due to trivial changes (2)
  • folder_paths.py
  • comfy/text_encoders/cogvideo.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • comfy/sd.py
  • nodes.py
  • comfy_extras/nodes_void.py

Comment thread comfy/latent_formats.py
Comment on lines +805 to +809
Covers THUDM/CogVideoX-5b, THUDM/CogVideoX-1.5-5B, and the CogVideoX-Fun
V1.5-5b family (including VOID inpainting). All of these have
scaling_factor=0.7 in their vae/config.json. Auto-selected in
supported_models.CogVideoX_T2V based on transformer hidden dim.
"""

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align selector wording with actual condition

Line 808 says auto-selection is based on transformer hidden dim, but the code path selects by num_attention_heads (see comfy/supported_models.py, Line 1860). Please align the docstring to avoid model-selection confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/latent_formats.py` around lines 805 - 809, The docstring in
comfy/latent_formats.py incorrectly states that auto-selection of the CogVideoX
models (referenced as supported_models.CogVideoX_T2V) is based on "transformer
hidden dim"; update that wording to reflect the actual selection condition used
in comfy/supported_models.py (the selector uses num_attention_heads). Edit the
docstring to say the model is auto-selected based on num_attention_heads (or
mention that supported_models.CogVideoX_T2V selects by num_attention_heads) so
the documentation matches the implementation.

@Talmaj Talmaj force-pushed the void-model_CORE-38 branch from d56887a to 0fca6d7 Compare May 4, 2026 18:22
@Talmaj Talmaj requested a review from kijai as a code owner May 4, 2026 18:22
@coderabbitai coderabbitai Bot left a comment
♻️ Duplicate comments (2)
comfy_extras/void_noise_warp.py (2)

450-455: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the video on CPU and upload only prev/curr.

video_frames.to(device) makes memory usage scale with the whole clip, even though the RAFT loop only consumes two frames at a time. On longer videos this directly eats into the diffusion model’s VRAM budget and makes VOID much easier to OOM.

Also applies to: 486-492

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy_extras/void_noise_warp.py` around lines 450 - 455, The code currently
moves the entire video tensor to device with frames = video_frames.to(device)...
which consumes GPU memory; keep the full frames tensor on CPU (perform permute
and normalization on CPU) and only .to(device) the two frames actually used by
RAFT (prev/curr) inside the RAFT loop. Concretely: produce frames on CPU (e.g.,
frames = video_frames.permute(0,3,1,2).to(torch.float32)/255.0) and perform any
resize (F.interpolate) on CPU as needed, then when you call RAFT-related
functions use frames[idx].to(device) or frames[prev_idx].to(device) so only prev
and curr are uploaded; also apply the same change for the other block referenced
(lines ~486-492) where frames are currently moved to device. Ensure indexing
uses the CPU-resident frames and only moves the selected frames to device before
optical-flow/RAFT processing.
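A minimal sketch of that per-frame strategy, assuming `(T, H, W, C)` uint8 input and using `device="cpu"` here so it runs anywhere (the helper name `process_frame` comes from the review text; everything else is illustrative):

```python
import torch
import torch.nn.functional as F


def process_frame(video_frames, idx, device, resize_frames=1.0):
    """Resize/normalize one (H, W, C) uint8 frame on CPU, then upload it."""
    frame = video_frames[idx].permute(2, 0, 1).float() / 255.0  # CPU-side (C, H, W)
    if resize_frames != 1.0:
        h, w = frame.shape[-2:]
        frame = F.interpolate(frame[None],
                              size=(int(h * resize_frames), int(w * resize_frames)),
                              mode="bilinear", align_corners=False)[0]
    return frame.to(device)  # only this frame occupies the execution device


video = torch.randint(0, 256, (10, 32, 32, 3), dtype=torch.uint8)  # stays on CPU
prev = process_frame(video, 0, "cpu")
for i in range(1, video.shape[0]):
    curr = process_frame(video, i, "cpu")
    # ... compute optical flow on (prev, curr) here ...
    prev = curr  # only two frames are resident on the device at once
```

With this pattern, peak device memory for frame data is constant in clip length instead of linear, which is exactly the OOM risk the comment flags.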

384-391: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rescale dx/dy after resizing the flow back to (H, W).

RAFT predicts motion in the preprocessed multiple-of-8 coordinate system. This upsamples the field spatially, but it never scales the vector magnitudes, so non-multiple-of-8 inputs get systematically weaker warps.

Suggested fix:

```diff
         with torch.no_grad():
             img1 = self._preprocess(from_image)
             img2 = self._preprocess(to_image)
             list_of_flows = self.model(img1, img2)
             flow = list_of_flows[-1][0]  # (2, new_h, new_w)
             if flow.shape[-2:] != (h, w):
+                src_h, src_w = flow.shape[-2:]
                 flow = _torch_resize_chw(flow, (h, w), "bilinear", copy=False)
+                flow[0] *= w / src_w
+                flow[1] *= h / src_h
         return flow
```
         return flow
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy_extras/void_noise_warp.py` around lines 384 - 391, The flow magnitudes
aren't being rescaled after spatially resizing the RAFT output, so dx/dy become
incorrect for non-multiple-of-8 inputs; capture the original flow spatial shape
(orig_h, orig_w) from list_of_flows[-1][0] before calling _torch_resize_chw,
then after resizing the flow to (h, w) multiply the x-channel (flow[0]) by (w /
orig_w) and the y-channel (flow[1]) by (h / orig_h) so the vector magnitudes are
scaled from the preprocessed coordinate system to the final image coordinates;
adjust this in the block that handles flow = list_of_flows[-1][0] and the
subsequent resize branch that uses _torch_resize_chw.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@comfy_extras/void_noise_warp.py`:
- Around line 450-455: The code currently moves the entire video tensor to
device with frames = video_frames.to(device)... which consumes GPU memory; keep
the full frames tensor on CPU (perform permute and normalization on CPU) and
only .to(device) the two frames actually used by RAFT (prev/curr) inside the
RAFT loop. Concretely: produce frames on CPU (e.g., frames =
video_frames.permute(0,3,1,2).to(torch.float32)/255.0) and perform any resize
(F.interpolate) on CPU as needed, then when you call RAFT-related functions use
frames[idx].to(device) or frames[prev_idx].to(device) so only prev and curr are
uploaded; also apply the same change for the other block referenced (lines
~486-492) where frames are currently moved to device. Ensure indexing uses the
CPU-resident frames and only moves the selected frames to device before
optical-flow/RAFT processing.
- Around line 384-391: The flow magnitudes aren't being rescaled after spatially
resizing the RAFT output, so dx/dy become incorrect for non-multiple-of-8
inputs; capture the original flow spatial shape (orig_h, orig_w) from
list_of_flows[-1][0] before calling _torch_resize_chw, then after resizing the
flow to (h, w) multiply the x-channel (flow[0]) by (w / orig_w) and the
y-channel (flow[1]) by (h / orig_h) so the vector magnitudes are scaled from the
preprocessed coordinate system to the final image coordinates; adjust this in
the block that handles flow = list_of_flows[-1][0] and the subsequent resize
branch that uses _torch_resize_chw.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b0770de-3edc-43c8-9db2-78284363b1c1

📥 Commits

Reviewing files that changed from the base of the PR and between d56887a and 0fca6d7.

📒 Files selected for processing (8)
  • comfy/latent_formats.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
  • comfy_extras/nodes_void.py
  • comfy_extras/void_noise_warp.py
  • folder_paths.py
  • nodes.py
✅ Files skipped from review due to trivial changes (3)
  • comfy/supported_models.py
  • folder_paths.py
  • comfy/latent_formats.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • comfy/sd.py
  • comfy/text_encoders/cogvideo.py
  • comfy_extras/nodes_void.py

Comment thread folder_paths.py
@comfyanonymous comfyanonymous merged commit 78b3096 into master May 6, 2026
16 checks passed
@comfyanonymous comfyanonymous deleted the void-model_CORE-38 branch May 6, 2026 02:59