feat: Video generation by lstein · Pull Request #9163 · invoke-ai/InvokeAI

lstein · 2026-05-13T04:03:53Z

Summary

This PR adds basic support for AI-based video generation using the Wan 2.2 family of text-to-video and image-to-video models. Currently it can only be used through the workflow editor.

invoke-videos.mp4

This PR adds the following support:

A new Video type, with support for upload, thumbnail generation, and metadata storage.
New movie player functionality in the viewer.
A Wan Denoise Video node for denoising Wan 5D tensors into latents
A Latents To Video node for decoding the denoiser latents
A few image nodes for selecting frames from videos and concatenating videos together
Template workflows that show typical text-to-video and image-to-video graphs.

Important notes

It is dependent on Wan 2.2 stack: txt2img, img2img, inpaint, outpaint, LoRA support #9158, which must be merged before this PR. However the PR can be tested standalone.
It triggers a database migration, so run this in a test root.

Testing plan

Smoke-test T2V Lightning (832×480, 81 frames, 4 steps, CFG 1.0)
Smoke-test I2V Lightning with a 16:9 source image
Verify auto-switch to new video on completion, full-resolution first-frame preview, inline playback
Upload an MP4 from disk; verify it appears in the gallery and plays
Drag a gallery video onto a Video Primitive; verify the field populates
Exercise the video concatenate node by joining two videos using the various transition options
Run on Windows / macOS to confirm the imageio FFmpeg path works on those platforms

Getting Started Hints

Run uv pip install to pick up the new dependency on the ffmpeg library, and of course rebuild the front end.

Install the following from the starter model collection:

Wan 2.2 TI2V-5B (Q4_K_M) -- a very small, low-quality video generator suitable for 12 GB VRAM or less
Wan 2.2 T2V A14B High Noise (Q4_K_M) - text-to-video transformer, rough phase
Wan 2.2 T2V A14B Low Noise (Q4_K_M) - text-to-video transformer, refiner phase
Wan 2.2 T2V Lightning High Noise (4-step, V1.1) - turbo LoRA, rough phase
Wan 2.2 T2V Lightning Low Noise (4-step, V1.1) - turbo LoRA, refiner phase

The corresponding models for image-to-video are:

Wan 2.2 I2V A14B High Noise (Q4_K_M)
Wan 2.2 I2V A14B Low Noise (Q4_K_M)
Wan 2.2 I2V Lightning High Noise (4-step, V1.1)
Wan 2.2 I2V Lightning Low Noise (4-step, V1.1)

The encoder and VAE should download as dependencies.

Give it a spin! There are several working templates in the Workflow library that you can start with:

Text to Video - Wan 2.2 -- basic text to video. Select the TI2V-5B model if you are low on VRAM and leave "Transformer (Low Noise) empty. If you have >=16 GB, you can use the high-quality A14B models, apply the high noise transformer model to the Main Model "Transformer" field, and the low noise transformer model to the "Transformer (Low Noise)" field. Select the models you downloaded for the standalone VAE and T5 Encoder fields. Type in a prompt, and use the default values for CFG, image dimensions, frames and steps.
Text to Video - Wan 2.2 Lightning -- This is the same as above, but has loaders for the high and low noise lightning LoRAs which will reduce the number of required steps to 4.
Image to Video - Wan 2.2 -- basic reference image to video. You need to provide a reference image and a prompt describing what to do with it. The reference image will be the first frame of the image. The image should have the same aspect ratio as the desired video, but doesn't need to be exactly the same dimensions.
Image to Video - Wan 2.2 Lightning -- as above, but with nodes for the two lightning LoRAs for dramatic speedups.

The models work best across a limited series of dimensions. The ones I've tested are:

720 x 480
480 x 720
832 x 480
480 x 832

On my RTX 5060Ti it takes 3-4 minutes to generate a video when using the two Lightning LoRAs.

If you have lots of VRAM you can try increasing the frame count, but these videos get big fast. Alternatively, you can create a workflow that captures the last frame of the original video, generates a new video on top of it, and concatenates the two together. I've got a workflow that works well for this, but haven't added it to the branch yet.

🤖 Generated with help from Claude Code

Foundation + TI2V-5B MVP + A14B dual-expert MoE for Wan 2.2 image generation. Wan was trained on video but is competitive with leading open-source image models when run at num_frames=1; this commit wires that path into InvokeAI. Phase 0 — Foundation: - BaseModelType.Wan + WanVariantType {T2V_A14B, TI2V_5B} - SubModelType.Transformer2 for the dual-expert MoE - MainModelDefaultSettings per variant - step_callback Wan branch (16-channel preview; 48-channel TI2V-5B falls back to slicing first 16 channels until proper factors land) - Frontend enums + node colour Phase 1 — TI2V-5B Diffusers MVP: - Main_Diffusers_Wan_Config probe (variant from transformer_2/ + vae/config.json::z_dim, with filename heuristic fallback) - WanDiffusersModel loader (subclasses GenericDiffusersLoader) - WanT5EncoderField, WanTransformerField (with dual-expert slots), WanConditioningField, WanConditioningInfo - New invocations: wan_model_loader, wan_text_encoder, wan_denoise, wan_image_to_latents, wan_latents_to_image - FlowMatchEulerDiscreteScheduler integration with on-disk config load - RectifiedFlowInpaintExtension reused for inpaint - 5D <-> 4D shape juggling: latents stay 4D in InvokeAI's pipeline, re-add T=1 only inside the transformer call / VAE encode-decode Phase 2 — A14B dual-expert MoE: - Probe reads boundary_ratio from model_index.json - Loader emits both transformer (high-noise) and transformer_low_noise (low-noise expert at transformer_2/) for A14B - _ExpertSwapper in wan_denoise drives GPU residency between experts: high-noise for t >= boundary_ratio * num_train_timesteps, low-noise below. Only one expert locked at a time so the cache can evict the other - relies on existing CachedModelWithPartialLoad to handle oversized models on lower-VRAM GPUs. - guidance_scale_low_noise field for separate low-noise CFG override Tests: - 24 passing tests covering probe variant detection, default settings, noise sampling, end-to-end denoise on a synthetic transformer (CPU), dual-expert boundary swap, CFG branch - 1 heavy-test placeholder gated by INVOKEAI_HEAVY_TESTS=1 for the real-weights smoke test Phase 3+ deferred: standalone VAE/encoder configs, GGUF, LoRA, ControlNet, ref image, inpaint UI, frontend wiring, starter models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 adds standalone VAE and UMT5-XXL encoder configs so users can run GGUF-quantized Wan transformers (Phase 4) without installing the full ~30 GB Diffusers pipeline. VAE configs: - VAE_Checkpoint_Wan_Config + VAE_Diffusers_Wan_Config (16-channel A14B vs 48-channel TI2V-5B, distinguished by decoder.conv_in z_dim). - 16-channel files share the AutoencoderKLWan architecture with Qwen Image; disambiguated via filename heuristic ("wan" in name -> Wan, otherwise -> Qwen Image). Mirror exclusion in QwenImage's probe. - VAELoader gets a Wan branch that builds AutoencoderKLWan(z_dim=...) via init_empty_weights, mirroring the QwenImage single-file pattern. - Existing standard VAE probe excludes both QwenImage- and Wan-style state dicts. UMT5-XXL encoder: - New ModelType.WanT5Encoder + ModelFormat.WanT5Encoder. - WanT5Encoder_WanT5Encoder_Config probes the diffusers folder layout (text_encoder/config.json with model_type=umt5, or flat layout with config.json at root). Refuses full Wan pipelines. - WanT5EncoderLoader handles both layouts and loads UMT5EncoderModel + AutoTokenizer. Component-source plumbing: - WanModelLoaderInvocation now exposes wan_t5_encoder_model and component_source pickers (mirrors QwenImage pattern). Resolution order: standalone > main (if Diffusers) > component_source. Required when the main model is a single-file format in Phase 4. Bug fix in wan_text_encoder: - Tokenizer was loading via AutoTokenizer.from_pretrained(<root>) directly, which fails for nested layouts where files live in <root>/tokenizer/. Now routed through the model cache so the registered loaders handle layout differences correctly. Frontend: - New type guards (isWanVAEModelConfig, isWanT5EncoderModelConfig, isWanMainModelConfig, isWanDiffusersMainModelConfig) and hooks/ selectors (useWanVAEModels, useWanT5EncoderModels, useWanDiffusersModels). New zSubModelType / zModelType / zModelFormat enum entries for transformer_2 and wan_t5_encoder. Tests: - 16 new tests covering z_dim detection, VAE checkpoint/diffusers probes, the bidirectional Qwen-vs-Wan filename deferral, and the UMT5 encoder probe (nested + flat + T5 + full-pipeline rejection). - Total Wan test count: 41 passing, 1 heavy-test placeholder skipped. - Full config test suite (63 tests) still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): unbreak frontend lint after Wan additions Five issues turned up running `make frontend-lint`: 1. wan_denoise.py used `from __future__ import annotations`, which made the `invoke()` return annotation a string ('LatentsOutput'). The InvocationRegistry's `get_output_annotation()` returns the raw annotation, so OpenAPI generation crashed with `'str' object has no attribute '__name__'`. Removed the future-import and added `Any` to the typing imports. 2. ModelRecordChanges.variant didn't list WanVariantType, so the generated schema's install/update endpoints rejected `t2v_a14b` and `ti2v_5b`. Added it. 3. Regenerated frontend/web/src/services/api/schema.ts from the live backend so it now includes BaseModelType.wan, ModelType.wan_t5_encoder, SubModelType.transformer_2, ModelFormat.wan_t5_encoder, the Wan variants, all Wan invocation types and their conditioning/transformer field types. 4. modelManagerV2/models.ts: added `wan_t5_encoder` to the category map, `wan` to the base color/long-name/short-name maps, the two Wan variants to the variant-name map, and `wan_t5_encoder` to the format-name map. 5. ModelManagerPanel/ModelFormatBadge.tsx: added `wan_t5_encoder` to FORMAT_NAME_MAP and FORMAT_COLOR_MAP. `make frontend-lint` now passes cleanly (tsc, dpdm, eslint, prettier). All 41 Wan Python tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(wan): drop unused FE exports flagged by knip These were forward-compatibility wiring for Phase 9 (the FE graph builder) that has no consumers yet; knip rightly flagged them. Removed or de-exported. They'll come back when the graph builder lands and needs them. - common.ts: zWanVariantType drops `export` (still used internally by zAnyModelVariant). - types.ts: drop isWanMainModelConfig, isWanDiffusersMainModelConfig, isWanVAEModelConfig (no callers). The remaining isWanT5EncoderModelConfig is used by models.ts. WanT5EncoderModelConfig type drops `export` (still used as the type guard's narrowing target). - modelsByType.ts: drop the six unused useWan*/selectWan* hooks + selectors and their type-guard imports. `make frontend-lint` (tsc + dpdm + eslint + prettier + knip) now green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> docs(wan): use *-Diffusers HF repo names in plan The Wan-AI org publishes two flavours of each release: * Wan-AI/Wan2.2-{TI2V-5B,T2V-A14B,I2V-A14B} ← upstream native * Wan-AI/Wan2.2-{TI2V-5B,T2V-A14B,I2V-A14B}-Diffusers ← convertible The native release has _class_name=WanModel in config.json and ships weights flat at the repo root with no transformer/, vae/, text_encoder/ subdirs. It is not loadable by Diffusers' WanPipeline.from_pretrained. Update plan doc to reference the -Diffusers repos throughout (probe notes, starter-model entries) so the plumbing path matches what the Diffusers loader actually expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): accept 0 as 'unset' sentinel for guidance_scale_low_noise The frontend renders Optional[float] inputs with default 0 in the numeric input rather than passing null/unset. Combined with ge=1.0, this caused every wan_denoise invocation to fail Pydantic validation with "Input should be greater than or equal to 1" until the user manually entered a value (or knew to leave the field disconnected). The validation error was rejected before invocation logging, so it never showed up in the server log either - making the failure hard to diagnose. Relaxing the constraint to ge=0.0 and treating values below 1.0 as the "fall back to primary Guidance Scale" sentinel. The user's natural FE default (0) now works as expected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): correct preview dimensions and colors for TI2V-5B Two bugs in the Wan branch of the diffusion step callback: 1. Wrong dimensions. The reported preview size hardcoded `* 8` for the spatial downscale ratio, but TI2V-5B's Wan2.2-VAE uses 16x. A 1024x1024 target was being announced to the FE as 512x512. 2. Wrong colors. The previous fallback for 48-channel TI2V-5B latents sliced the first 16 channels and applied the standard 16-channel Wan-VAE projection. Those channel layouts are unrelated, so the projection produced meaningless colors. Adding the proper Wan2.2-VAE 48-channel RGB projection matrix (and bias) from ComfyUI's Wan22 latent format, and selecting the right matrix + spatial-scale by latent channel count: 16 → A14B (Wan VAE, 8x), 48 → TI2V-5B (Wan2.2-VAE, 16x). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): honor model's _class_name when building scheduler TI2V-5B's scheduler_config.json declares _class_name=UniPCMultistepScheduler with flow_shift=5.0. The previous code hardcoded FlowMatchEulerDiscreteScheduler.from_pretrained(...), which silently constructed a default-config FlowMatch instead of the UniPC the model expects. The mismatched noise schedule manifests as soft / under-denoised faces and global graininess in the final images. Now: read scheduler_config.json, look up the named class on the diffusers module, and instantiate that class via from_pretrained. UniPC and FlowMatch share the same step()/set_timesteps()/sigmas/num_train_timesteps interfaces, so the denoise loop works transparently for either. A14B continues to use FlowMatchEulerDiscreteScheduler when its scheduler config says so (its reference is FlowMatchEuler with shift=8.0). Falls back to FlowMatchEulerDiscreteScheduler defaults when no on-disk config is available. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): match diffusers WanPipeline tokenizer length and latent dtype Two divergences from the Diffusers reference that were hurting image quality (soft / grainy / distorted faces at default settings): 1. Tokenizer max_sequence_length was 226 in wan_text_encoder, but the model was trained with 512-token sequences. The upstream native config.json has text_len: 512, and Diffusers' WanPipeline.__call__ default is 512 (overriding _get_t5_prompt_embeds's stale 226 default). Wan's cross-attention sees padded zeros past the prompt's actual length but expects to be looking at a 512-position context window. 2. Latents were stored in bf16 throughout the denoise loop. Diffusers' WanPipeline.prepare_latents explicitly uses dtype=torch.float32 and only casts to the transformer's dtype right at the forward call: latent_model_input = latents.to(transformer_dtype) Storing in bf16 between steps accumulates ~40 steps of bf16 quantization on the scheduler's small per-step deltas. Now latent_dtype = torch.float32 throughout, with a per-step cast for the transformer forward pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(wan): add diffusers reference comparison script scripts/wan_diffusers_reference.py runs a Diffusers-format Wan 2.2 checkpoint directly via WanPipeline.from_pretrained, with the same arguments InvokeAI's wan_denoise uses. Use to A/B against InvokeAI output when image quality is questionable. Defaults to enable_model_cpu_offload so the script fits on 16 GB cards where the full pipeline (transformer + UMT5-XXL + VAE) would otherwise OOM. --offload {model,sequential,none} controls the strategy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds single-file GGUF support for Wan 2.2 transformers, the path that makes A14B usable on consumer GPUs (~7 GB/expert at Q4_K_M instead of ~28 GB at bf16). Probe (configs/main.py): - New helpers: _has_wan_keys (Wan vs Qwen/FLUX/Z-Image fingerprint via condition_embedder.text_embedder.linear_1 + patch_embedding); _detect_wan_gguf_variant (16ch -> A14B, 48ch -> TI2V-5B from patch_embedding.weight.shape[1]); _detect_wan_gguf_expert (filename heuristic for high_noise / low_noise / none). - Main_GGUF_Wan_Config(base=Wan, format=GGUFQuantized, variant, expert). Tolerates the ComfyUI 'model.diffusion_model.' / 'diffusion_model.' prefixes via _has_wan_keys' multi-prefix scan. - Registered in factory.py. Loader (model_loaders/wan.py): - WanGGUFCheckpointModel mirrors the QwenImage GGUF pattern: gguf_sd_loader -> strip ComfyUI prefix -> auto-detect arch from state dict shapes (num_layers, inner_dim, ffn_dim, text_dim, in_channels, num_heads = inner_dim/128) -> init_empty_weights + load_state_dict(strict=False, assign=True). Loader invocation (wan_model_loader.py): - New 'Transformer (Low Noise)' picker: optional second GGUF for the A14B dual-expert MoE. Auto-swaps if the user wired the experts in the wrong order. Warns when an A14B GGUF is loaded without a paired low-noise expert (single-expert run, degraded quality). - GGUF mains require either a standalone VAE+encoder or a Diffusers Component Source (which can also supply boundary_ratio). - Diffusers main path unchanged (still pulls both experts from transformer/ + transformer_2/). Tests (tests/.../test_wan_gguf_config.py): - 14 tests across key fingerprint, variant detection, expert filename heuristic, and the full probe (A14B high/low, TI2V-5B, GGUF rejection, unrecognised state-dict rejection, explicit override). Total Wan tests: 55 passing (no regressions). FE lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): support QuantStack-style GGUFs and standalone Diffusers VAE The city96 Wan 2.2 GGUF repos have been removed from Hugging Face, leaving QuantStack as the surviving distributor. QuantStack ships the native upstream Wan key layout (text_embedding.0/2, self_attn/cross_attn, ffn.0/2, head.head, head.modulation, ...) rather than the diffusers naming city96 used; biases are stored as F16 rather than BF16; and the standalone Wan VAE installs as a flat AutoencoderKLWan folder which the generic loader rejects. Three fixes: 1. Probe now recognises both diffusers and native key layouts via a new _is_native_wan_layout helper; _has_wan_keys accepts either text-proj fingerprint. 2. GGUF loader converts native -> diffusers keys (mirroring diffusers' convert_wan_transformer_to_diffusers) and unwraps non-quantized GGMLTensors to plain tensors at compute_dtype. The unwrap is needed because conv3d isn't in GGMLTensor's dispatch table, so the F16 patch_embedding bias would otherwise hit conv3d against bf16 latents. 3. VAELoader gains a VAE_Diffusers_Wan_Config branch that loads AutoencoderKLWan directly; the generic path can't handle a flat single-class folder when a submodel_type is provided. Adds 12 tests covering the native layout (probe + converter + unwrap). Verified end-to-end against Wan2.2-T2V-A14B-Q4_K_M from QuantStack: 1095 tensors round-trip key-for-key against WanTransformer3DModel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Probe + config (LoRA_LyCORIS_Wan_Config): - Detects Wan LoRAs in three layouts: diffusers PEFT, native upstream PEFT (ComfyUI), and Kohya (both naming variants). - Anti-pattern guards prevent collisions with Anima (Cosmos DiT q_proj convention), QwenImage (transformer_blocks), Flux (double/single blocks), and Z-Image (diffusion_model.layers). - Optional ``expert: "high" | "low" | None`` field; auto-detected from filename (high_noise / low_noise / hyphenated / concatenated variants). Key conversion (wan_lora_conversion_utils): - Native upstream keys (self_attn/cross_attn, ffn.0/2) -> diffusers (attn1/attn2, ffn.net.0.proj / ffn.net.2). - Strips ``transformer.``, ``diffusion_model.``, ``base_model.model.transformer.`` prefixes from PEFT-style keys. - Kohya layer names mapped through an explicit longest-match table. - Output paths use diffusers naming so the LayerPatcher can resolve them against WanTransformer3DModel parameter paths. Loader integration: - Adds BaseModelType.Wan branch to LoRALoader._load_model. Invocation nodes (wan_lora_loader.py): - WanLoRALoaderInvocation: single LoRA with auto/both/high/low target field. - WanLoRACollectionLoader: list of LoRAs, auto-routed by each LoRA's recorded expert tag. - Output WanLoRALoaderOutput carries the WanTransformerField with updated ``loras`` / ``loras_low_noise`` lists. Denoise integration: - _ExpertSwapper now manages both the model_on_device context and the LayerPatcher.apply_smart_model_patches context per expert. LoRA patches are entered after device load and exited before device release, with fresh iterators per swap. - GGUF (quantized) experts request sidecar patching so GGMLTensor weights aren't touched directly. - Low-noise expert falls back to the primary loras list when ``loras_low_noise`` is empty (matches WanTransformerField semantics). Tests: 81 new tests covering probe accept/reject across formats, anti-pattern guards on competing architectures, converter round-trips for all three layouts, invocation target resolution + routing + duplicate guards, and the _ExpertSwapper lifecycle (lora context opens/closes in the right order around the device swap, quantized flag forwards, no-LoRA path skips the patch context, re-entering the same label is a no-op). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): probe Wan LoRA before Anima in the config union Native-PEFT Wan LoRAs (lightx2v's Lightning, most ComfyUI-trained Wan LoRAs) carry keys like ``diffusion_model.blocks.X.cross_attn.k.lora_A.weight``. Anima's probe matches on the bare ``cross_attn``/``self_attn`` substring — it does not require the Anima-specific ``_proj`` suffix nor any of the ``mlp``/``adaln_modulation`` Cosmos DiT markers — so these Wan LoRAs were classified as ``BaseModelType.Anima`` because Anima happened to run first. Reorder the LyCORIS section of ``AnyModelConfig`` so Wan probes first. Wan's probe is strictly more restrictive (it rejects Anima's ``_proj`` attention suffix via the anti-pattern guard added in the previous commit), so Anima LoRAs are still correctly classified after this reorder. Existing users with mis-tagged installs need to delete the affected LoRA records and reinstall. Adds two regression tests: a union-ordering assertion, and a sanity check that demonstrates Anima's probe *would* match Wan native keys if asked directly — pinning the constraint that motivates the ordering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(i18n): add Wan2.2 T5 Encoder model-manager label The frontend source already references ``modelManager.wanT5Encoder``; the locale key was added with a casing typo (``want5Encoder``). Fix the key so the Wan T5 Encoder model type renders its display name correctly in the model manager UI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-implementation after the first attempt — which used CLIP-vision conditioning — was reverted. Wan 2.2 I2V-A14B does NOT use a CLIP-vision encoder (the Diffusers repo ships ``image_encoder: [null, null]`` in ``model_index.json``); instead it conditions on a reference image by VAE-encoding it and concatenating the resulting latents (plus a first-frame mask) to the noise latents along the channel dim. The I2V transformer therefore has ``in_channels=36`` (16 noise + 16 ref-image latents + 4 mask) vs ``in_channels=16`` for T2V. Taxonomy: - Re-adds ``WanVariantType.I2V_A14B``. Probes: - Diffusers: ``_detect_wan_variant`` reads ``transformer/config.json::in_channels``; 36 → I2V_A14B, 16 → T2V_A14B (both share the dual-expert layout). - GGUF: ``_detect_wan_gguf_variant`` recognises ``in_channels=36`` from the patch_embedding tensor shape and emits I2V_A14B. Backend extension (``backend/wan/extensions/wan_ref_image_extension.py``): - ``preprocess_reference_image`` resizes + normalises to a 5D pixel tensor. - ``encode_reference_image_to_condition`` VAE-encodes the image and stacks a 4-channel first-frame mask on top, producing the ``[1, 20, 1, H/8, W/8]`` condition tensor the denoise loop consumes. - Mirrors diffusers ``WanImageToVideoPipeline.prepare_latents`` with ``num_frames=1`` and ``expand_timesteps=False``. Invocation node (``wan_ref_image_encoder.py``): - "Reference Image - Wan 2.2": image + VAE + width/height pickers. - Output ``WanRefImageConditioningField`` carries the condition tensor name plus the dimensions used (so the denoise step can validate dim parity). Denoise integration: - ``WanDenoiseInvocation`` gains an optional ``ref_image`` field. - Variant gate: rejects ref_image on T2V_A14B and TI2V-5B with a clear error before doing any work. - Dimension gate: rejects ref-image width/height mismatch vs denoise. - At every transformer call, concatenates the 20-channel condition tensor to the 16-channel noise latents along the channel dim before passing to the transformer (giving the 36-channel input I2V expects). Tests: 14 new across the probe, the extension, and the denoise loop. The synthetic ``_ZeroTransformer`` test stand-in now mirrors the real I2V transformer's ``in_channels=36, out_channels=16`` asymmetry by slicing its zero output back to 16 channels when the input is 36-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): derive GGUF out_channels from proj_out shape (I2V support) The GGUF loader was setting ``out_channels = in_channels`` which is wrong for Wan 2.2 I2V-A14B: that variant has ``in_channels=36`` (16 noise + 16 ref-image latents + 4 first-frame mask, concatenated by the denoise loop) but ``out_channels=16`` since the transformer only predicts the noise component back. Loading an I2V GGUF would build a transformer with the wrong proj_out shape and crash: RuntimeError: Error(s) in loading state_dict for WanTransformer3DModel: size mismatch for proj_out.weight: copying a param with shape torch.Size([64, 5120]) from checkpoint, the shape in current model is torch.Size([144, 5120]). (144 = 36 * 4, 64 = 16 * 4 — patch_size=(1, 2, 2) → prod=4) Read out_channels directly from the ``proj_out.weight`` shape in the state dict. This is correct for all three Wan 2.2 variants without needing to know the variant in advance. Also tighten the num_layers fallback: T2V_A14B and I2V_A14B share 40 layers; only TI2V-5B has 30. The fallback is rarely hit in practice (the per-block count comes from the state dict scan), but the previous code would have defaulted I2V_A14B to 30 layers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(model): make Anima LoRA probe mutually exclusive with Wan InvokeAI's ``Config_Base.CONFIG_CLASSES`` is a Python ``set``, so iteration order during model probing is non-deterministic across process restarts. First-match-wins ordering in ``AnyModelConfig`` is documentation only — it has no effect on which config is iterated first. Anima's previous probe accepted any state dict containing the substring ``cross_attn`` or ``self_attn``, which collides with Wan's native LoRA key layout (``diffusion_model.blocks.X.cross_attn.q.lora_down.weight``). Both probes accepted Wan native LoRAs (including lightx2v's Lightning T2V and I2V distillations), and the ``matches.sort_key`` tiebreaker only disambiguates by ModelType, not within LoRA configs. So which config "won" depended on dict hash order — sometimes Wan, sometimes Anima. The previous mitigation reordered the AnyModelConfig union to put Wan before Anima. That worked by luck and was inherently fragile. Tighten Anima's probe to require Cosmos-DiT-exclusive subcomponents: ``mlp``, ``adaln_modulation``, or ``_proj``-suffixed attention names (``q_proj``/``k_proj``/``v_proj``/``output_proj``) — none of which appear in any Wan LoRA. Wan native uses bare ``.q``/``.k``/``.v``/``.o`` on ``self_attn``/``cross_attn``, and ``ffn.N``/``ffn.net.N`` instead of ``mlp``. The new strict detectors live alongside the original loose ones so the Anima conversion utility (which runs after probing) still works. Regression tests in ``test_wan_lora_probe_independence.py`` cover: - I2V Lightning V1 (the bug-triggering LoRA), T2V Lightning V2, Wan Kohya and Wan diffusers PEFT layouts — Wan probe accepts, Anima probe rejects. - Anima PEFT and Kohya layouts — Anima accepts, Wan rejects. - A meta-test that runs every LoRA config in CONFIG_CLASSES against the Lightning state dicts and asserts exactly one accepts — this catches ANY future probe collision, not just Wan vs Anima. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): defer expert model loading in _ExpertSwapper to avoid cache thrash The swapper used to take pre-loaded ``LoadedModel`` handles at construction: high_info = context.models.load(self.transformer.transformer) low_info = context.models.load(self.transformer.transformer_low_noise) swapper = _ExpertSwapper(high_info=high_info, low_info=low_info, ...) With dual ~9 GB A14B GGUF experts plus the ~10 GB UMT5-XXL encoder competing for the same RAM cache, the LRU policy frequently dropped one expert by the time the denoise loop swapped into it. The model manager then emitted [MODEL CACHE] Locking model cache entry ... but it has already been dropped from the RAM cache. This is a sign that the model loading order is non-optimal in the invocation code (See ... invoke-ai#7513). and reloaded the weights from disk (~1.2s extra per swap). Refactor the swapper to take the ``ModelIdentifierField`` plus the ``InvocationContext`` and call ``context.models.load(model_id)`` lazily inside ``get()``. Each swap obtains a fresh handle, the LRU window is small, and the warning goes away. Config metadata (used to compute ``is_quantized``) is read upfront via ``context.models.get_config()`` — that's metadata, not weights, so it doesn't put pressure on the cache. Tests: existing swapper lifecycle tests refactored to use a fake context whose ``models.load`` is logged. A new ``test_lazy_load_per_swap_not_upfront`` pins the regression — it asserts ``models.load`` is NOT called at swapper construction, only at first get() per expert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The denoise_mask wiring + RectifiedFlowInpaintExtension integration in wan_denoise.py was put in place during Phase 2/3 alongside the rest of the denoise loop. Phase 8 of the plan was about ensuring this path worked and is locked in by tests. Three new tests under TestWanDenoiseInpaint: 1. test_preserved_region_matches_init_exactly: builds a half/half mask (left = preserve, right = regenerate in user-side convention), runs full denoise with the synthetic zero-output transformer, and asserts the preserved half of the final latents equals the init exactly while the regenerated half does not. Pins the mask-inversion + per-step merge behavior. 2. test_inpaint_requires_init_latents: a mask without init latents must raise a clear ValueError — the merge has nothing to weld back to. 3. test_no_mask_path_is_unchanged: regression that adding the inpaint extension didn't perturb the non-inpaint codepath (with init latents + denoising_start=0.5 but no mask, the loop just runs img2img). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(frontend): add I2V_A14B to Wan variant zod enum + manager label Phase 7 added the I2V_A14B backend variant. The frontend's zod enum (features/nodes/types/common.ts:zWanVariantType) and the model manager's variant-label map (features/modelManagerV2/models.ts) were still on the two-variant list, so: - ModelIdentifierField inputs with ui_model_variant filters on Wan couldn't list I2V models. - The model manager UI showed a raw 'i2v_a14b' string instead of the human label. Phase 9 (full linear-view wiring — type guards, hooks, params slice, graph builder, tab UI) is in progress on a follow-up commit; this lands the two small enum fixes first so the I2V probe / install paths work correctly end-to-end with the existing FE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the minimum frontend wiring needed to generate Wan 2.2 images from the linear view: - buildWanGraph.ts (new): text-to-image graph (model_loader → text_encoder × 2 → denoise → l2i). Diffusers main model only — transformer, VAE and UMT5 encoder all resolve from the same repo, so no Wan-specific params slice fields are required yet. CFG-skip branch when guidance_scale ≤ 1.0. - useEnqueueGenerate / useEnqueueCanvas dispatchers: route base === 'wan' to buildWanGraph. - graph/types.ts: add wan_l2i / wan_i2l / wan_denoise / wan_model_loader to the relevant node-type unions. - addTextToImage / addImageToImage: include wan_denoise / wan_l2i so width/height are wired correctly and the txt2img helper accepts the Wan l2i node. - isMainModelWithoutUnet: include wan_model_loader (Wan has no UNet, same as the other modern bases). - metadata.py: add wan_txt2img / wan_img2img / wan_inpaint to the generation_mode enum (img2img / inpaint pieces land next). - schema.ts: regenerated to pick up the metadata enum + new Wan invocations. Pieces left in Phase 9: params slice (standalone VAE / T5 / GGUF low-noise / LoRA / ref-image fields + selectors), img2img + I2V + inpaint branches in the graph builder, and Wan-specific UI components. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): Phase 9 piece #2 - GGUF support and CFG-Low control in linear view Adds the three Wan-specific params + UI controls that gate GGUF workflows plus a separate low-noise CFG slider for A14B users. Params slice: - wanTransformerLowNoise (the second-expert GGUF for A14B) - wanComponentSource (Diffusers Wan model providing VAE + UMT5-XXL when the main is a GGUF) - wanGuidanceScaleLowNoise (optional separate CFG for the low-noise expert; null = fall back to the primary CFG) Plus a `selectIsWan` selector for accordion gating. UI components: - ParamWanModelSelects.tsx (Advanced accordion): two model pickers — Transformer (Low Noise) filtered to Wan GGUF mains, and VAE/Encoder Source filtered to Wan Diffusers mains. Mirrors the ParamQwenImageComponentSourceSelect structure. - ParamWanGuidanceScaleLowNoise.tsx (Generation accordion): slider + number input with an "auto" indicator when cleared. Default 3.5 matches the diffusers reference 4.0 / 3.0 split. Wiring: - Generation accordion: ParamWanGuidanceScaleLowNoise shown when base is wan, scheduler excluded for wan (same pattern as Anima/Qwen). - Advanced accordion: ParamWanModelSelects shown when base is wan, and Wan excluded from the SD-family VAE/CFG-rescale blocks. - buildWanGraph.ts: forwards the three new params to the model loader and denoise nodes (transformer_low_noise_model, component_source, guidance_scale_low_noise) and adds them to the graph metadata. Hooks/types: - useWanDiffusersModels + useWanGGUFModels in modelsByType.ts. - isWanDiffusersMainModelConfig + isWanGGUFMainModelConfig type guards. - Three new locale strings (wanComponentSource, wanTransformerLowNoise, wanGuidanceScaleLowNoise[Auto]). GGUF workflow now works end-to-end in the linear view: pick a Wan GGUF main, set Transformer (Low Noise) to the paired second-expert GGUF, set VAE/Encoder Source to any Diffusers Wan repo (TI2V-5B is convenient at ~12 GB) — generate produces an image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): UX polish on the Wan linear-view controls Bundles four small fixes applied during a usability review of the Wan linear-view section (piece #2): 1. **Filter Main vs Transformer (Low Noise) dropdowns by expert tag.** The Wan GGUF probe records each file's ``expert`` field (``"high"`` / ``"low"`` / ``"none"``) via filename heuristic. - ``MainModelPicker``: hides ``expert === 'low'`` Wan GGUFs so users can't accidentally wire a low-noise expert as the primary main. - Transformer (Low Noise) picker (``useWanGGUFLowNoiseModels``): shows ``expert === 'low'`` Wan GGUFs only. Diffusers Wan mains and TI2V-5B aren't affected — they don't carry the ``expert`` field on their config schema. The backend's auto-swap safety net stays in place. 2. **Match the primary CFG slider's range.** The Wan low-noise CFG slider was constrained to 1–10 while the primary CFG ranges 1–20. With the diffusers reference 4/3 split, the low-noise slider thumb sat noticeably further right than the primary — visually misleading. Both sliders now share the 1–20 range with marks at [1, 10, 20]. 3. **Label fits the form column.** "CFG (Low Noise)" → "CFG (Low)" so the slider fits cleanly next to its label instead of overlapping. 4. **Indicator state for the low-noise CFG slider.** Replaced the inline "(auto)" / "(same as cfg)" text — which kept overlapping the slider regardless of how short the label got — with an X-only reset button that's only visible when the user has set an explicit value. Absence of the X conveys auto/fallback state without any text overhang. 5. **Friendlier Transformer (Low Noise) placeholder.** "Second-expert GGUF for A14B (pair with the high-noise main)" → "Add for full detail" — concise nudge for users who haven't paired the second expert yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): Phase 9 piece #3 - linear-view img2img branch Adds Wan 2.2 image-to-image to the linear view, mirroring the Qwen Image pattern. The mode switches on the canvas state — pure-prompt runs go through addTextToImage as before; canvas runs with an init image go through addImageToImage which wires a fresh wan_i2l (Image to Latents - Wan 2.2) node between the init image and the denoise's `latents` input, honoring the existing denoise_start slider. buildWanGraph: - Drops the txt2img-only guard, branches on generationMode. - img2img: spins up a wan_i2l node and hands it to addImageToImage alongside the existing denoise / l2i / modelLoader (as vaeSource). - inpaint / outpaint still fail loudly — pieces #4-#6. graphBuilderUtils.getDenoisingStartAndEnd: - Adds 'wan' to the simple-linear case (denoising_start = 1 - denoisingStrength). Note: Wan's flow-matching schedule is "sticky" on the init compared to SDXL — users will likely need denoisingStrength ≥ 0.7 to see substantial change, matching the user-found 0.15-0.3 denoising_start sweet spot from earlier img2img testing. We may revisit this with an exponent rescale (like FLUX uses) if the response curve feels off. addImageToImage: - Adds 'wan_i2l' to the i2l-node-type union so the Wan i2l can be threaded through the shared helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): add wan_denoise to addImageToImage/addInpaint/addOutpaint type checks Three sibling graph-helper utilities had the same modern-base list as addTextToImage did, and the buildWanGraph img2img branch tripped one of them at canvas-Generate time: error [generation]: Failed to build graph {name: 'Error', message: 'Wrong assertion encountered'} The else-branch in each helper assumes 'denoise_latents' (the SD1.5/SDXL legacy path) and asserts that — failing for any modern base not listed above the branch. addTextToImage was already updated in Phase 9 piece #1; this catches the parallel cases that the img2img/inpaint/outpaint flows go through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): Phase 9 piece #4 - linear-view inpaint and outpaint branches Wires Wan 2.2 inpaint and outpaint through the existing addInpaint / addOutpaint helpers. The backend's RectifiedFlowInpaintExtension was plumbed into wan_denoise.py back in Phase 8 (commit ab54617); this just connects the FE. buildWanGraph: - generationMode === 'inpaint' → spin up a wan_i2l, call addInpaint with denoise + l2i + modelLoader (used as both vaeSource and modelLoader since the Wan model loader carries the VAE). - generationMode === 'outpaint' → parallel branch with addOutpaint. addInpaint: - i2l-node-type union now includes 'wan_i2l' (the addImageToImage and addOutpaint type unions already do — different union shapes). metadata.py: - generation_mode literal adds "wan_outpaint" alongside the existing wan_txt2img / wan_img2img / wan_inpaint entries. isMainModelWithoutUnet already includes wan_model_loader (Phase 9 piece create_gradient_mask when Wan is the main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): Phase 9 piece #5 - linear-view I2V branch (raster as reference image) Wan 2.2 I2V-A14B models condition on a reference image whose VAE-encoded latents are concatenated to the noise along the channel dim each step (in_channels=36 on the I2V transformer). In the linear view this maps cleanly onto the existing canvas raster layer: pick an I2V model, drag an image to raster, generate. buildWanGraph: - Fetch the modelConfig early so the variant gate (i2v_a14b vs the rest) can drive the branch shape instead of being a post-hoc check. - I2V + txt2img: fail loudly ("Switch to the canvas tab and drag an image to the raster layer"). I2V models won't produce useful output without a reference, and the backend would crash trying to concatenate a missing condition tensor. - I2V + img2img: pull the raster image via the canvas compositor, wire it through a wan_ref_image_encoder (which VAE-encodes it and builds the 4-mask + 16-latent condition tensor backend-side), then feed the result into denoise.ref_image. Denoise runs from fresh noise (denoising_start=0, no init_latents) — the ref image is cross-attention/concat conditioning, not a noise-trajectory anchor. - I2V + inpaint/outpaint: fail clearly. Combining ref-image conditioning with a denoise mask is conceptually possible but the backend interaction hasn't been validated end-to-end. metadata.py: - Adds "wan_i2v" to the generation_mode literal so the metadata field on I2V renders correctly. T2V flows (txt2img / img2img / inpaint / outpaint) are unchanged for non-I2V Wan variants (T2V-A14B and TI2V-5B). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): enforce multiple-of-16 dimensions to match transformer patch grid Wan 2.2's transformer has ``patch_size=(1, 2, 2)``: it patch-embeds with stride 2 then un-patches by 2. Combined with the VAE's 8x spatial scale, canvas H/W must be a multiple of ``8 * 2 = 16`` — not just 8 — for the patch round-trip to land exactly. Otherwise the latents and noise prediction disagree by one in the spatial dim and the scheduler step fails: RuntimeError: The size of tensor a (147) must match the size of tensor b (146) at non-singleton dimension 3 (here latent_w=147 → patch_w=73 → un-patched_w=146 ≠ 147) This was silent for T2V at 1024x1024 (already a multiple of 16) but fired for I2V at non-multiple-of-16 canvas sizes. Fixes: - ``optimalDimension.getGridSize``: Wan moves from the default 8 case to the multiple-of-16 case (alongside flux / sd-3 / qwen-image / z-image which have the same patch arithmetic). The canvas bbox UI now snaps Wan dimensions to multiples of 16. - ``wan_denoise.py`` and ``wan_ref_image_encoder.py``: bump width/height ``multiple_of`` from 8 to 16. Defense-in-depth — workflow-editor users won't be able to send a non-16-aligned dim either. Existing backend tests (23 passing) still hold — 1024 is divisible by 16 so the test fixtures didn't exercise the off-by-one path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): show negative prompt box in Wan linear-view Wan was missing from SUPPORTS_NEGATIVE_PROMPT_BASE_MODELS, so the linear-view negative-prompt input was hidden even though the Wan denoise node already wires negative conditioning when CFG > 1 (buildWanGraph.ts:67-75). Adds 'wan' to the list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): Phase 9 piece #6 - Wan LoRA collection in linear view Adds Wan LoRA wiring to buildWanGraph, mirroring the Qwen Image pattern. The shared LoRASelect / LoRAList UI in the linear view already filters LoRAs by the selected main model's base, so Wan LoRAs surface automatically when a Wan main is picked — no UI changes needed. addWanLoRAs (new): - Filters state.loras.loras to enabled Wan LoRAs. - For each LoRA: spawns a ``lora_selector`` node and threads it through a single ``collect`` collector. - Routes the collector into a ``wan_lora_collection_loader`` which sits between modelLoader and denoise — modelLoader.transformer → loader, then loader.transformer → denoise (rerouting the original modelLoader → denoise edge). - Emits per-LoRA metadata so PNG metadata + workflow restore work. The dual-expert routing (high-noise vs low-noise vs untagged) is handled entirely on the backend by ``WanLoRACollectionLoader`` based on each LoRA's recorded ``expert`` tag (set by the probe from the filename heuristic in piece #5 of Phase 5). The FE just hands over the bag of LoRAs; no per-list FE plumbing needed. buildWanGraph: - Calls addWanLoRAs(state, g, denoise, modelLoader) after the base transformer edge is in place. The helper is a no-op when no Wan LoRAs are enabled, so it's safe to call unconditionally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(wan): detect LoRA variant and filter by main model Wan 2.2 A14B (inner_dim=5120) and TI2V-5B (inner_dim=3072) LoRAs are not interchangeable — applying one against the wrong main model crashes the layer patcher with a tensor-shape error (e.g. A14B Lightning on TI2V-5B mains produced ``shape '[3072, 3072]' is invalid for input of size 26214400``). Probe Wan LoRAs' inner-dim at install time and record the family on a new ``variant`` field (``a14b`` / ``5b`` / null). The LoRA picker in the linear view hides incompatible variants when the user selects a main, and the graph builder filters any still-enabled mismatches at submit time with a warning. Untagged LoRAs (probe couldn't identify) pass through so they aren't silently hidden. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(wan): ref-image panel, GGUF readiness, and auto-default sources Wan 2.2 I2V now uses the global Reference Images panel (same UX as Qwen Image Edit and FLUX.2 Klein) instead of pulling the conditioning image from a canvas raster layer. Adds: - WanReferenceImageConfig zod type + isWanReferenceImageConfig guard; integrated into the ref-image discriminated union, settings panel, layer hooks, and validators. - 'wan' added to SUPPORTS_REF_IMAGES_BASE_MODELS, but the panel only shows for the i2v_a14b variant (T2V and TI2V-5B don't consume ref images, so the panel is hidden for them). - buildWanGraph I2V branch reads the first enabled wan_reference_image from refImagesSlice; the canvas-raster-as-ref path is removed. I2V now only supports txt2img mode (canvas img2img/inpaint/outpaint assert with a clear message). GGUF Wan readiness check: GGUF mains carry only the transformer, so the loader needs a Diffusers Component Source (or standalone VAE + UMT5-XXL encoder) to resolve the VAE and text encoder. Without one, enqueue is now blocked with a clear reason. The low-noise A14B partner expert remains optional (loader falls back to the high-noise expert when it's missing). Adds standalone Wan VAE and Wan T5 Encoder selectors to the Advanced accordion (Qwen pattern). Wires them as vae_model / wan_t5_encoder_model on the wan_model_loader node — backend priority is standalone > diffusers main > component source. Auto-default on Wan selection (so GGUF users don't have to fiddle with Advanced): when the new main is a Wan GGUF, fill the Component Source, standalone VAE, and standalone T5 encoder with first available matches if not already set. Component Source is matched by variant family (A14B GGUF prefers an A14B Diffusers; TI2V-5B prefers a TI2V-5B Diffusers) since the two families use different VAE channel counts (16 vs 48); within A14B, T2V and I2V share VAE/encoder so they're interchangeable as a source. Runs on every Wan selection (including Diffusers -> GGUF switches), only fills empty slots. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wan 2.2 starter pack (selected when the user picks the Wan 2.2 bundle) brings up the minimal-cost path to running A14B T2V end-to-end: - Standalone UMT5-XXL encoder and A14B VAE (so GGUF mains don't need a full Diffusers download for their VAE/encoder sources). - T2V A14B Q4_K_M and Q8_0 GGUF expert pairs (high + low noise). - T2V Lightning V1.1 Seko rank-64 LoRA pair (4-step inference). Additional Wan 2.2 starter models browseable from the model manager: - Full Diffusers T2V A14B, I2V A14B, and TI2V-5B. - I2V A14B Q4_K_M and Q8_0 GGUF expert pairs + Lightning V1 LoRA pair. - TI2V-5B Q4_K_M and Q8_0 GGUFs + the 48-channel TI2V-5B VAE. Each "high noise" GGUF lists its low-noise partner plus the shared VAE and UMT5-XXL encoder as dependencies, so installing one of them pulls in everything the loader needs. QuantStack's HighNoise/LowNoise file naming and lightx2v's high_noise_model/low_noise_model.safetensors are both picked up by the existing filename heuristic in the GGUF probe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> docs(wan): add Wan 2.2 hardware requirements Adds Wan 2.2 A14B (T2V/I2V) and TI2V-5B rows to the hardware requirements table with rough VRAM/RAM guidance per quantization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…one VAE/T5 Wan-specific metadata fields embedded by the graph builder (wan_transformer_low_noise, wan_component_source, wan_vae_model, wan_t5_encoder_model, wan_guidance_scale_low_noise) had no recall handlers in features/metadata/parsing.tsx, so recalling an image's parameters would leave these fields empty. Adds a handler for each that dispatches the matching paramsSlice action and renders a row in the metadata viewer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ships two default workflows in the library, tagged so they appear in "Browse Workflows" under the wan2.2 / text to image / image to image tags: - Text to Image - Wan 2.2: full T2V/TI2V-5B graph (model loader, positive + negative encoders, denoise, l2i). Exposes the five model slots, prompts, steps, dual CFG, and dimensions. - Image to Image - Wan 2.2: I2V A14B graph that adds a wan_ref_image_encoder. Exposes the reference image input plus the standard fields. Both follow default-workflow rules: IDs prefixed with default_, meta.category = "default", and no references to user-installed resources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a parallel video pipeline alongside the existing image pipeline so the gallery can host MP4 alongside PNGs. Implements: - New service modules (parallel to image equivalents): video_records/ record store + sqlite impl video_files/ disk file store (mp4 + first-frame webp thumb) videos/ orchestrating service board_video_records/ board <-> video association - migration_32 creates `videos` and `board_videos` tables - /api/v1/videos/ router: upload, list, get DTO, /full (with HTTP Range so HTML5 <video> seek/scrub works), /thumbnail, /metadata, star/unstar, delete, batch delete, board add/remove - LocalUrlService.get_video_url and SimpleNameService.create_video_name - imageio[ffmpeg] dep for video encode (used in later phases) - Wires all four new services into InvocationServices, dependencies.py, api_app.py, and three test fixtures Verified end-to-end against an in-memory db + tmp output dir: upload, probe, save (file + thumbnail + record), DTO build, list, delete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds /api/v1/gallery/items/ and /api/v1/gallery/items/names returning a unified time-sorted stream of images + videos so the frontend can render them interleaved with a single virtualized query. - gallery_common: GalleryItem discriminated union (kind + name + shared fields + nullable video duration/fps), GalleryItemRef, names result - gallery_default: SqliteGalleryService implements UNION ALL across the images and videos tables, applying identical filters (origin/category/ is_intermediate/board_id/search) to each half; pagination via outer ORDER BY + LIMIT/OFFSET; counts are summed across the two halves - URLs are resolved at row -> DTO conversion time so each item routes to the correct /api/v1/images or /api/v1/videos endpoint - Wired into InvocationServices, dependencies.py, api_app.py, and the three test fixtures Existing /api/v1/images endpoints are unchanged so any non-gallery consumers (queue, recall, metadata workflows) continue to work as-is. Verified e2e: 2 images + 2 videos inserted in alternating order, both list_items and list_item_names return the correct interleaved order; category filter narrows to a single kind; starring an item bumps it to the top when starred_first=True. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the typed API surface and upload integration so videos can be uploaded through the same gallery upload button that handles images. Schema: re-ran pnpm typegen against the running backend to pick up VideoDTO, VideoRecordChanges, GalleryItem, GalleryItemKind, GalleryItemRef, GalleryItemNamesResult and the two new paginated result types. RTK Query (services/api/endpoints/videos.ts) - parallel to images.ts: listVideos, getVideoDTO, getVideoMetadata, getVideoNames, uploadVideo, deleteVideo / deleteVideos, changeVideoIsIntermediate, starVideos / unstarVideos, addVideoToBoard / removeVideoFromBoard. Imperative helpers (getVideoDTO, getVideoDTOSafe, uploadVideo, uploadVideos) and the useVideoDTO convenience hook ride alongside, mirroring the image side. Tag types and invalidation: added Video / VideoList / VideoMetadata / VideoNameList / BoardVideosTotal / GalleryItemList / GalleryItemNameList to the api root. Board-affecting mutations now invalidate the polymorphic gallery list/name caches so videos and images stay coherent once the gallery wiring lands in Phase 4. Added a sibling getTagsToInvalidateForVideoMutation helper. Upload UX: useImageUploadButton.tsx's dropzone now accepts video/mp4, video/webm, video/quicktime alongside the existing image MIMEs. The drop handler splits files into image/video sets and routes each through its own mutation; a new onUploadVideo callback parallels the existing onUpload. Existing image-only callers pass through unchanged. Polymorphic gallery query endpoints + the useGalleryItemDTO hook will land with Phase 4 where they have actual consumers; the schema types they'll need are already in place under @knipignore tags. Verified: pnpm lint (knip, dpdm, eslint, prettier, tsc) all green; pnpm test 1103/1103 pass; live curl against the running dev server uploads an MP4 and serves both the webp thumbnail and the MP4 with a working HTTP Range response (206 + Content-Range). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Videos now appear in the same gallery grid as images, interleaved by created_at. Video thumbnails get a centered play-button badge so they read as videos at a glance; everything else (selection, virtualization, search, paged/virtual gallery views, keyboard nav) is unchanged. Approach: selection state stays `string[]` of names. The kind is recovered from the filename extension (.mp4 = video, anything else = image), which is reliable because the backend's SimpleNameService always emits `<uuid>.png` for images and `<uuid>.mp4` for videos. This sidesteps a 32-file cross-cut from changing the selection shape to a discriminated union, and selection is persist-denylisted so no migration is needed. Frontend: - new isVideoName helper in features/gallery/store/types - new endpoints/gallery.ts (deferred from Phase 3): useGetGalleryItemNamesQuery - new ImageGrid/GalleryItemPlayBadge: centered triangular badge over thumbnail - new ImageGrid/GalleryItemVideoStarIconButton: video-typed star toggle - new ImageGrid/GalleryVideoItem: counterpart to GalleryImage; reuses galleryItemContainerSX, GalleryItemSizeBadge (width/height-only stand-in), selection handling (single/shift/ctrl/cmd); alt-click falls through to a normal select since comparison is image-only - use-gallery-image-names now calls the polymorphic gallery names endpoint and exposes a mixed flat name list (existing callers - paged grid, search, navigation hotkeys - get the same shape) - useRangeBasedImageFetching partitions visible names by extension; images bulk-fetch via the existing getImageDTOsByNames mutation, videos dispatch individual getVideoDTO queries (no batch endpoint yet) - GalleryImageGrid's ImageAtPosition dispatches on isVideoName to render GalleryImage or GalleryVideoItem; star hotkey dispatches to the right star/unstar mutation based on kind - pruned the now-unused useGetImageNamesQuery / isImageName exports Verified: pnpm lint (knip, dpdm, eslint, prettier, tsc) all green; pnpm test 1103/1103 pass; live curl of /api/v1/gallery/items returns 57 polymorphic items with video duration populated and image duration null, /api/v1/gallery/items/names returns matching {kind, name} refs. The useGalleryItemDTO hook is intentionally deferred to Phase 5 where the polymorphic viewer is its first real consumer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Selecting a video now renders a polymorphic preview inside the existing viewer panel: thumbnail with a centered play button by default; clicking play swaps in an HTML5 <video controls autoplay>. Switching to a different item drops the video element back to idle (auto-pauses) and selecting an image again returns to the normal image preview. New components (features/gallery/components/ImageViewer/): - VideoPlayButtonOverlay: large centered play button with hover/shadow, used over the thumbnail in the idle state. - CurrentVideoPreview: idle/playing state machine. Resets on video_name change. The <video> src points at /api/v1/videos/i/.../full which supports HTTP Range, so seek/scrub work natively in the browser. New hook: - common/hooks/useGalleryItemDTO: polymorphic DTO resolver that dispatches between useImageDTO and useVideoDTO based on filename extension (isVideoName). Centralizes the kind-dispatch the viewer and toolbar both need. Wiring: - ImageViewer dispatches on galleryItem.kind to render CurrentImagePreview or CurrentVideoPreview. The compare-image DnD drop target is hidden when a video is selected (comparison is image-only). - ImageViewerToolbar hides the image-specific action row (CurrentImageButtons - load workflow, recall metadata, edit, etc.) and the metadata viewer toggle when a video is selected. The general-purpose ToggleProgressButton stays. Out of scope (per the plan): video deletion from the viewer (use gallery hover icons), video-specific metadata viewer, comparison-mode support for videos. Verified: pnpm lint (knip, dpdm, eslint, prettier, tsc) all green; pnpm test 1103/1103 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pzone The gallery-wide drag-and-drop target lives in FullscreenDropzone, not in useImageUploadButton (which only powers the upload button). It had its own hardcoded image-only zod allowlist that rejected MP4 files with "File type / extension is not supported". - Broaden the zod refines to accept video/mp4, video/webm, video/quicktime, video/x-matroska and the matching extensions - Add isVideoFile helper, split dropped files into image/video sets, and route each set through its own uploader (uploadImages / uploadVideos). Both update their respective RTK caches and invalidate the polymorphic gallery list/names. - Skip the canvas-paste fast-path for single-video drops — the canvas doesn't host videos as layers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a three-item context menu (delete, change board, download) on right-click / long-press of any gallery video item. Mirrors the image context menu's singleton-portal architecture so re-renders stay cheap. New files: - features/gallery/contexts/VideoDTOContext: small React context that scopes the active video DTO to the menu items (parallels ImageDTOContext). - features/gallery/components/ContextMenu/MenuItems/ ContextMenuItemDeleteVideo: window.confirm + deleteVideo mutation. Videos can't be referenced from canvas/nodes/refs, so the image modal's usage analysis is unnecessary; a one-step confirm matches the "minimal" scope. ContextMenuItemDownloadVideo: reuses the existing useDownloadItem hook against videoDTO.video_url / video_name. ContextMenuItemChangeBoardVideo: dispatches videosToChangeSelected and opens the (now polymorphic) ChangeBoardModal. - features/gallery/components/ContextMenu/VideoContextMenu: singleton pattern lifted from ImageContextMenu — registers gallery video elements via a Map; right-click looks up the target node and opens the menu at the cursor. Extended files: - features/changeBoardModal/store/slice: added video_names alongside image_names plus a videosToChangeSelected action. The two arrays are mutually exclusive — setting one clears the other. - features/changeBoardModal/components/ChangeBoardModal: now dispatches the matching video board mutations (add/removeVideoToBoard, plural endpoints don't exist yet so videos move one at a time — the menu acts on a single selection so this is a one-iteration loop). - features/gallery/components/ImageGrid/GalleryVideoItem: registers itself with useVideoContextMenu. - app/components/GlobalModalIsolator: mounts the singleton. Verified: pnpm lint (knip, dpdm, eslint, prettier, tsc) all green; pnpm test 1103/1103 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two new invocation nodes that produce MP4 videos from a Wan 2.2 A14B transformer + VAE, plus the supporting plumbing. New invocations: - WanVideoDenoise (wan_video_denoise) — multi-frame counterpart to WanDenoise. Same per-step logic (CFG, MoE expert swap at the boundary timestep, LoRA patching, scheduler dispatch) — reuses _ExpertSwapper, _resolve_variant, and the scheduler/LoRA helpers from wan_denoise. Difference: the noise tensor has a real temporal dim built from num_frames, and the I2V condition is built across all latent frames (frame 0 conditioned, rest zero). Defaults match the Wan 2.2 reference: 832x480 / 81 frames / 40 steps / CFG 5.0 (high) + 4.0 (low). Inpaint / img2img are out of scope for this first cut. TI2V-5B is rejected; T2V/I2V A14B only. - WanLatentsToVideo (wan_l2v) — VAE-decodes 5D latents to RGB frames via AutoencoderKLWan (T_pixel = (T_lat - 1) * 4 + 1), then encodes an MP4 with imageio[ffmpeg] (libx264, yuv420p for browser compatibility). The temp file is moved into outputs/videos/ via context.videos.save(). Backend shared pieces: - make_noise gains num_latent_frames (default 1, backward compatible). - Added num_latent_frames_for(num_frames, scale=4) helper. - New encode_reference_image_to_video_condition mirrors diffusers' WanImageToVideoPipeline.prepare_latents with last_image=None and expand_timesteps=False: pads the reference image with zero pixel-frames, VAE-encodes the full pseudo-video, normalises, and builds the 4-channel temporal-rearranged first-frame mask. Verified numerically: 21 latent frames for num_frames=81, first latent frame's 4 mask channels = 1, rest = 0. - The existing single-frame encoder is left untouched. Schema / context: - New VideoField primitive (parallel to ImageField) and VideoOutput invocation output (width/height/num_frames/fps/duration/video). - New VideosInterface on InvocationContext with .save(source_path, width, height, duration, fps, ...) returning VideoDTO. Mirrors ImagesInterface — falls back to WithBoard / WithMetadata mixins and embeds the queue item's workflow/graph as a JSON sidecar. - WanRefImageConditioningField now carries num_frames so the denoise nodes can sanity-check the I2V condition. WanRefImageEncoder bumps to v1.1.0 and gains num_frames=1 input (use 81+ for video I2V; the encoder dispatches between the single- and multi-frame helpers). - Image WanDenoise now rejects multi-frame conditions with a clear message pointing at WanVideoDenoise. Verified: pnpm lint (5/5) green; pnpm tests (multiuser auth 122/122 + broader suite via prior runs); numerical shape checks for noise and ref-image condition; end-to-end smoke via VideoService.create. A restart of the InvokeAI server is required to pick up the new invocations in the workflow editor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two new default workflows for the workflow editor 'Browse' modal: - 'Text to Video - Wan 2.2' — model loader -> two text encoders -> wan_video_denoise -> wan_l2v. Exposes prompt, model picks, CFG (high + low), dimensions, frames, fps, and steps. - 'Image to Video - Wan 2.2' — same shape plus a wan_ref_image_encoder feeding the denoise node's ref_image input. Exposes the reference image and the frames field on the ref-image node (must match the denoise node's frames — there is a clear validation error if they diverge, but the starter has them in sync at 81). Both default to the Wan 2.2 reference settings: 832x480, 81 frames @ 16 FPS (~5 s), 40 steps, CFG 5.0 (high expert) + 4.0 (low expert), seeded by a rand_int. Pass the existing _sync_default_workflows validator (id starts with default_, meta.category=default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

run_app.py validates every invocation's return-type annotation against the output-class registry. wan_latents_to_video.py had a stray 'from __future__ import annotations' which made the `invoke()` return annotation a string ('VideoOutput') at runtime. The registry mismatch triggered the unregistered-output warning path, which itself crashed on output_annotation.__name__ because the annotation was a str: AttributeError: 'str' object has no attribute '__name__' The other Wan invocations don't use future annotations — drop the import to match. Verified post-fix: api_app import populates 95 output classes, wan_l2v annotation resolves to the real VideoOutput class and is in the registry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Same graph as 'Text to Video - Wan 2.2' but with two Apply LoRA - Wan 2.2 nodes chained between the model loader and the denoise node, and defaults retuned for the Lightning distillation: 4 steps and CFG 1.0 on both experts (CFG=1 skips the negative-conditioning forward pass entirely, ~20x faster than the 40-step / CFG-5.0 baseline at similar quality). Adapted from a user-saved workflow; cleaned for distribution by stripping the install-specific model/LoRA key bindings (defaults should not bake in local UUIDs), bumping to a fresh default_-prefixed id with meta.category=default, exposing the two LoRA fields (lora + weight) so users can swap LoRAs without diving into the canvas, and flagging the negative-prompt node as unused at CFG=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two new default workflows that wire the Lightning LoRA pair into the T2V and I2V video pipelines for a ~20x speedup: - 'Text to Video - Wan 2.2 Lightning' — model loader -> apply LoRA (high) -> apply LoRA (low) -> text encoders -> wan_video_denoise -> wan_l2v. Defaults to 4 steps and CFG 1.0 (no negative branch). Cleaned-up version of Lincoln's saved Lightning workflow: stripped per-install model/LoRA keys, switched meta.category to 'default' with a default_ id, and exposed both LoRA loaders' lora/weight/ target fields so users can swap LoRAs without diving into the canvas. - 'Image to Video - Wan 2.2 Lightning' — same chain plus a wan_ref_image_encoder (v1.1.0 with num_frames) feeding the denoise ref_image input. Defaults match the non-Lightning I2V starter (832x480, 81 frames @ 16 FPS) but with 4 steps / CFG 1.0. LoRA target defaults to 'auto' so properly-tagged Lightning LoRAs route themselves; both workflow descriptions tell users to set explicit 'high'/'low' targets if their LoRAs are untagged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wan_latents_to_video was passing plugin='pyav' to iio.imwrite, but the runtime only has imageio-ffmpeg installed (no PyAV). The encode step at the very end of generation crashed with: ImportError: The `pyav` plugin is not installed. Use `pip install imageio[pyav]` to install it Switch to plugin='FFMPEG' — backed by the bundled imageio-ffmpeg binary that pyproject already requires via imageio[ffmpeg]. libx264 yuv420p is the FFMPEG plugin's default for .mp4, so the explicit pixel_format is dropped (specifying it just produced a "Multiple -pix_fmt options" warning). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The video VAE decode + MP4 encode tail can take 30-90s on top of the denoise loop, and the toast-style signal_progress() messages don't land in the server log. Add context.logger.info() at: - VAE decode start: latent frame count -> pixel frame count + resolution - MP4 encode start: frames, fps, duration, dimensions - MP4 encode complete: encoded file size - Video saved: final video_name Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After wan_l2v wrote a successful libx264 MP4 to disk, the invocation would hang in DiskVideoFileStorage.save() during the cv2.VideoCapture thumbnail-extraction step. cv2 wheels on this build can't reliably decode our libx264/yuv420p output (most often the wheel was compiled without an h264 decoder, but the failure mode is silent hang rather than a clear error). The net effect: the MP4 ends up in outputs/videos but the queue item never completes, so the frontend spinner spins forever and the gallery doesn't pick up the new entry. Fix: rewrite extract_video_frame and probe_video to try imageio's FFMPEG plugin first (same backend that did the encoding — so reading our own output is guaranteed to work), with cv2 retained only as a fallback for uploaded videos in formats imageio can't decode. Also add fine-grained log lines + exception guards inside DiskVideoFileStorage.save() so a future thumbnail failure can no longer hang the whole save — it now logs a warning and continues, leaving the video record in place even if the thumbnail step errored. With logging at each step (video written, thumbnail written, sidecar written) any future hang will be obvious from the last log line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After wan_l2v wrote its MP4 successfully, the gallery and viewer were never updated: the new video didn't appear and the viewer stayed stuck on the previous "Saving video" progress spinner indefinitely. Root cause: onInvocationComplete.tsx only inspected results for isImageField / isImageFieldCollection. VideoField outputs were silently dropped, so the polymorphic gallery list never invalidated and no auto-switch happened. The viewer therefore kept rendering CurrentImagePreview, whose ImageViewerContext-local $progressEvent / $progressImage atoms intentionally aren't cleared on queue completion when autoSwitch is on — they rely on the new image's DndImage onLoad to clear them, which never fires for a video. Fix: add isVideoField (mirrors isImageField against {video_name}) and plumb video outputs through onInvocationComplete: - getResultVideoDTOs pulls VideoDTOs via getVideoDTOSafe - addVideosToGallery invalidates GalleryItemNameList / GalleryItemList so the polymorphic gallery refetches and the new video shows up - auto-switch dispatches the video name into selection (selection is a polymorphic string[]; useGalleryItemDTO already discriminates by filename extension) The selection change swaps CurrentImagePreview for CurrentVideoPreview, which unmounts the stale progress overlay along with it — so the stuck spinner clears as a side-effect of the auto-switch. Also drops the now-stale @knipignore on getVideoDTOSafe, which has a real consumer now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extracts a single frame from a VideoField input and saves it as a regular ImageDTO via context.images.save, so it appears in the gallery like any other generated image. Primary use case is I2V "shot extension": take the last frame of a Wan-generated clip (default frame_index=-1) and feed it back as the reference image for the next clip, then stitch the MP4s to get videos longer than the model's single-shot frame budget at a given VRAM. Negative frame_index is resolved against the actual decoded frame count via probe_video() rather than passed through to imageio — not all imageio plugins handle index=-1 uniformly, and being explicit lets us emit a precise out-of-range error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Joins two or more videos into a single MP4 with one of three transition modes between consecutive clips: - cut: hard splice, no blending. Total length = sum of inputs. - crossfade: linear A→B dissolve over transition_frames. Each boundary consumes N frames from both surrounding clips, shrinking total length by N per boundary. - fade_through_black: A fades to black, then B fades in. Each boundary consumes N/2 from each side and emits N output frames — total length is preserved. Implementation decodes via imageio's FFMPEG plugin (matching wan_l2v on the encode side) and runs the blends in numpy. All decoded frames are kept in memory at once; fine for the few-hundred-frame I2V chains that motivated this, would want streaming if anyone ever feeds in hour-long uploads. Up-front validation enforces matching dimensions across inputs and checks that each clip has enough frames to spare from its head and tail for the requested transitions — saves a wasted decode pass when the transition window is too wide for one of the clips. Pairs with 'Frame from Video' for I2V shot extension: generate N clips chained via last-frame-as-ref-image, then glue them with a crossfade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The viewer used a chakra <Image src={thumbnail_url}> in the idle (not- playing) state, so once a clip auto-selected after generation the preview snapped from the full-resolution denoise progress image to the small WebP gallery thumbnail upscaled to fit — visibly soft compared to what the user was watching seconds earlier. Switch to a single <video> element that spans both states: - idle: muted, no controls, preload="metadata". With no `poster` attr the browser decodes and shows the video's actual first frame at full resolution (this is the documented HTMLVideoElement default). - playing: same DOM node with controls+audio toggled on, kicked off via ref.play(). No reload between states — the decoded buffer carries over. `key={videoName}` swaps the element cleanly when the user moves to a different clip, dropping any in-progress playback state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous empty_cache() fix (53b2f4d) was insufficient. unlock() only decrements the cache record's lock counter — the weights stay on GPU until the cache's automatic offload decides to free them on the next lock(). That heuristic uses ``torch.cuda.memory_allocated() - working_mem`` to estimate free space, which under-frees when the previous denoise step's workspace activations are still allocated alongside the just-unlocked expert. The user-visible symptom was a log line like Loaded model '...:transformer' onto cuda device in 0.37s. Total model size: 9203.13MB, VRAM: 2381.18MB (25.9%) for the incoming low-noise expert, while the high-noise expert continued to hold ~9 GB of VRAM. The swapper now stashes the LoadedModel info handle and, on each swap, explicitly invokes ``cached_model.full_unload_from_vram()`` on the outgoing expert before locking the incoming one. This sidesteps the heuristic and guarantees the previous expert's weights leave GPU before partial_load_to_vram measures available room. The access path ``info._cache_record.cached_model`` reaches into a private attribute — there is no public LoadedModel API for "unload from VRAM but keep in RAM" today, and a broader backend refactor felt out of scope. The call is wrapped in getattr/try-except and pinned by a regression test so a future refactor breaks the test, not the swap. Tests: - Updated existing dual-expert lifecycle test to expect the new full-unload step in the swap log sequence. - New test_outgoing_expert_force_unloaded_from_vram covers the per-swap behavior (outgoing only, no initial unload). - New test_force_unload_failure_does_not_break_swap pins the defensive fallback so swap reliability survives a future LoadedModel refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GalleryImage's modifier-key click handler was reading the legacy imagesApi getImageNames cache to compute range-selection indices, but the gallery grid was switched to the polymorphic galleryApi getGalleryItemNames endpoint (the only source that includes videos). The legacy cache is no longer populated for the grid, so the ordered-name list came back empty and the handler fell into its "no names cached" early-return: if (imageNames.length === 0) { if (!shiftKey && !ctrlKey && !metaKey && !altKey) { dispatch(selectionChanged([imageName])); } return; } making shift- and ctrl-click no-ops. GalleryVideoItem already had the correct reader inlined as a private helper. Hoist it to a shared module (features/gallery/store/selectCachedGalleryItemNames) so both grids use the polymorphic cache, and update GalleryImage to call it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refresh of the OpenAPI-derived TypeScript bindings against the current backend. No hand edits — this is the output of the typegen step re-run against the Wan video routes and recent backend changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Set TOKENIZERS_PARALLELISM=false at startup (via os.environ.setdefault so users can override) before any HF library is imported. The Rust ``tokenizers`` library warms a thread pool the first time a tokenizer runs — for us that's UMT5 / T5 text encoding during Wan / FLUX / SD3 conditioning. Every subsequent fork() then logs huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... In video generation we fork on every MP4 encode (imageio's FFMPEG plugin uses subprocess.Popen → fork+exec), so this warning lands once per generation in the server log. The advisory is benign — the child correctly falls back to single-threaded tokenization before exec(), and the parent's thread pool is unaffected — but the noise obscures real warnings. Setting the env var before any HF import prevents the thread pool from warming up at all, so the fork detector stays quiet without sacrificing anything: tokenization happens once per generation and isn't a hot path for us. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to 2106f10 — the previous attempt set the env var inside ``run_app()``, which races against any transitive HF import triggered by the console-script's from invokeai.app.run_app import run_app If ``tokenizers`` is imported anywhere in that import chain (directly or via diffusers/transformers re-exports), the library's fork detector registers before our setdefault runs and the warning still fires. Move the setdefault to module level so it executes the instant ``run_app.py`` is loaded — i.e. before the function defs are even parsed, and well before any HF library has a chance to import. Note for testing: jurigged hot-reload only re-runs function bodies, so picking up this fix requires a full server restart, not just a file save under ``--dev-reload``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Use the in-app delete-confirmation dialog (the same Chakra ConfirmationAlertDialog the image flow uses) instead of the browser's window.confirm() prompt. Matches the visual + interaction language of the rest of the gallery and picks up the shared ``shouldConfirmOnDelete`` system preference — flipping the "Don't ask me again" toggle now silences the prompt for both images and videos. Implementation mirrors features/deleteImageModal/ but trimmed: the image dialog computes "usage" (canvas layers, node fields, reference images, upscale source) so the user knows what they'll break. Videos have no analogous attachment points, so the video state machine is a straight confirm-then-delete with no usage analysis. - features/deleteVideoModal/store/state.ts — nanostores atom + an awaitable ``deleteVideosWithDialog`` that opens the dialog and resolves/rejects on confirm/cancel. Skips the dialog entirely when shouldConfirmOnDelete is off. - features/deleteVideoModal/components/DeleteVideoModal.tsx — ConfirmationAlertDialog with the new deleteVideoPermanent message and the shared "Don't ask me again" switch. - GlobalModalIsolator.tsx — mount the new modal alongside DeleteImageModal. - ContextMenuItemDeleteVideo.tsx — call useDeleteVideoModalApi().delete instead of window.confirm + useDeleteVideoMutation. - en.json — added gallery.deleteVideoPermanent, dropped the now-unused gallery.deleteVideoConfirmation. - videos.ts — useDeleteVideoMutation moves into the @knipignore export group since the only call site now uses videosApi.endpoints.deleteVideo.initiate via the modal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The gallery grid subscribes to the polymorphic ``getGalleryItemNames`` RTK Query endpoint (so images and videos interleave by created_at). But ``onInvocationComplete``'s image path only did an optimistic insert into the image-only ``getImageNames`` cache, leaving the polymorphic cache stale — a freshly-generated image landed correctly in board totals and the per-DTO cache, but never showed up in the grid until the user reloaded the page. Mirror the videos path (which has invalidated these tags since the polymorphic endpoint was introduced) and dispatch ``galleryApi.util.invalidateTags(['GalleryItemNameList', 'GalleryItemList'])`` after image outputs are processed. The cost is one extra HTTP round-trip per generation; a future optimization could optimistically splice the new entry into the polymorphic shape, but that requires a different ``insertImageIntoNamesResult`` for the ``GetGalleryItemNamesResult`` shape and is a bigger change. Regression test in onInvocationComplete.test.ts pins the behavior: verifies the invalidation fires on a fake image complete event, and verifies it does NOT fire for denylisted passthrough node types (load_image, image). Confirmed test correctly fails when the fix is reverted via git stash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Self-review pass before re-pinging external reviewers. Five fixes; the three medium ones have user-visible consequences, the two low ones are guard + docstring. 1. videos.py: delete_video no longer swallows service errors into a misleading HTTP 200. Missing DTO -> 404, delete failure -> 500. The prior shape returned 200 with an empty deleted_videos list, which the frontend treated as success, dropped from cache, and left the video on disk — silent data-consistency failure visible only on next page reload. 2. videos.ts: starVideos / unstarVideos invalidate the LIST_TAG-scoped { type: 'VideoList' } entry alongside the per-video and board-affecting tags. Without this, starred_first=true gallery queries kept the just-starred video in its old position until the next list-affecting mutation. Mirrors the delete + upload pattern. 3. wan_denoise.py: _ExpertSwapper.get() stashes _active_device_ctx right after device_ctx.__enter__() succeeds, before attempting the LoRA patcher's __enter__. If the LoRA enter raises, _release() can now actually find the device context and exit it — previously the ctx was unreachable and 8-9 GB of GGUF expert weights stayed pinned to GPU until the model cache LRU evicted them. 4. wan_ideal_dimensions.py: reject sources whose longer side is below the 16-px Wan grid. The downstream max(w, 16) clamp would otherwise silently disconnect the output from the requested aspect ratio (returning 16×16 regardless of the source's actual shape). 6. wan_video_denoise.py: docstring now explains the deliberate absence of denoising_start / denoising_end / initial-latents inputs (video i2v uses reference-frame conditioning, not noise injection; the image denoise node still handles still-image img2img). Tests: - test_device_context_released_when_lora_enter_raises pins #3. - test_input_smaller_than_pixel_grid_rejected pins #4. - test_output_dims_never_zero renamed to test_smallest_valid_input_still_snaps_to_16_grid (now exercises 16×16 rather than 8×8 since the latter is now correctly rejected). All 58 affected backend tests pass, frontend lint clean. Audit note for the PR description (NOT a fix): delete_video's _assert_video_owner permits write access on public boards (mirroring the image router's _assert_image_owner — intentional symmetry). The stricter _assert_video_direct_owner is reserved for board-move ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…video-support' into lstein/feature/wan-video-support

lstein · 2026-05-14T21:04:26Z

TI2V model support is not working properly, so I have converted into a draft while making this fix.

Working now and ready for review.

Pfannkuchensack · 2026-05-14T21:30:08Z

Findings

Medium: invokeai/app/api/routers/boards.py:112-148 — delete_board with include_images=True cascades into videos.delete_videos_on_board(board_id=board_id), which deletes EVERY video on the board regardless of the original uploader. Trigger: a user owns a board with visibility Public (or Shared); other users upload their own videos to that board via the public-board write fallback in invokeai/app/api/routers/videos.py:94-112 and invokeai/app/api/routers/videos.py:175-184. When the board owner later deletes the board with include_images=True, the cascade in invokeai/app/services/videos/videos_default.py calls delete_videos_on_board which iterates ALL board_videos rows for that board_id and unconditionally deletes the records and files — there is no per-record user_id check. Result: the board owner can destroy other users' videos at will by deleting a public/shared board, even though they could not delete those individual videos directly without going through the board-owner branch. This mirrors a pre-existing image pattern, but the branch is the moment that anti-pattern crosses over to videos. The single multiuser test for this path (tests/app/routers/test_boards_multiuser.py:692) only asserts the cascade was invoked; it does not assert it respects per-video ownership. To expose this issue, add a test that has user2 upload a video to a public board owned by user1, has user1 delete that board with include_images=True, and verifies user2's video was NOT destroyed.
Low (i18n): invokeai/frontend/web/src/features/gallery/components/ImageGrid/GalleryItemVideoStarIconButton.tsx:27 — tooltip={videoDTO.starred ? 'Unstar' : 'Star'} ships raw English to the user. Every other star/unstar control in the codebase uses a translation key (e.g. images use gallery.starImage / gallery.unstarImage). Since invokeai/frontend/web/public/locales/en.json already defines the image variants, the fix is either to reuse them or to add gallery.starVideo / gallery.unstarVideo. To expose this issue, prefer adding a vitest assertion on the translation key used by the button if the logic is extractable; otherwise note manual verification for non-English locales.
Low: invokeai/app/api/routers/videos.py:274-296 — delete_videos_from_list re-raises any HTTPException mid-loop (so a 403 on the second item kills the request) but has already deleted earlier items by then. The 403 response body carries no list of deleted_videos, so the frontend cache never learns about the partial deletes — the deleted items reappear in the UI on next refresh and the user repeats the operation, only to be told it succeeded the second time. This mirrors the pre-existing images pattern at invokeai/app/api/routers/images.py:548-575. To expose this issue, add a test that POSTs [owned_video_a, foreign_video_b, owned_video_c] from a non-admin user, asserts the response is 403, and asserts that owned_video_a was nevertheless deleted on disk and in video_records — proving the data/UI drift.

Open Questions

invokeai/app/api/routers/videos.py:54-75 — _assert_video_owner returns success for any user as long as the video sits on a Public board. That is the same semantics images use, so it is consistent, but it means a non-owner can DELETE /v1/videos/i/{name}, PATCH it, star/unstar it. The pattern is documented as intentional in invokeai/app/api/routers/images.py:40-72. Question: is this still desired for videos given they are typically far more expensive to regenerate than images? If yes, leave as is. If no, the fix is narrow.

Residual Risk / Verification

I did not exercise the upload_video route against malformed MP4 containers; _is_accepted_video_upload accepts purely on content_type OR .mp4 extension, then trusts probe_video to fail loudly on non-decodable content. Worth manual verification with a renamed-but-not-MP4 payload to confirm the rejection path returns 415 and unlinks the tmp file.

The video gallery context menu only operated on the single right-clicked item, so selecting multiple videos and hitting the trash icon deleted just the first one. Adds a video-side multi-selection menu mirroring the image one for star/unstar/download/change-board/delete, switched in on selectionCount > 1. Each menu now filters the polymorphic selection to its own kind and labels the action with an explicit count + kind (e.g. "Delete 3 Videos", "Move 2 Images to Board"). The destructive items disable when the kind-filtered subset is empty, so a video-only selection greys out the image menu and vice versa. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pfannkuchensack · 2026-05-14T22:36:34Z

"ValueError: Reference-image dimensions (1024x1024) must match denoise dimensions (1024x480)." That error loops forever with loading state.

Pfannkuchensack · 2026-05-14T22:49:44Z

"ValueError: Reference-image condition has 1 latent frames but the denoise loop expected 21. Ensure the ref-image encoder was called with the same num_frames (81)."
Both has the same Frames (81) setting.
Edit: After not using 1024x1024 it is working.

lstein · 2026-05-15T02:23:47Z

invokeai/app/api/routers/videos.py:54-75 — _assert_video_owner returns success for any user as long as the video sits on a Public board. That is the same semantics images use, so it is consistent, but it means a non-owner can DELETE /v1/videos/i/{name}, PATCH it, star/unstar it. The pattern is documented as intentional in invokeai/app/api/routers/images.py:40-72. Question: is this still desired for videos given they are typically far more expensive to regenerate than images? If yes, leave as is. If no, the fix is narrow.

I think that the behavior of images and videos should be consistent: either both can be deleted from public boards by non-owners, or neither. I'm inclined to maintain the read/write behavior of public boards for the time being, and revisit this after some user experience/feedback.

…ings Finding 1 (Medium): delete_board cascade ignored per-video / per-image ownership, letting a board owner destroy other users' contributions to a public/shared board just by deleting the board with include_images=true. Adds user_id filtering through get_all_board_*_names_for_board and delete_*_on_board (base + sqlite + image wrapper). Non-admin requests pass the requester's id so the SQL WHERE clause narrows the cascade to that user's rows; admins still pass None for the unrestricted path. Other users' content cascades to "uncategorized" via the existing FK on board_videos / board_images. Finding 2 (Low, i18n): GalleryItemStarIconButton and GalleryItemVideoStarIconButton shipped raw English "Star"/"Unstar" tooltips. Both now use the gallery.starImage / starVideo translation keys. Finding 3 (Low): delete_videos_from_list and delete_images_from_list re-raised HTTPException mid-loop, throwing away the response payload for items already deleted before the foreign name was hit. The frontend cache never learned about those partial successes, so deleted records reappeared in the UI until the next manual refresh. Both routes now skip auth-failed items in-loop and return 200 with the partial-success list. Residual: adds a test that an upload with an .mp4 extension but non-decodable bytes (a) reaches probe_video, (b) surfaces 415, (c) unlinks the streamed-to-disk temp file so the server doesn't leak storage on garbage uploads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lstein · 2026-05-15T02:40:38Z

@Pfannkuchensack b09e65c addresses the issues raised in the code review. However I'm still working on the infinite loading loop you discovered.

The committed schema was stale relative to the current server (missing the utilities/expand-prompt and utilities/image-to-prompt endpoints, the ModelRecordOrderBy / SQLiteDirection list params, and the Wan / QwenImage / QwenVLEncoder config variants this branch adds). Regenerated via the same command the new openapi-checks workflow uses so the diff CI is empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extends an existing video by extracting its penultimate frame, running it through Wan 2.2 I2V A14B + the Lightning LoRA pair to generate a new clip, and concatenating the result onto the source with a short crossfade. Cleaned per the default-workflows README: stripped value references on the four model loader fields and both Lightning LoRA fields so the workflow ships without keys/hashes for user-installed resources, gave the LoRA nodes "Apply LoRA (High)" / "(Low)" labels matching the existing Lightning default, remapped six stale exposedFields entries that pointed to template LoRA IDs no longer present in the graph, and synced the wan_video_denoise num_frames default to the value driven by the connected integer node. Tagged with both Text to Video and Image to Video so it surfaces under either filter in the Workflow Library. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…video-support # Conflicts: # invokeai/frontend/web/openapi.json

lstein · 2026-05-15T03:17:28Z

"ValueError: Reference-image dimensions (1024x1024) must match denoise dimensions (1024x480)." That error loops forever with loading state.

Can you describe how to reproduce this bug? I haven't been able to hit it.

- Restore auto-select-on-startup and on-board-switch: the polymorphic getGalleryItemNames endpoint replaced getImageNames as the grid's source of truth, so appStarted and boardIdSelected now wait on / read that cache instead of timing out forever. - Delete-then-select: video delete used to clear selection to null; image delete read a cache that's no longer warmed. Both now snapshot the gallery list before deletion and advance to the adjacent surviving item (prev > next > null) via a shared pickSelectionAfterDelete helper. - Video Viewer: right-aligned action bar with Open in new tab, Copy frame, Download, Delete, and a labelled Close video player button that only appears while playback is active. Copy uses canvas + ClipboardItem since video MIME types aren't supported cross-browser. - Next/prev arrows + galleryNav hotkeys now work when a video is in the Viewer (previously image-only). - Video context menu uses full-width text MenuItems instead of the cramped icon group, and gains an Open in new tab entry.

- Bulk video drag: introduced multipleVideoDndSource so a multi-selection dragged from a video thumbnail moves every selected video, not just the first. The whitelist in useDndMonitor.ts also needed updating — without it the monitor's canMonitor gate silently dropped the new source type. - Mixed selections: both the multi-image and multi-video drag payloads now carry image_names + video_names side-by-side, so dragging from either kind in a mixed selection dispatches addImagesToBoard + addVideosToBoard together. Previously the image side leaked video names into image_names and the image router 404'd on each one. - Bulk video helpers: added addVideosToBoard / removeVideosFromBoard that fan out over the existing singular video router endpoint (no batch endpoint exists yet) — mirrors the change-board modal's existing loop. - Shift-click range selection: selectCachedGalleryItemNames now looks up the cache entry matching the gallery's current query args instead of taking the first entry from selectInvalidatedBy. RTK Query keeps unused entries warm for 60s after a board switch, and the old "first wins" behavior frequently landed on a stale board's name list, making shift-click silently no-op until a delete/move forced a refetch.

Pfannkuchensack · 2026-05-15T19:30:08Z

"ValueError: Reference-image dimensions (1024x1024) must match denoise dimensions (1024x480)." That error loops forever with loading state.

Can you describe how to reproduce this bug? I haven't been able to hit it.

i use the wan i2i workflow and added a 1024x1024 as ref image and had 1024x480 denoise target. The loop loading is something that happens because the error fails to update the state right (i think)

lstein and others added 30 commits May 9, 2026 10:33

chore(frontend): typegen

b9b0665

lstein and others added 11 commits May 13, 2026 23:36

Merge branch 'main' into lstein/feature/wan-video-support

5d82828

chore(frontend): typegen

ca2a2cd

Merge remote-tracking branch 'refs/remotes/origin/lstein/feature/wan-…

aafc871

…video-support' into lstein/feature/wan-video-support

lstein marked this pull request as ready for review May 14, 2026 20:17

Pfannkuchensack mentioned this pull request May 14, 2026

Feat: canvas project boards #9170

Draft

5 tasks

Merge branch 'main' into lstein/feature/wan-video-support

48e9ec0

lstein and others added 2 commits May 14, 2026 22:33

Merge branch 'main' into lstein/feature/wan-video-support

4883850

lstein and others added 3 commits May 14, 2026 22:46

Merge remote-tracking branch 'upstream/main' into lstein/feature/wan-…

15a7657

…video-support # Conflicts: # invokeai/frontend/web/openapi.json

lstein added 2 commits May 15, 2026 10:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Video generation#9163

feat: Video generation#9163
lstein wants to merge 84 commits into
invoke-ai:mainfrom
lstein:lstein/feature/wan-video-support

lstein commented May 13, 2026 •

edited

Loading

Uh oh!

lstein commented May 14, 2026

Uh oh!

Pfannkuchensack commented May 14, 2026

Uh oh!

Pfannkuchensack commented May 14, 2026

Uh oh!

Pfannkuchensack commented May 14, 2026 •

edited

Loading

Uh oh!

lstein commented May 15, 2026

Uh oh!

lstein commented May 15, 2026

Uh oh!

lstein commented May 15, 2026

Uh oh!

Pfannkuchensack commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lstein commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Important notes

Testing plan

Getting Started Hints

Uh oh!

lstein commented May 14, 2026

Uh oh!

Pfannkuchensack commented May 14, 2026

Findings

Open Questions

Residual Risk / Verification

Uh oh!

Pfannkuchensack commented May 14, 2026

Uh oh!

Pfannkuchensack commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lstein commented May 15, 2026

Uh oh!

lstein commented May 15, 2026

Uh oh!

lstein commented May 15, 2026

Uh oh!

Pfannkuchensack commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lstein commented May 13, 2026 •

edited

Loading

Pfannkuchensack commented May 14, 2026 •

edited

Loading