Visual Generation
arXiv, 2026.
Rolling Sink effectively scales autoregressive video synthesis to ultra-long durations (5-30 minutes) at test time, with consistent subjects, stable colors, and smooth motions.
ICLR, 2026. Oral
EditVerse unifies a diverse range of generation and editing tasks for both images and videos within a single, powerful model.
CVPR, 2025.
We demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models.
NeurIPS, 2025.
Jenga accelerates HunyuanVideo by 4.68-10.35x through dynamic attention carving and progressive resolution generation.
CVPR, 2024. Most Influential CVPR Papers (Paper Digest)
Adding a 'Lego' attribute to the child generates an edited video. Powered by a novel video inversion process and cross-attention control. We also find that a Decoupled-Guidance strategy is essential for video editing.
ICLR, 2024.
Rethinking the inversion process. Boosting Diffusion-based Editing with 3 Lines of Code.
Multimodal LLMs
arXiv, 2026.
PS-VAE introduces a semantic-pixel reconstruction objective to regularize the latent space, enabling compression of both semantic information and fine-grained details into a compact representation for SOTA T2I and editing.
CVPR, 2026.
HBridge introduces an asymmetric H-shaped architecture that bridges heterogeneous experts through mid-layer semantic connections, achieving superior unified multimodal understanding and generation with lower training cost.
NeurIPS, 2024. Oral
The slow agent decomposes the task and determines "which actions" to learn. The fast agent writes code and RL configurations for low-level execution.
TPAMI, 2025.
Mining the potential of open-source VLMs! Mini-Gemini is a novel framework covering VLMs from 2B to 34B for high-resolution image understanding. It has impressive OCR capability and can generate high-quality images powered by its multimodal reasoning ability.
Project Frame Forward applies changes across entire videos based on one annotated frame and a simple text prompt, bringing the precision of photo editing to video.
Adobe Firefly Image-to-Video turns static images into animated video clips with AI-powered motion, depth, and cinematic flair.