ECCV 2024 Schedule

SUN 29 SEP

MON 30 SEP

TUE 1 OCT

9 a.m.

Oral 1A: Scene Analysis And Understanding [9:00-10:30]

Orals 9:00-10:20

[9:00] Towards Scene Graph Anticipation

[9:10] OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

[9:20] PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

[9:30] Bi-directional Contextual Attention for 3D Dense Captioning

[9:40] OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

[9:50] ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

[10:00] A Fair Ranking and New Model for Panoptic Scene Graph Generation

[10:10] Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

(ends 10:30 AM)

Oral 1B: Autonomous Driving [9:00-10:30]

Orals 9:00-10:20

[9:00] Making Large Language Models Better Planners with Reasoning-Decision Alignment

[9:10] MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping

[9:20] M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

[9:30] H-V2X: A Large Scale Highway Dataset for BEV Perception

[9:40] Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction

[9:50] DriveLM: Driving with Graph Visual Question Answering

[10:00] RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

[10:10] Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

(ends 10:30 AM)

Oral 1C: Low-Level Vision And Imaging [9:00-10:30]

Orals 9:00-10:20

[9:00] Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

[9:10] Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging

[9:20] SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

[9:30] Photon Inhibition for Energy-Efficient Single-Photon Imaging

[9:40] Minimalist Vision with Freeform Pixels

[9:50] Flying with Photons: Rendering Novel Views of Propagating Light

[10:00] A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

[10:10] GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

(ends 10:30 AM)

1:30 p.m.

Oral 2A: Generative Models I [1:30-3:30]

Orals 1:30-3:20

[1:30] EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

[1:40] TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

[1:50] LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

[2:00] FlashTex: Fast Relightable Mesh Texturing with LightControlNet

[2:10] TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

[2:20] LLMGA: Multimodal Large Language Model based Generation Assistant

[2:30] Accelerating Image Generation with Sub-path Linear Approximation Model

[2:40] SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

[2:50] Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

[3:00] Zero-Shot Detection of AI-Generated Images

[3:10] Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

(ends 3:30 PM)

Oral 2B: Recognition [1:30-3:30]

Orals 1:30-3:20

[1:30] Efficient Bias Mitigation Without Privileged Information

[1:40] Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

[1:50] MobileNetV4: Universal Models for the Mobile Ecosystem

[2:00] Momentum Auxiliary Network for Supervised Local Learning

[2:10] From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

[2:20] Dataset Enhancement with Instance-Level Augmentations

[2:30] Adaptive Parametric Activation

[2:40] Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

[2:50] Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

[3:00] CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

[3:10] On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

(ends 3:30 PM)

Oral 2C: Multi-View And Visual Odometry [1:30-3:30]

Orals 1:30-3:20

[1:30] Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition

[1:40] COMO: Compact Mapping and Odometry

[1:50] Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss

[2:00] ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

[2:10] SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

[2:20] Six-Point Method for Multi-Camera Systems with Reduced Solution Space

[2:30] Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

[2:40] Grounding Image Matching in 3D with MASt3R

[2:50] ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images

[3:00] Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

[3:10] Camera Calibration using a Collimator System

(ends 3:30 PM)

3:30 p.m.

Keynote:

Synthesia: From computer vision research to real-world AI avatars

Lourdes Agapito · Vittorio Ferrari

(ends 4:30 PM)

WED 2 OCT

9 a.m.

Oral 3A: Datasets And Benchmarking [9:00-10:30]

Orals 9:00-10:20

[9:00] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

[9:10] UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

[9:20] Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

[9:30] Parrot Captions Teach CLIP to Spot Text

[9:40] Towards Open-ended Visual Quality Comparison

[9:50] VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

[10:00] Insect Identification in the Wild: The AMI Dataset

[10:10] MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

(ends 10:30 AM)

Oral 3B: Medical And Biological Imaging [9:00-10:30]

Orals 9:00-10:20

[9:00] PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

[9:10] Self-Supervised Video Desmoking for Laparoscopic Surgery

[9:20] CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos

[9:30] Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction

[9:40] Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

[9:50] Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View

[10:00] SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images

[10:10] Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

(ends 10:30 AM)

Oral 3C: Point Clouds [9:00-10:30]

Orals 9:00-10:20

[9:00] HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

[9:10] PointLLM: Empowering Large Language Models to Understand Point Clouds

[9:20] RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

[9:30] DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

[9:40] KeypointDETR: An End-to-End 3D Keypoint Detector

[9:50] Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

[10:00] RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

[10:10] Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

(ends 10:30 AM)

1:30 p.m.

Oral 4A: Neural 3D Rendering [1:30-3:30]

Orals 1:30-3:20

[1:30] Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

[1:40] Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering

[1:50] Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration

[2:00] FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information

[2:10] RaFE: Generative Radiance Fields Restoration

[2:20] Watch Your Steps: Local Image and Scene Editing by Text Instructions

[2:30] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

[2:40] RPBG: Towards Robust Neural Point-based Graphics in the Wild

[2:50] Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

[3:00] Learning 3D-aware GANs from Unposed Images with Template Feature Field

[3:10] MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition

(ends 3:30 PM)

Oral 4B: Video Generation / Editing / Prediction [1:30-3:30]

Orals 1:30-3:20

[1:30] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

[1:40] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

[1:50] Efficient Neural Video Representation with Temporally Coherent Modulation

[2:00] Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

[2:10] Video Editing via Factorized Diffusion Distillation

[2:20] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

[2:30] Audio-Synchronized Visual Animation

[2:40] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

[2:50] MotionDirector: Motion Customization of Text-to-Video Diffusion Models

[3:00] ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model

[3:10] Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

(ends 3:30 PM)

Oral 4C: Humans: Biometrics, Pose And Motion [1:30-3:30]

Orals 1:30-3:20

[1:30] AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

[1:40] Sapiens: Foundation for Human Vision Models

[1:50] POET: Prompt Offset Tuning for Continual Human Action Adaptation

[2:00] Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation

[2:10] SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

[2:20] UGG: Unified Generative Grasping

[2:30] NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

[2:40] Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

[2:50] LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment

[3:00] Controllable Human-Object Interaction Synthesis

[3:10] NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction

(ends 3:30 PM)

3:30 p.m.

Keynote:

Fair, transparent, and accountable AI: What is legally required, what is ethically desired, and what is technically feasible?

Sandra Wachter

(ends 4:30 PM)

THU 3 OCT

9 a.m.

Oral 5A: Segmentation [9:00-10:30]

Orals 9:00-10:20

[9:00] WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

[9:10] AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

[9:20] CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

[9:30] Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

[9:40] Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels

[9:50] ActionVOS: Actions as Prompts for Video Object Segmentation

[10:00] Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

[10:10] Diffusion Models for Open-Vocabulary Segmentation

(ends 10:30 AM)

Oral 5B: Vision Applications [9:00-10:30]

Orals 9:00-10:20

[9:00] Robust Fitting on a Gate Quantum Computer

[9:10] Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views

[9:20] Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

[9:30] MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

[9:40] Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception

[9:50] Faceptor: A Generalist Model for Face Perception

[10:00] A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability

[10:10] COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

(ends 10:30 AM)

Oral 5C: Representation Learning [9:00-10:30]

Orals 9:00-10:20

[9:00] PiTe: Pixel-Temporal Alignment for Large Video-Language Model

[9:10] Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

[9:20] Emergent Visual-Semantic Hierarchies in Image-Text Representations

[9:30] Learning Multimodal Latent Generative Models with Energy-Based Prior

[9:40] Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

[9:50] SINDER: Repairing the Singular Defects of DINOv2

[10:00] Denoising Vision Transformers

[10:10] Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking

(ends 10:30 AM)

1:30 p.m.

Oral 6A: Generative Models II [1:30-3:30]

Orals 1:30-3:20

[1:30] Controlling the World by Sleight of Hand

[1:40] Pyramid Diffusion for Fine 3D Large Scene Generation

[1:50] FMBoost: Boosting Latent Diffusion with Flow Matching

[2:00] ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

[2:10] Exact Diffusion Inversion via Bidirectional Integration Approximation

[2:20] Tackling Structural Hallucination in Image Translation with Local Diffusion

[2:30] Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

[2:40] Adversarial Diffusion Distillation

[2:50] Arc2Face: A Foundation Model for ID-Consistent Human Faces

[3:00] Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

[3:10] OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

(ends 3:30 PM)

Oral 6B: Video Understanding [1:30-3:30]

Orals 1:30-3:20

[1:30] E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

[1:40] Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

[1:50] Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

[2:00] MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

[2:10] C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

[2:20] LongVLM: Efficient Long Video Understanding via Large Language Models

[2:30] Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

[2:40] Towards Neuro-Symbolic Video Understanding

[2:50] Classification Matters: Improving Video Action Detection with Class-Specific Attention

[3:00] DEVIAS: Learning Disentangled Video Representations of Action and Scene

[3:10] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

(ends 3:30 PM)

Oral 6C: Vision And Other Modalities [1:30-3:30]

Orals 1:30-3:20

[1:30] GiT: Towards Generalist Vision Transformer through Universal Language Interface

[1:40] Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

[1:50] Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

[2:00] MMBENCH: Is Your Multi-Modal Model an All-around Player?

[2:10] Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

[2:20] Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

[2:30] A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

[2:40] HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

[2:50] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

[3:00] uCAP: An Unsupervised Prompting Method for Vision-Language Models

[3:10] BRAVE: Broadening the visual encoding of vision-language models

(ends 3:30 PM)

3:30 p.m.

Keynote:

Is distribution shift still an AI problem?

Sanmi Koyejo

(ends 4:30 PM)

4:30 p.m.

Poster Session 6 [4:30-6:30]

Posters 4:30-6:30

Exact Diffusion Inversion via Bidirectional Integration Approximation

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Tackling Structural Hallucination in Image Translation with Local Diffusion

Adversarial Diffusion Distillation

Pyramid Diffusion for Fine 3D Large Scene Generation

Controlling the World by Sleight of Hand

Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Towards Neuro-Symbolic Video Understanding

DEVIAS: Learning Disentangled Video Representations of Action and Scene

Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

LongVLM: Efficient Long Video Understanding via Large Language Models

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

BRAVE: Broadening the visual encoding of vision-language models

MMBENCH: Is Your Multi-Modal Model an All-around Player?

uCAP: An Unsupervised Prompting Method for Vision-Language Models

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°

Tri^{2}-plane: Thinking Head Avatar via Feature Pyramid

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

AnimateMe: 4D Facial Expressions via Diffusion Models

Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes

Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM

Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending

UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation

City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web

Few-shot NeRF by Adaptive Rendering Loss Regularization

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Generalizable Human Gaussians for Sparse View Synthesis

Invertible Neural Warp for NeRF

PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects

Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images

SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections

3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting

HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

End-to-End Rate-Distortion Optimized 3D Gaussian Representation

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

Human Hair Reconstruction with Strand-Aligned 3D Gaussians

Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image

Lagrangian Hashing for Compressed Neural Field Representations

GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation

TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

Synthesizing Environment-Specific People in Photographs

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Shapefusion: 3D localized human diffusion models

Fast Sprite Decomposition from Animated Graphics

Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Dolfin: Diffusion Layout Transformers without Autoencoder

MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds

FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy

SEED: A Simple and Effective 3D DETR in Point Clouds

ProtoComp: Diverse Point Cloud Completion with Controllable Prototype

CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation

Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes

Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains

Multi-modal Relation Distillation for Unified 3D Representation Learning

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Single-Photon 3D Imaging with Equi-Depth Photon Histograms

Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment

SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes

Leveraging scale- and orientation-covariant features for planar motion estimation

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision

TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Human Pose Recognition via Occlusion-Preserving Abstract Images

RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark

6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

On the Utility of 3D Hand Poses for Action Recognition

Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

Revisit Self-supervision with Local Structure-from-Motion

AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation

High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

Weakly-supervised Camera Localization by Ground-to-satellite Image Registration

Benchmarking the Robustness of Cross-view Geo-localization Models

Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training

LEROjD: Lidar Extended Radar-Only Object Detection

Towards Stable 3D Object Detection

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

ADMap: Anti-disturbance Framework for Vectorized HD Map Construction

Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction

CarFormer: Self-Driving with Learned Object-Centric Representations

DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction

NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

Visual Relationship Transformation

Local All-Pair Correspondence for Point Tracking

Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation

Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo

Physical-Based Event Camera Simulator

REDIR: Refocus-free Event-based De-occlusion Image Reconstruction

Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising

Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Self-Supervised Audio-Visual Soundscape Stylization

TC4D: Trajectory-Conditioned Text-to-4D Generation

LivePhoto: Real Image Animation with Text-guided Motion Control

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Photorealistic Video Generation with Diffusion Models

High-Fidelity and Transferable NeRF Editing by Frequency Decomposition

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

Editable Image Elements for Controllable Synthesis

Implicit Style-Content Separation using B-LoRA

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

EraseDraw: Learning to Insert Objects by Erasing Them from Images

Text2Place: Affordance-aware Text Guided Human Placement

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

Label-free Neural Semantic Image Synthesis

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Context Diffusion: In-Context Aware Image Generation

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis

SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

Large-scale Reinforcement Learning for Diffusion Models

Latent Guard: a Safety Framework for Text-to-image Generation

Arc2Face: A Foundation Model for ID-Consistent Human Faces

GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images

Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback

Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

FMBoost: Boosting Latent Diffusion with Flow Matching

AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model

LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery

Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution

Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration

Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow

DualDn: Dual-domain Denoising via Differentiable ISP

Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation

Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery

Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images

Energy-induced Explicit quantification for Multi-modality MRI fusion

WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields

Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition

T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing

Personalized Privacy Protection Mask Against Unauthorized Facial Recognition

GRAPE: Generalizable and Robust Multi-view Facial Capture

Seeing Faces in Things: A Model and Dataset for Pareidolia

Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers

OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers

DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving

SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast

Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition

Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment

Classification Matters: Improving Video Action Detection with Class-Specific Attention

HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Appearance-based Refinement for Object-Centric Motion Segmentation

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Fine-grained Dynamic Network for Generic Event Boundary Detection

Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Self-supervised visual learning from interactions with objects

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Video Question Answering with Procedural Programs

ViLA: Efficient Video-Language Alignment for Video Question Answering

ST-LLM: Large Language Models Are Effective Temporal Learners

RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Nonverbal Interaction Detection

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Human-in-the-Loop Visual Re-ID for Population Size Estimation

PreLAR: World Model Pre-training with Learnable Action Representation

Learning to Build by Building Your Own Instructions

Situated Instruction Following

Where am I? Scene Retrieval with Language

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

SegPoint: Segment Any Point Cloud via Large Language Model

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering

LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

BLINK: Multimodal Large Language Models Can See but Not Perceive

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Teach CLIP to Develop a Number Sense for Ordinal Regression

Common Sense Reasoning for Deep Fake Detection

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Evaluating Text-to-Visual Generation with Image-to-Text Generation

DOCCI: Descriptions of Connected and Contrasting Images

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Conceptual Codebook Learning for Vision-Language Models

Do Generalised Classifiers really work on Human Drawn Sketches?

3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

Discovering Unwritten Visual Classifiers with Large Language Models

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Rotary Position Embedding for Vision Transformer

Multi-branch Collaborative Learning Network for 3D Visual Grounding

SILC: Improving Vision Language Pretraining with Self-Distillation

LiteSAM is Actually what you Need for segment Everything

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

Click Prompt Learning with Optimal Transport for Interactive Segmentation

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Segment and Recognize Anything at Any Granularity

SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images

Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation

SAM-guided Graph Cut for 3D Instance Segmentation

Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection

Shifted Autoencoders for Point Annotation Restoration in Object Counting

Learning Camouflaged Object Detection from Noisy Pseudo Label

Just a Hint: Point-Supervised Camouflaged Object Detection

Rectify the Regression Bias in Long-Tailed Object Detection

PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Visible and Clear: Finding Tiny Objects in Difference Map

IRGen: Generative Modeling for Image Retrieval

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation

Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

BugNIST - a Large Volumetric Dataset for Detection under Domain Shift

AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset

GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions

Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

Attention Beats Linear for Fast Implicit Neural Representation Generation

OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks

ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

AttnZero: Efficient Attention Discovery for Vision Transformers

Isomorphic Pruning for Vision Models

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Robustness Tokens: Towards Adversarial Robustness of Transformers

Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration

Neural Spectral Decomposition for Dataset Distillation

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Adaptive Multi-head Contrastive Learning

Unsqueeze [CLS] Bottleneck to Learn Rich Representations

Improving Zero-Shot Generalization for CLIP with Variational Adapter

Learning to Obstruct Few-Shot Image Classification over Restricted Classes

Improving Hyperbolic Representations via Gromov-Wasserstein Regularization

HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions

Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density

SCOD: From Heuristics to Theory

LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration

SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Labeled Data Selection for Category Discovery

PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation

CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning

Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling

MagMax: Leveraging Model Merging for Seamless Continual Learning

Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning

Learning to Unlearn for Robust Machine Unlearning

UNIC: Universal Classification Models via Multi-teacher Distillation

Distributed Active Client Selection With Noisy Clients Using Model Association Scores

Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching

FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning

Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge

Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting

A high-quality robust diffusion framework for corrupted dataset

Similarity of Neural Architectures using Adversarial Attack Transferability

Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

Resilience of Entropy Model in Distributed Neural Networks

WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning

Instant 3D Human Avatar Generation using Image Diffusion Models

(ends 6:30 PM)

FRI 4 OCT

8:30 a.m.

Oral 7A: Learning Architectures, Transfer, Continual And Long-Tail [8:30-10:30]

Orals 8:30-10:10

[8:30] On the Topology Awareness and Generalization Performance of Graph Neural Networks

[8:40] Improving Knowledge Distillation via Regularizing Feature Direction and Norm

[8:50] Spline-based Transformers

[9:00] Anytime Continual Learning for Open Vocabulary Classification

[9:10] Weighted Ensemble Models Are Strong Continual Learners

[9:20] COD: Learning Conditional Invariant Representation for Domain Adaptation Regression

[9:30] Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning

[9:40] Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

[9:50] Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

[10:00] HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

(ends 10:30 AM)

Oral 7B: Adversarial Learning And Privacy [8:30-10:30]

Orals 8:30-10:00

[8:30] Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

[8:40] Adversarial Robustification via Text-to-Image Diffusion Models

[8:50] Flatness-aware Sequential Learning Generates Resilient Backdoors

[9:00] A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

[9:10] Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks

[9:20] R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

[9:30] Privacy-Preserving Adaptive Re-Identification without Image Transfer

[9:40] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

[9:50] Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

(ends 10:30 AM)

Oral 7C: Optimization And Theory [8:30-10:30]

Orals 8:30-10:10

[8:30] A Direct Approach to Viewing Graph Solvability

[8:40] Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees

[8:50] Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering

[9:00] A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures

[9:10] Physics-Based Interaction with 3D Objects via Video Generation

[9:20] Shape from Heat Conduction

[9:30] Rasterized Edge Gradients: Handling Discontinuities Differentially

[9:40] ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

[9:50] Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

[10:00] Model Stock: All we need is just a few fine-tuned models

(ends 10:30 AM)