4D Contrastive Superflows are Dense 3D Representation Learners
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation
Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance
Modeling and Driving Human Body Soundfields through Acoustic Primitives
Motion Mamba: Efficient and Long Sequence Motion Generation
Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation
SAGS: Structure-Aware 3D Gaussian Splatting
MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views
Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models
Disentangling Masked Autoencoders for Unsupervised Domain Generalization
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description
BRAVE: Broadening the visual encoding of vision-language models
Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction
CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs
AFreeCA: Annotation-Free Counting for All
Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
Motion and Structure from Event-based Normal Flow
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion
DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery
Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation
GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
Merlin: Empowering Multimodal LLMs with Foresight Minds
E.T. the Exceptional Trajectory: Text-to-camera-trajectory generation with character awareness
Nuvo: Neural UV Mapping for Unruly 3D Representations
Towards Neuro-Symbolic Video Understanding
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Diffusion Bridges for 3D Point Cloud Denoising
AttnZero: Efficient Attention Discovery for Vision Transformers
Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search
Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search
Spectral Subsurface Scattering for Material Classification
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
CarFormer: Self-Driving with Learned Object-Centric Representations
Text-Guided Video Masked Autoencoder
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting
Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
EvSign: Sign Language Recognition and Translation with Streaming Events
MetaAug: Meta-Data Augmentation for Post-Training Quantization
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning
UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation
PartSTAD: 2D-to-3D Part Segmentation Task Adaptation
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
Cross-Input Certified Training for Universal Perturbations
Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework
LiDAR-Event Stereo Fusion with Hallucinations
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
Revisiting Supervision for Continual Representation Learning
Dolphins: Multimodal Language Model for Driving
MMBENCH: Is Your Multi-Modal Model an All-around Player?
HUMOS: Human Motion Model Conditioned on Body Shape
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds
Unsupervised Exposure Correction
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
External Knowledge Enhanced 3D Scene Generation from Sketch
GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection
3D Congealing: 3D-Aware Image Alignment in the Wild
Adversarial Robustification via Text-to-Image Diffusion Models
CoMo: Controllable Motion Generation through Language Guided Pose Code Editing
MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction
Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective
Benchmarking the Robustness of Cross-view Geo-localization Models
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Model Stock: All we need is just a few fine-tuned models
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision
Formula-Supervised Visual-Geometric Pre-training
MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding
Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow
TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing
Robust Fitting on a Gate Quantum Computer
Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics
Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement
Large-scale Reinforcement Learning for Diffusion Models
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
3D Single-object Tracking in Point Clouds with High Temporal Variation
Self-supervised Shape Completion via Involution and Implicit Correspondences
Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems
Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion
iHuman: Instant Animatable Digital Humans From Monocular Videos
LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
Energy-induced Explicit quantification for Multi-modality MRI fusion
Characterizing Model Robustness via Natural Input Gradients
ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement
GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors
FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection
BugNIST - a Large Volumetric Dataset for Detection under Domain Shift
Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
See and Think: Embodied Agent in Virtual Environment
Scalar Function Topology Divergence: Comparing Topology of 3D Objects
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
Towards Robust Full Low-bit Quantization of Super Resolution Networks
When Do We Not Need Larger Vision Models?
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
GVGEN: Text-to-3D Generation with Volumetric Representation
Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
UNIC: Universal Classification Models via Multi-teacher Distillation
MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References
ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
PointNeRF++: A multi-scale, point-based Neural Radiance Field
Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Differentiable Convex Polyhedra Optimization from Multi-view Images
WHAC: World-grounded Humans and Cameras
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
V-IRL: Grounding Virtual Intelligence in Real Life
SENC: Handling Self-collision in Neural Cloth Simulation
TrojVLM: Backdoor Attack Against Vision Language Models
Dataset Growth
m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos
ReMamber: Referring Image Segmentation with Mamba Twister
Plain-Det: A Plain Multi-Dataset Object Detector
Pix2Gif: Motion-Guided Diffusion for GIF Generation
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography
LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
Beta-Tuned Timestep Diffusion Model
Bayesian Evidential Deep Learning for Online Action Detection
Local All-Pair Correspondence for Point Tracking
Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations
SEED: A Simple and Effective 3D DETR in Point Clouds
Intrinsic Single-Image HDR Reconstruction
DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution
Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification
LaRa: Efficient Large-Baseline Radiance Fields
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
MobileNetV4: Universal Models for the Mobile Ecosystem
Efficient Snapshot Spectral Imaging: Calibration-Free Parallel Structure with Aperture Diffraction Fusion
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection
Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather
DiffiT: Diffusion Vision Transformers for Image Generation
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors
Prioritized Semantic Learning for Zero-shot Instance Navigation
Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats
Can OOD Object Detectors Learn from Foundation Models?
2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction
RadEdit: stress-testing biomedical vision models via diffusion image editing
Towards Real-world Event-guided Low-light Video Enhancement and Deblurring
Referring Atomic Video Action Recognition
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
MyVLM: Personalizing VLMs for User-Specific Queries
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
AMEGO: Active Memory from long EGOcentric videos
Camera-LiDAR Cross-modality Gait Recognition
Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
An Adaptive Screen-Space Meshing Approach for Normal Integration
Collaborative Control for Geometry-Conditioned PBR Image Generation
Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning
Quantization-Friendly Winograd Transformations for Convolutional Neural Networks
Look Around and Learn: Self-Training Object Detection by Exploration
Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model
Regularizing Dynamic Radiance Fields with Kinematic Fields
SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving
Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss
DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators
Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks
Large Motion Model for Unified Multi-Modal Motion Generation
Memory-Efficient Fine-Tuning for Quantized Diffusion Model
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Unified Local-Cloud Decision-Making via Reinforcement Learning
Think before Placement: Common Sense Enhanced Transformer for Object Placement
The Hard Positive Truth about Vision-Language Compositionality
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
Concise Plane Arrangements for Low-Poly Surface and Volume Modelling
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting
Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation
AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion
Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing
GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views
Efficient Bias Mitigation Without Privileged Information
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Towards Open-Ended Visual Recognition with Large Language Models
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
On the Utility of 3D Hand Poses for Action Recognition
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
IRGen: Generative Modeling for Image Retrieval
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow
VISA: Reasoning Video Object Segmentation via Large Language Model
Learning Representations of Satellite Images From Metadata Supervision
Adaptive Parametric Activation
Scaling Backwards: Minimal Synthetic Pre-training?
Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction
Towards Multi-modal Transformers in Federated Learning
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion
Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning
FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
General and Task-Oriented Video Segmentation
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
Benchmarking Object Detectors with COCO: A New Path Forward
Diffusion Model is a Good Pose Estimator from 3D RF-Vision
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
Grounding Language Models for Visual Entity Recognition
Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
Learning 3D-aware GANs from Unposed Images with Template Feature Field
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Human Hair Reconstruction with Strand-Aligned 3D Gaussians
SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders
Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection
Global-to-Pixel Regression for Human Mesh Recovery
CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
Rethinking Image Super Resolution from Training Data Perspectives
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
Interactive 3D Object Detection with Prompts
Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams
Neural Volumetric World Models for Autonomous Driving
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
Uni3DL: A Unified Model for 3D Vision-Language Understanding
G3R: Gradient Guided Generalizable Reconstruction
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
Invertible Neural Warp for NeRF
AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Language-Image Pre-training with Long Captions
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
CoReS: Orchestrating the Dance of Reasoning and Segmentation
MambaIR: A Simple Baseline for Image Restoration with State-Space Model
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models
I Can't Believe It's Not Scene Flow!
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
Bi-directional Contextual Attention for 3D Dense Captioning
Scalable Group Choreography via Variational Phase Manifold Learning
Quality Assured: Rethinking Annotation Strategies in Imaging AI
Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels
TPA3D: Triplane Attention for Fast Text-to-3D Generation
Augmented Neural Fine-tuning for Efficient Backdoor Purification
Human Pose Recognition via Occlusion-Preserving Abstract Images
AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling
Retrieval Robust to Object Motion Blur
Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction
Occlusion-Aware Seamless Segmentation
TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts
Diffusion Models for Open-Vocabulary Segmentation
Rethinking Unsupervised Outlier Detection via Multiple Thresholding
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
Stream Query Denoising for Vectorized HD-Map Construction
Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting
WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Quantized Prompt for Efficient Generalization of Vision-Language Models
Modality Translation for Object Detection Adaptation without forgetting prior knowledge
How Video Meetings Change Your Expression
Audio-driven Talking Face Generation with Stabilized Synchronization Loss
Learning to Obstruct Few-Shot Image Classification over Restricted Classes
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation
L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
Distilling Diffusion Models into Conditional GANs
UMBRAE: Unified Multimodal Brain Decoding
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
Multiscale Graph Texture Network
LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping
Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing
Blind image deblurring with noise-robust kernel estimation
Free-Viewpoint Video of Outdoor Sports Using a Drone
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency
Binomial Self-compensation for Motion Error in Dynamic 3D Scanning
Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation
Momentum Auxiliary Network for Supervised Local Learning
HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion
Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation
Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains
PQ-SAM: Post-training Quantization for Segment Anything Model
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
Improving Zero-Shot Generalization for CLIP with Variational Adapter
LaWa: Using Latent Space for In-Generation Image Watermarking
Topology-Preserving Downsampling of Binary Images
Cocktail Universal Adversarial Attack on Deep Neural Networks
Hypernetworks for Generalizable BRDF Representation
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders
Classification Matters: Improving Video Action Detection with Class-Specific Attention
Improving Medical Multi-modal Contrastive Learning with Expert Annotations
Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
Leveraging temporal contextualization for video action recognition
AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale
ZigMa: A DiT-style Zigzag Mamba Diffusion Model
Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time
Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
Data Collection-free Masked Video Modeling
Resilience of Entropy Model in Distributed Neural Networks
Implicit Concept Removal of Diffusion Models
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Restoring Images in Adverse Weather Conditions via Histogram Transformer
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis
G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Generating 3D House Wireframes with Semantics
SegPoint: Segment Any Point Cloud via Large Language Model
Navigation Instruction Generation with BEV Perception and Large Language Models
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally
Eliminating Feature Ambiguity for Few-Shot Segmentation
Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation
GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator
BLINK: Multimodal Large Language Models Can See but Not Perceive
PreLAR: World Model Pre-training with Learnable Action Representation
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot
Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions
FreestyleRet: Retrieving Images from Style-Diversified Queries
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal
ReGround: Improving Textual and Spatial Grounding at No Cost
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos
Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting
Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
Image Demoireing in RAW and sRGB Domains
Reliability in Semantic Segmentation: Can We Use Synthetic Data?
Prompting Future Driven Diffusion Model for Hand Motion Prediction
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views
Lazy Diffusion Transformer for Interactive Image Editing
Robust Calibration of Large Vision-Language Adapters
Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation
Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training
AugDETR: Improving Multi-scale Learning for Detection Transformer
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
SIGMA: Sinkhorn-Guided Masked Video Modeling
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams
Understanding Physical Dynamics with Counterfactual World Modeling
SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild
VideoMamba: Spatio-Temporal Selective State Space Model
Text to Layer-wise 3D Clothed Human Generation
Fully Sparse 3D Occupancy Prediction
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field
High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding
PointLLM: Empowering Large Language Models to Understand Point Clouds
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation
Spatially-Variant Degradation Model for Dataset-free Super-resolution
Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence
SUMix: Mixup with Semantic and Uncertain Information
Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation
EAFormer: Scene Text Segmentation with Edge-Aware Transformers
DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction
LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation
Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
Zero-Shot Detection of AI-Generated Images
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training
Exploring Guided Sampling of Conditional GANs
TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection
Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation
Kalman-Inspired Feature Propagation for Video Face Super-Resolution
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
VideoMamba: State Space Model for Efficient Video Understanding
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video
Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
FreeAugment: Data Augmentation Search Across All Degrees of Freedom
I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance
SOS: Segment Object System for Open-World Instance Segmentation With Object Priors
Lagrangian Hashing for Compressed Neural Field Representations
Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis
Gaze Target Detection Based on Head-Local-Global Coordination
3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms
An Economic Framework for 6-DoF Grasp Detection
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery
Multi-Label Cluster Discrimination for Visual Representation Learning
Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion
CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models
Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions
Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences
Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective
Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
MagicEraser: Erasing Any Objects via Semantics-Aware Control
Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
3D Small Object Detection with Dynamic Spatial Pruning
Semantically Guided Representation Learning For Action Anticipation
MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory
ScanTalk: 3D Talking Heads from Unregistered Scans
FreeInit: Bridging Initialization Gap in Video Diffusion Models
Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching
Controllable Navigation Instruction Generation with Chain of Thought Prompting
TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning
LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment
EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere
SuperGaussian: Repurposing Video Models for 3D Super Resolution
Towards Model-Agnostic Dataset Condensation by Heterogeneous Models
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction
Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution
Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
UniFS: Universal Few-shot Instance Perception with Point Representations
Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation
Physics-Based Interaction with 3D Objects via Video Generation
Taming Latent Diffusion Model for Neural Radiance Field Inpainting
Shedding More Light on Robust Classifiers under the lens of Energy-based Models
CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians
Unleashing the Power of Prompt-driven Nucleus Instance Segmentation
FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
3DEgo: 3D Editing on the Go!
Domain-adaptive Video Deblurring via Test-time Blurring
NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving
Progressive Pretext Task Learning for Human Trajectory Prediction
Hyperion – A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM
Isomorphic Pruning for Vision Models
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation
DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing
VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos
Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack
Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Cross-Domain Learning for Video Anomaly Detection with Limited Supervision
Unsupervised Multi-modal Medical Image Registration via Invertible Translation
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
View Selection for 3D Captioning via Diffusion Ranking
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream
DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment
SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture
PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection
Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach
Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction
BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression
Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection
Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection
CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization
SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing
ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation
OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks
Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
Click Prompt Learning with Optimal Transport for Interactive Segmentation
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
3D Human Pose Estimation via Non-Causal Retentive Networks
6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry
Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition
DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction
Learning Diffusion Models for Multi-View Anomaly Detection
Masked Angle-Aware Autoencoder for Remote Sensing Images
Multi-modal Relation Distillation for Unified 3D Representation Learning
LongVLM: Efficient Long Video Understanding via Large Language Models
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection
Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model
Light-in-Flight for a World-in-Motion
Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels
Learning with Unmasked Tokens Drives Stronger Vision Learners
Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture
Deep Patch Visual SLAM
LiteSAM is Actually what you Need for segment Everything
GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections
Visual Prompting via Partial Optimal Transport
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Visual Grounding for Object-Level Generalization in Reinforcement Learning
TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection
SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
Asymmetric Mask Scheme for Self-Supervised Real Image Denoising
FlexAttention for Efficient High-Resolution Vision-Language Models
EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation
Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation
PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving
Temporal Event Stereo via Joint Learning with Stereoscopic Flow
H-V2X: A Large Scale Highway Dataset for BEV Perception
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images
Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation
InstructIR: High-Quality Image Restoration Following Human Instructions
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
LayoutFlow: Flow Matching for Layout Generation
Making Large Language Models Better Planners with Reasoning-Decision Alignment
Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference
PACE: Pose Annotations in Cluttered Environments
InfMAE: A Foundation Model in The Infrared Modality
Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs
STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
Robust Incremental Structure-from-Motion with Hybrid Features
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
UniCal: Unified Neural Sensor Calibration
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models
Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter
ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions
Trajectory-aligned Space-time Tokens for Few-shot Action Recognition
Synchronization of Projective Transformations
U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation
Insect Identification in the Wild: The AMI Dataset
Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers
CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring
This Probably Looks Exactly Like That: An Invertible Prototypical Network
GenRC: Generative 3D Room Completion from Sparse Image Collections
Towards Open-ended Visual Quality Comparison
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
Neural graphics texture compression supporting random access
Contrastive Learning with Synthetic Positives
GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
DIM: Dyadic Interaction Modeling for Social Behavior Generation
ControlCap: Controllable Region-level Captioning
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models
Watch Your Steps: Local Image and Scene Editing by Text Instructions
Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
3x2: 3D Object Part Segmentation by 2D Semantic Correspondences
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians
Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties
Fast View Synthesis of Casual Videos with Soup-of-Planes
Confidence Self-Calibration for Multi-Label Class-Incremental Learning
Video Question Answering with Procedural Programs
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification
Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting
SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields
Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization
LLMGA: Multimodal Large Language Model based Generation Assistant
Shape from Heat Conduction
Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence
HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation
AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration
Better Call SAL: Towards Learning to Segment Anything in Lidar
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
iMatching: Imperative Correspondence Learning
Appearance-based Refinement for Object-Centric Motion Segmentation
Open Panoramic Segmentation
Open Vocabulary Multi-Label Video Classification
Shape-guided Configuration-aware Learning for Endoscopic-image-based Pose Estimation of Flexible Robotic Instruments
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation
GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image
Efficient Pre-training for Localized Instruction Generation of Procedural Videos
MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution
DEAL: Disentangle and Localize Concept-level Explanations for VLMs
RoadPainter: Points Are Ideal Navigators for Topology transformER
Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
IMMA: Immunizing text-to-image Models against Malicious Adaptation
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
GeoCalib: Learning Single-image Calibration with Geometric Optimization
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
ReMatching: Low-Resolution Representations for Scalable Shape Correspondence
Semicalibrated Relative Pose from an Affine Correspondence and Monodepth
Global Structure-from-Motion Revisited
Gravity-aligned Rotation Averaging with Circular Regression
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
Quanta Video Restoration
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
A Probability-guided Sampler for Neural Implicit Surface Rendering
CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition
POCA: Post-training Quantization with Temporal Alignment for Codec Avatars
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
Audio-Synchronized Visual Animation
Expressive Whole-Body 3D Gaussian Avatar
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
PAV: Personalized Head Avatar from Unstructured Video Collection
Strike a Balance in Continual Panoptic Segmentation
MultiDelete for Multimodal Machine Unlearning
Stitched ViTs are Flexible Vision Backbones
Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation
TrajPrompt: Aligning Color Trajectory with Vision-Language Representations
Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis
CountFormer: Multi-View Crowd Counting Transformer
SemReg: Semantics Constrained Point Cloud Registration
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception
R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
ActionVOS: Actions as Prompts for Video Object Segmentation
DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models
One-stage Prompt-based Continual Learning
Unsqueeze [CLS] Bottleneck to Learn Rich Representations
Robust Multimodal Learning via Representation Decoupling
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment
WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks
A Direct Approach to Viewing Graph Solvability
Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer
Look Hear: Gaze Prediction for Speech-directed Human Attention
Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching
Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
Parrot Captions Teach CLIP to Spot Text
Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning
Solving Motion Planning Tasks with a Scalable Generative Model
Rotary Position Embedding for Vision Transformer
Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
ReNoise: Real Image Inversion Through Iterative Noising
Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment
Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation
Recursive Visual Programming
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
Learning to Adapt SAM for Segmenting Cross-domain Point Clouds
Take A Step Back: Rethinking the Two Stages in Visual Reasoning
Human-in-the-Loop Visual Re-ID for Population Size Estimation
Finding Visual Task Vectors
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Tensorial template matching for fast cross-correlation with rotations and its application for tomography
Event Camera Data Dense Pre-training
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
EgoLifter: Open-world 3D Segmentation for Egocentric Perception
MoVideo: Motion-Aware Video Generation with Diffusion Models
ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion
Where am I? Scene Retrieval with Language
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
RangeLDM: Fast Realistic LiDAR Point Cloud Generation
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
Physically Plausible Color Correction for Neural Radiance Fields
Unifying 3D Vision-Language Understanding via Promptable Queries
LLM as Copilot for Coarse-grained Vision-and-Language Navigation
Revisiting Calibration of Wide-Angle Radially Symmetric Cameras
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control
MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction
A New Dataset and Framework for Real-World Blurred Images Super-Resolution
Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction
Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Uncertainty-aware sign language video retrieval with probability distribution modeling
NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Adversarial Prompt Tuning for Vision-Language Models
BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
Operational Open-Set Recognition and PostMax Refinement
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
Text2Place: Affordance-aware Text Guided Human Placement
REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices
Self-Training Room Layout via Geometry-aware Ray-casting
TAPTR: Tracking Any Point with Transformers as Detection
Adaptive Multi-task Learning for Few-shot Object Detection
Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration
CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection
Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On
TC4D: Trajectory-Conditioned Text-to-4D Generation
RAW-Adapter: Adapting Pretrained Visual Model to Camera RAW Images
Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding
Dataset Enhancement with Instance-Level Augmentations
AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching
ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery
SLIM: Spuriousness Mitigation with Minimal Human Annotations
Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset
X-Pose: Detecting Any Keypoints
MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction
Motion Aware Event Representation-driven Image Deblurring
Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
GroupDiff: Diffusion-based Group Portrait Editing
Privacy-Preserving Adaptive Re-Identification without Image Transfer
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views
Towards More Practical Group Activity Detection: A New Benchmark and Model
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
Zero-Shot Image Feature Consensus with Deep Functional Maps
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web
Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection
SeiT++: Masked Token Modeling Improves Storage-efficient Training
Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling
ProMerge: Prompt and Merge for Unsupervised Instance Segmentation
Open-Vocabulary Camouflaged Object Segmentation
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
InterFusion: Text-Driven Generation of 3D Human-Object Interaction
GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval
Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition
Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection
Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification
Compositional Substitutivity of Visual Reasoning for Visual Question Answering
DNI: Dilutional Noise Initialization for Diffusion Video Editing
Fully Authentic Visual Question Answering Dataset from Online Communities
Towards Physical World Backdoor Attacks against Skeleton Action Recognition
Active Generation for Image Classification
Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Diffusion-Guided Weakly Supervised Semantic Segmentation
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
Real-time Holistic Robot Pose Estimation with Unknown States
Online Vectorized HD Map Construction using Geometry
Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians
Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data
Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion
Improving Virtual Try-On with Garment-focused Diffusion Models
MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation
Disentangled Generation and Aggregation for Robust Radiance Fields
MoAI: Mixture of All Intelligence for Large Language and Vision Models
SMooDi: Stylized Motion Diffusion Model
Online Temporal Action Localization with Memory-Augmented Transformer
JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Learning Video Context as Interleaved Multimodal Sequences
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds
Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks
Multi-branch Collaborative Learning Network for 3D Visual Grounding
Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
Revisit Human-Scene Interaction via Space Occupancy
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model
Mitigating Background Shift in Class-Incremental Semantic Segmentation
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation
Object-Oriented Anchoring and Modal Alignment in Multimodal Learning
SPIRE: Semantic Prompt-Driven Image Restoration
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance
Towards Stable 3D Object Detection
FYI: Flip Your Images for Dataset Distillation
On-the-fly Category Discovery for LiDAR Semantic Segmentation
Dual-Camera Smooth Zoom on Mobile Phones
Attention Decomposition for Cross-Domain Semantic Segmentation
CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
PolyRoom: Room-aware Transformer for Floorplan Reconstruction
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution
AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors
Improving Video Segmentation via Dynamic Anchor Queries
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Diffusion Models as Optimizers for Efficient Planning in Offline RL
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image
Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective
Flatness-aware Sequential Learning Generates Resilient Backdoors
PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models
SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning
Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning
An Incremental Unified Framework for Small Defect Inspection
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation
PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training
Real-time 3D-aware Portrait Editing from a Single Image
Dolfin: Diffusion Layout Transformers without Autoencoder
Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
Emergent Visual-Semantic Hierarchies in Image-Text Representations
DriveLM: Driving with Graph Visual Question Answering
Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection
Real Appearance Modeling for More General Deepfake Detection
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Event Trojan: Asynchronous Event-based Backdoor Attacks
V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception
VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space
CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing
GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding
Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding
HARIVO: Harnessing Text-to-Image Models for Video Generation
Deep Online Probability Aggregation Clustering
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
Length-Aware Motion Synthesis via Latent Diffusion
Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification
Free Lunch for Gait Recognition: A Novel Relation Descriptor
OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers
An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers
Disentangled Clothed Avatar Generation from Text Descriptions
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
Exemplar-free Continual Representation Learning via Learnable Drift Compensation
Improving image synthesis with diffusion-negative sampling
AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation
SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
Accelerating Image Generation with Sub-path Linear Approximation Model
Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations
SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation
Camera Calibration using a Collimator System
GRA: Detecting Oriented Objects through Group-wise Rotating and Attention
Track Everything Everywhere Fast and Robustly
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
Label-free Neural Semantic Image Synthesis
Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment
FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN
ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images
Event-Aided Time-To-Collision Estimation for Autonomous Driving
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
VEON: Vocabulary-Enhanced Occupancy Prediction
Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models
HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images
Nonverbal Interaction Detection
The Sky's the Limit: Relightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility
DiffFAS: Face Anti-Spoofing via Generative Diffusion Models
Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
Neural Spectral Decomposition for Dataset Distillation
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
Region-Adaptive Transform with Segmentation Prior for Image Compression
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Cascade Prompt Learning for Visual-Language Model Adaptation
Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion
cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition
Delving Deep into Engagement Prediction of Short Videos
CLEO: Continual Learning of Evolving Ontologies
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
Leveraging scale- and orientation-covariant features for planar motion estimation
MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion
MultiGen: Zero-shot Image Generation from Multi-modal Prompts
Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors
SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction
Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition
HERGen: Elevating Radiology Report Generation with Longitudinal Data
Labeled Data Selection for Category Discovery
Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation
Dependency-aware Differentiable Neural Architecture Search
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer
SNeRV: Spectra-preserving Neural Representation for Video
COMO: Compact Mapping and Odometry
SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation
An Information Theoretical View for Out-Of-Distribution Detection
HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras
Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
SILC: Improving Vision Language Pretraining with Self-Distillation
Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction
DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes
Transferable 3D Adversarial Shape Completion using Diffusion Models
Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation
Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising
Event-Adapted Video Super-Resolution
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation
LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction
On Pretraining Data Diversity for Self-Supervised Learning
Bayesian Self-Training for Semi-Supervised 3D Segmentation
Tri^{2}-plane: Thinking Head Avatar via Feature Pyramid
Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling
Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal
ParCo: Part-Coordinating Text-to-Motion Synthesis
Learning to Complement and to Defer to Multiple Users
Tiny Models are the Computational Saver for Large Models
Multi-Sentence Grounding for Long-term Instructional Video
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients
Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°
KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Rate-Distortion-Cognition Controllable Versatile Neural Image Compression
Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
Vista3D: unravel the 3d darkside of a single image
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control
Segment and Recognize Anything at Any Granularity
ST-LLM: Large Language Models Are Effective Temporal Learners
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
Exact Diffusion Inversion via Bidirectional Integration Approximation
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors
Object-Centric Diffusion for Efficient Video Editing
Single-Mask Inpainting for Voxel-based Neural Radiance Fields
Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
Agglomerative Token Clustering
CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection
NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
GIVT: Generative Infinite-Vocabulary Transformers
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density
Multi-Modal Video Dialog State Tracking in the Wild
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
Combining Generative and Geometry Priors for Wide-Angle Portrait Correction
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
StereoGlue: Joint Feature Matching and Robust Estimation
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction
Foster Adaptivity and Balance in Learning with Noisy Labels
Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection
Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM
AWOL: Analysis WithOut synthesis using Language
OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
Temporal Residual Jacobians for Rig-free Motion Transfer
Object-Aware NIR-to-Visible Translation
Taming Lookup Tables for Efficient Image Retouching
DualDn: Dual-domain Denoising via Differentiable ISP
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
NICP: Neural ICP for 3D Human Registration at Scale
Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines
LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras
StableDrag: Stable Dragging for Point-based Image Editing
Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation
Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching
Monocular Occupancy Prediction for Scalable Indoor Scenes
Neural Surface Detection for Unsigned Distance Fields
Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation
Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes
Event-Based Motion Magnification
AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network
Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images
Towards Multimodal Sentiment Analysis Debiasing via Bias Purification
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Event-based Head Pose Estimation: Benchmark and Method
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
PSALM: Pixelwise Segmentation with Large Multi-modal Model
Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
Robustness Tokens: Towards Adversarial Robustness of Transformers
DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images
DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
EINet: Point Cloud Completion via Extrapolation and Interpolation
Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases
Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation
ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories
AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval
TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays
StyleCity: Large-Scale 3D Urban Scenes Stylization
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
ViG-Bias: Visually Grounded Bias Discovery and Mitigation
Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors
DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting
Relightable Neural Actor with Intrinsic Decomposition and Pose Control
Assessing Sample Quality via the Latent Space of Generative Models
Enhancing Vectorized Map Perception with Historical Rasterized Maps
Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model
Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models
Responsible Visual Editing
Consistent 3D Line Mapping
Distributed Active Client Selection With Noisy Clients Using Model Association Scores
PixOOD: Pixel-Level Out-of-Distribution Detection
SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving
ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Editable Image Elements for Controllable Synthesis
General Geometry-aware Weakly Supervised 3D Object Detection
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning
EA-VTR: Event-Aware Video-Text Retrieval
GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns
POA: Pre-training Once for Models of All Sizes
Towards a Density Preserving Objective Function for Learning on Point Sets
VF-NeRF: Viewshed Fields for Rigid NeRF Registration
RSL-BA: Rolling Shutter Line Bundle Adjustment
Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction
Trainable Highly-expressive Activation Functions
MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation
RealViformer: Investigating Attention for Real-World Video Super-Resolution
Do text-free diffusion models learn discriminative visual representations?
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
Training-Free Model Merging for Multi-target Domain Adaptation
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Instant 3D Human Avatar Generation using Image Diffusion Models
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
DOCCI: Descriptions of Connected and Contrasting Images
Drag Anything: Motion Control for Anything using Entity Representation
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images
EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection
McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
LEROjD: Lidar Extended Radar-Only Object Detection
ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation
Probabilistic Image-Driven Traffic Modeling via Remote Sensing
VideoStudio: Generating Consistent-Content and Multi-Scene Videos
Semantic Residual Prompts for Continual Learning
EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation
TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
Occupancy as Set of Points
UAV First-Person Viewers Are Radiance Field Learners
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
A Fair Ranking and New Model for Panoptic Scene Graph Generation
ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning
Situated Instruction Following
M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
GalLop: Learning global and local prompts for vision-language models
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
Two-Stage Video Shadow Detection via Temporal-Spatial Adaption
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization
Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation
Lossy Image Compression with Foundation Diffusion Models
UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
FMBoost: Boosting Latent Diffusion with Flow Matching
Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection
M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation
Shifted Autoencoders for Point Annotation Restoration in Object Counting
An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes
Kernel Diffusion: An Alternate Approach to Blind Deconvolution
FoundPose: Unseen Object Pose Estimation with Foundation Features
LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration
Diffusion Models as Data Mining Tools
Graph Neural Network Causal Explanation via Neural Causal Models
SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
Improving Adversarial Transferability via Model Alignment
RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation
Embodied Understanding of Driving Scenarios
NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation
ViLA: Efficient Video-Language Alignment for Video Question Answering
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Factorizing Text-to-Video Generation by Explicit Image Conditioning
MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
Open-Set Biometrics: Beyond Good Closed-Set Models
Osmosis: RGBD Diffusion Prior for Underwater Image Restoration
Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization
Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields
Flowed Time of Flight Radiance Fields
Cut out the Middleman: Revisiting Pose-based Gait Recognition
3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing
Fast Registration of Photorealistic Avatars for VR Facial Animation
CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs
FedHARM: Harmonizing Model Architectural Diversity in Federated Learning
Thinking Outside the BBox: Unconstrained Generative Object Compositing
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS
Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance
TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving
RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark
EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
RICA^2: Rubric-Informed, Calibrated Assessment of Actions
Commonly Interesting Images
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching
Caltech Aerial RGB-Thermal Dataset in the Wild
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
CityGuessr: City-Level Video Geo-Localization on a Global Scale
Bayesian Detector Combination for Object Detection with Crowdsourced Annotations
Revising Densification in Gaussian Splatting
FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions
UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation
PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis
A Graph-Based Approach for Category-Agnostic Pose Estimation
Depth-guided NeRF Training via Earth Mover’s Distance
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks
Diagnosing and Re-learning for Balanced Multimodal Learning
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders
Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data
AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation
CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection
SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging
Minimalist Vision with Freeform Pixels
All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation
LatentEditor: Text Driven Local Editing of 3D Scenes
POET: Prompt Offset Tuning for Continual Human Action Adaptation
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs
TrafficNight: An Aerial Multimodal Benchmark For Nighttime Vehicle Surveillance
Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
Loc3Diff: Local Diffusion for 3D Human Head Synthesis and Editing
Generative End-to-End Autonomous Driving
Learning to Distinguish Samples for Generalized Category Discovery
COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark
Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem
WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning
Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice
Encapsulating Knowledge in One Prompt
Delving into Adversarial Robustness on Document Tampering Localization
Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing
Confidence-Based Iterative Generation for Real-World Image Super-Resolution
Seeing Faces in Things: A Model and Dataset for Pareidolia
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
AMD: Automatic Multi-step Distillation of Large-scale Vision Models
FairViT: Fair Vision Transformer via Adaptive Masking
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation
HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation
Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction
Investigating Style Similarity in Diffusion Models
JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention
MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space
EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors
GTMS: A Gradient-driven Tree-guided Mask-free Referring Image Segmentation Method
SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction
VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
Unmasking Bias in Diffusion Model Training
Multimodal Label Relevance Ranking via Reinforcement Learning
A Simple Background Augmentation Method for Object Detection with Diffusion Model
BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events
A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization
Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation
Sparse Refinement for Efficient High-Resolution Semantic Segmentation
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought
Fast Sprite Decomposition from Animated Graphics
Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation
Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation
CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
UGG: Unified Generative Grasping
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation
Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Training-free Composite Scene Generation for Layout-to-Image Synthesis
Robustness Preserving Fine-tuning using Neuron Importance
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
Similarity of Neural Architectures using Adversarial Attack Transferability
Dual-Rain: Video Rain Removal using Assertive and Gentle Teachers
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation
Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks
Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement
Scene-Conditional 3D Object Stylization and Composition
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Revisit Anything: Visual Place Recognition via Image Segment Retrieval
Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation
Self-Guided Generation of Minority Samples Using Diffusion Models
DEVIAS: Learning Disentangled Video Representations of Action and Scene
RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
Class-Agnostic Object Counting with Text-to-Image Diffusion Model
Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks
Forbes: Face Obfuscation Rendering via Backpropagation Refinement Scheme
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Information Bottleneck Based Data Correction in Continual Learning
A Watermark-Conditioned Diffusion Model for IP Protection
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation
On Spectral Properties of Gradient-based Explanation Methods
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Generalizing to Unseen Domains via Text-guided Augmentation
Contextual Correspondence Matters: Bidirectional Graph Matching for Video Summarization
VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation
Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models
Zero-shot Text-guided Infinite Image Synthesis with LLM guidance
Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution
Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
Adaptive Multi-head Contrastive Learning
Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation
Easing 3D Pattern Reasoning with Side-view Features for Semantic Scene Completion
MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
Adaptive Annealing for Robust Averaging
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering
Early Anticipation of Driving Maneuvers
SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization
On the Evaluation Consistency of Attribution-based Explanations
Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction
DreamReward: Aligning Human Preference in Text-to-3D Generation
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
Towards Image Ambient Lighting Normalization
FedHide: Federated Learning by Hiding in the Neighbors
Self-Cooperation Knowledge Distillation for Novel Class Discovery
SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?
A Comparative Study of Image Restoration Networks for General Backbone Network Design
HoloADMM: High-Quality Holographic Complex Field Recovery
Synthesizing Time-varying BRDFs via Latent Space
Fundamental Matrix Estimation Using Relative Depths
MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise
Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis
DataDream: Few-shot Guided Dataset Generation
LPViT: Low-Power Semi-structured Pruning for Vision Transformers
Weighted Ensemble Models Are Strong Continual Learners
GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time
Learning Equilibrium Transformation for Gamut Expansion and Color Restoration
Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation
Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift
Chains of Diffusion Models
Feature Diversification and Adaptation for Federated Domain Generalization
TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling
Dataset Distillation by Automatic Training Trajectories
RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes
RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection
Learning Neural Deformation Representation for 4D Dynamic Shape Generation
Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification
Self-Supervised Video Desmoking for Laparoscopic Surgery
Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining
Continuity Preserving Online CenterLine Graph Learning
Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping
MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections
Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection
AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking
HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos
Online Video Quality Enhancement with Spatial-Temporal Look-up Tables
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)
Leveraging Imperfect Restoration for Data Availability Attack
DoubleTake: Geometry Guided Depth Estimation
Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition
HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View
Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models
Möbius Transform for Mitigating Perspective Distortions in Representation Learning
TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection
CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting
Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking
Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
Region-Native Visual Tokenization
The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization
Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond
Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
Multi-modal Crowd Counting via a Broker Modality
FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
ViPer: Visual Personalization of Generative Models via Individual Preference Learning
MLPHand: Real Time Multi-View 3D Hand Reconstruction via MLP Modeling
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology
MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
Self-supervised visual learning from interactions with objects
OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation
BAFFLE: A Baseline of Backpropagation-Free Federated Learning
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
Diverse Text-to-3D Synthesis with Augmented Text Embedding
LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang
AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems
SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation
Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier
Enhanced Sparsification via Stimulative Training
Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network
FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models
Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation
WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding
Spiking Wavelet Transformer
WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing
PDT Uav Target Detection Dataset for Pests and Diseases Tree
Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection
COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
RANRAC: Robust Neural Scene Representations via Random Ray Consensus
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks
DQ-DETR: DETR with Dynamic Query for Tiny Object Detection
SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians
Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections
Few-shot Defect Image Generation based on Consistency Modeling
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Video Editing via Factorized Diffusion Distillation
Trackastra: Transformer-based cell tracking for live-cell microscopy
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM
Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation
GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring
Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring
ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion
Curved Diffusion: A Generative Model With Optical Geometry Control
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning
OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation
Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures
Conceptual Codebook Learning for Vision-Language Models
AnimateMe: 4D Facial Expressions via Diffusion Models
LingoQA: Video Question Answering for Autonomous Driving
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors
iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning
Context Diffusion: In-Context Aware Image Generation
Pose Guided Fine-Grained Sign Language Video Generation
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Certifiably Robust Image Watermark
Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
Online Zero-Shot Classification with CLIP
SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning
Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder
Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance
3D Reconstruction of Objects in Hands without Real World 3D Supervision
To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning
Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer
Optimization-based Uncertainty Attribution Via Learning Informative Perturbations
Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency
Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling
MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
Explorative Inbetweening of Time and Space
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control
Learning to Make Keypoints Sub-Pixel Accurate
Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images
Generalizable Human Gaussians for Sparse View Synthesis
Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off
GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction
AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation
PFedEdit: Personalized Federated Learning via Automated Model Editing
De-Confusing Pseudo-Labels in Source-Free Domain Adaptation
Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception
EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control
Photorealistic Video Generation with Diffusion Models
RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement
TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
Self-Supervised Audio-Visual Soundscape Stylization
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Source-Free Domain-Invariant Performance Prediction
Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Direct Distillation between Different Domains
GRiT: A Generative Region-to-text Transformer for Object Understanding
LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
Learning Representation for Multitask Learning through Self-Supervised Auxiliary Learning
Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending
Geometry Fidelity for Spherical Images
BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling
CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning
Free-Editor: Zero-shot Text-driven 3D Scene Editing
DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
Generalizable Symbolic Optimizer Learning
Tackling Structural Hallucination in Image Translation with Local Diffusion
Unified Medical Image Pre-training in Language-Guided Common Semantic Space
On the Vulnerability of Skip Connections to Model Inversion Attacks
Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector
Reinforcement Learning via Auxiliary Task Distillation
DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation
View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields
Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
Fairness-aware Vision Transformer via Debiased Self-Attention
Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update
SNP: Structured Neuron-level Pruning to Preserve Attention Scores
PALM: Predicting Actions through Language Models
Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Improving Hyperbolic Representations via Gromov-Wasserstein Regularization
VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG
DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose
Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense
Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation
PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation
PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control
LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation
Efficient Training with Denoised Neural Weights
Integration of Global and Local Representations for Fine-grained Cross-modal Alignment
Local and Global Flatness for Federated Domain Generalization
SRPose: Two-view Relative Pose Estimation with Sparse Keypoints
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement
Efficient Vision Transformers with Partial Attention
Generalized Coverage for More Robust Low-Budget Active Learning
Rasterized Edge Gradients: Handling Discontinuities Differentially
Kinetic Typography Diffusion Model
Enhancing Cross-Subject fMRI-to-Video Decoding with Global-Local Functional Alignment
ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video
Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection
Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion
Data Poisoning Quantization Backdoor Attack
T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
A high-quality robust diffusion framework for corrupted dataset
Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning
Distilling Knowledge from Large-Scale Image Models for Object Detection
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection
TimeLens-XL: Real-time Event-based Video Frame Interpolation with Large Motion
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets
Unsupervised Representation Learning by Balanced Self Attention Matching
Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging
Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation
Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation
SCOD: From Heuristics to Theory
Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
Teach CLIP to Develop a Number Sense for Ordinal Regression
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
Compact 3D Scene Representation via Self-Organizing Gaussian Grids
VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking
SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes
Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning
T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models
Towards Certifiably Robust Face Recognition
Linking in Style: Understanding learned features in deep learning models
Stable Video Portraits
CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks
Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator
SHIC: Shape-Image Correspondences with no Keypoint Supervision
Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection
Weight Conditioning for Smooth Optimization of Neural Networks
Energy-Calibrated VAE with Test Time Free Lunch
SceneTeller: Language-to-3D Scene Generation
MagMax: Leveraging Model Merging for Seamless Continual Learning
Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
Debiasing surgeon: fantastic weights and how to find them
Denoising Vision Transformers
Differentiable Product Quantization for Memory Efficient Camera Relocalization
Spline-based Transformers
Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion
SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data
TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly
Efficient NeRF Optimization - Not All Samples Remain Equally Hard
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
Catastrophic Overfitting: A Potential Blessing in Disguise
Adversarial Diffusion Distillation
Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection
Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
Text-Conditioned Resampler For Long Form Video Understanding
Using My Artistic Style? You Must Obtain My Authorization
Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation
UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration
Non-transferable Pruning
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis
Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning
Affine steerers for structured keypoint description
FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving
GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth
EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision
latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions
HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation
InstructGIE: Towards Generalizable Image Editing
Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models
Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
Towards Scene Graph Anticipation
Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
DeTra: A Unified Model for Object Detection and Trajectory Forecasting
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
Common Sense Reasoning for Deep Fake Detection
GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning
Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers
Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning
Learning Multimodal Latent Generative Models with Energy-Based Prior
Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution
Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable
CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting
Snuffy: Efficient Whole Slide Image Classifier
Learning to Build by Building Your Own Instructions
Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
CoTracker: It is Better to Track Together
Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Improving Text-guided Object Inpainting with Semantic Pre-inpainting
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
LISO: Lidar-only Self-Supervised 3D Object Detection
Frontier-enhanced Topological Memory with Improved Exploration Awareness for Embodied Visual Navigation
Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)
LookupViT: Compressing visual information to a limited number of tokens
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
REDIR: Refocus-free Event-based De-occlusion Image Reconstruction
Towards compact reversible image representations for neural style transfer
InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping
Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data
MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning
GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity
KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
Better Regression Makes Better Test-time Adaptive 3D Object Detection
Temporally Consistent Stereo Matching
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy
ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Benchmarking Spurious Bias in Few-Shot Image Classifiers
Deep Companion Learning: Enhancing Generalization Through Historical Consistency
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
Straightforward Layer-wise Pruning for More Efficient Visual Adaptation
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting
CrossScore: A Multi-View Approach to Image Evaluation and Scoring
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation
DiffClass: Diffusion-Based Class Incremental Learning
Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning
PromptFusion: Decoupling Stability and Plasticity for Continual Learning
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
PFGS: High Fidelity Point Cloud Rendering via Feature Splatting
DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning
Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking
Data Augmentation via Latent Diffusion for Saliency Prediction
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
3D Gaussian Parametric Head Model
Dynamic Neural Radiance Field From Defocused Monocular Video
Retargeting Visual Data with Deformation Fields
Ray-Distance Volume Rendering for Neural Scene Reconstruction
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation
Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction
Sur^2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images
ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
Realistic Human Motion Generation with Cross-Diffusion Models
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
Continuous Memory Representation for Anomaly Detection
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
Diffusion Reward: Learning Rewards via Conditional Video Diffusion
Efficient Depth-Guided Urban View Synthesis
OneRestore: A Universal Restoration Framework for Composite Degradation
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
Beyond MOT: Semantic Multi-Object Tracking
PartCraft: Crafting Creative Objects by Parts
WordRobe: Text-Guided Generation of Textured 3D Garments
ZeST: Zero-Shot Material Transfer from a Single Image
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework
Online Continuous Generalized Category Discovery
AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning
KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter
MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
MC-PanDA: Mask Confidence for Panoptic Domain Adaptation
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting
Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation
Rethinking Few-shot Class-incremental Learning: Learning from Yourself
VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement
STSP: Spatial-Temporal Subspace Projection for Video Class-incremental Learning
Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
Long-CLIP: Unlocking the Long-Text Capability of CLIP
RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF
Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation
FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors
MVDD: Multi-View Depth Diffusion Models
Dataset Quantization with Active Learning based Adaptive Sampling
Interpretability-Guided Test-Time Adversarial Defense
Self-Supervised Representation Learning for Adversarial Attack Detection
GroundUp: Rapid Sketch-Based 3D City Massing
Photon Inhibition for Energy-Efficient Single-Photon Imaging
CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning
Learning with Counterfactual Explanations for Radiology Report Generation
Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation
Wavelet Convolutions for Large Receptive Fields
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer
Gradient-based Out-of-Distribution Detection
Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs
Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration
Data-to-Model Distillation: Data-Efficient Learning Framework
Simple Unsupervised Knowledge Distillation With Space Similarity
3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
DSMix: Distortion-Induced Saliency Map Based Pre-training for No-Reference Image Quality Assessment
Learning Natural Consistency Representation for Face Forgery Video Detection
DragVideo: Interactive Drag-style Video Editing
Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging
One-Shot Diffusion Mimicker for Handwritten Text Generation
Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning
Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning
FunQA: Towards Surprising Video Comprehension
Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network
UpFusion: Novel View Diffusion from Unposed Sparse View Observations
EDformer: Transformer-Based Event Denoising Across Varied Noise Levels
UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation
View-Consistent 3D Editing with Gaussian Splatting
Few-shot NeRF by Adaptive Rendering Loss Regularization
HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes
FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis
Generating Human Interaction Motions in Scenes with Text Control
Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging
MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing
MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping
FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting
Instruction Tuning-free Visual Token Complement for Multimodal LLMs
Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance
Pyramid Diffusion for Fine 3D Large Scene Generation
Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts
MotionChain: Conversational Motion Controllers via Multimodal Prompts
Synthesizing Environment-Specific People in Photographs
Open-World Dynamic Prompt and Continual Visual Representation Learning
Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning
Customized Generation Reimagined: Fidelity and Editability Harmonized
HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation
Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer
Co-speech Gesture Video Generation with 3D Human Meshes
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation
From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection
Revisit Self-supervision with Local Structure-from-Motion
On the Viability of Monocular Depth Pre-training for Semantic Segmentation
Weakly-supervised Camera Localization by Ground-to-satellite Image Registration
NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation
Latent Guard: a Safety Framework for Text-to-image Generation
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection
ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers
ProtoComp: Diverse Point Cloud Completion with Controllable Prototype
FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation
Physical-Based Event Camera Simulator
Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
Implicit Steganography Beyond the Constraints of Modality
Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos
Volumetric Rendering with Baked Quadrature Fields
Flying with Photons: Rendering Novel Views of Propagating Light
LivePhoto: Real Image Animation with Text-guided Motion Control
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
High-Fidelity and Transferable NeRF Editing by Frequency Decomposition
Implicit Style-Content Separation using B-LoRA
Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems
OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal
Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration
Understanding Multi-compositional learning in Vision and Language models via Category Theory
Animate Your Motion: Turning Still Images into Dynamic Videos
Spatial-Temporal Multi-level Association for Video Object Segmentation
Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance
CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection
Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast
NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
Face Reconstruction Transfer Attack as Out-of-Distribution Generalization
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
UniProcessor: A Text-induced Unified Low-level Image Processor
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
Tokenize Anything via Prompting
Visual Alignment Pre-training for Sign Language Translation
GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering
Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection
Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Arc2Face: A Foundation Model for ID-Consistent Human Faces
Let the Avatar Talk using Texts without Paired Training Data
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement
Learning Camouflaged Object Detection from Noisy Pseudo Label
PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition
SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
Attention Beats Linear for Fast Implicit Neural Representation Generation
WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation
Timestep-Aware Correction for Quantized Diffusion Models
LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models
Prompt-Based Test-Time Real Image Dehazing: A Novel Pipeline
RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning
FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning
Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge
Emerging Property of Masked Token for Effective Pre-training
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
Gaussian Grouping: Segment and Edit Anything in 3D Scenes
3D Hand Sequence Recovery from Real Blurry Images and Event Stream
Sapiens: Foundation for Human Vision Models
Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model
SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images
ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination
SAM-guided Graph Cut for 3D Instance Segmentation
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation
Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion
TLControl: Trajectory and Language Control for Human Motion Synthesis
StructLDM: Structured Latent Diffusion for 3D Human Generation
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
High-Fidelity Modeling of Generalizable Wrinkle Deformation
COMPOSE: Comprehensive Portrait Shadow Editing
GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations
Learning Representations from Foundation Models for Domain Generalized Stereo Matching
Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization
PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration
MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
NeRF-XL: NeRF at Any Scale with Multi-GPU
NOVUM: Neural Object Volumes for Robust Object Classification
De-confounded Gaze Estimation
3D Hand Pose Estimation in Everyday Egocentric Images
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Controllable Human-Object Interaction Synthesis
Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild
Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
Zero-Shot Multi-Object Scene Completion
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
Personalized Video Relighting With an At-Home Light Stage
Six-Point Method for Multi-Camera Systems with Reduced Solution Space
UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation
Tuning-Free Image Customization with Image and Text Guidance
Stripe Observation Guided Inference Cost-free Attention Mechanism
MegaScenes: Scene-Level View Synthesis at Scale
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation
SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
Non-parametric Sensor Noise Modeling and Synthesis
Learned Image Enhancement via Color Naming
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
Navigating Text-to-Image Generative Bias across Indic Languages
Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
LCM-Lookahead for Encoder-based Text-to-Image Personalization
COIN-Matting: Confounder Intervention for Image Matting
GaussReg: Fast 3D Registration with Gaussian Splatting
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Score Distillation Sampling with Learned Manifold Corrective
WAS: Dataset and Methods for Artistic Text Segmentation
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
BAMM: Bidirectional Autoregressive Motion Model
AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution
Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids
Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising
ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization
Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering
Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition
Agent Attention: On the Integration of Softmax and Linear Attention
Fine-grained Dynamic Network for Generic Event Boundary Detection
Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution
Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction
VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network
GraspXL: Generating Grasping Motions for Diverse Objects at Scale
Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Audio-visual Generalized Zero-shot Learning the Easy Way
Pre-trained Visual Dynamics Representations for Efficient Policy Learning
Reinforcement Learning Friendly Vision-Language Model for Minecraft
GRAPE: Generalizable and Robust Multi-view Facial Capture
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures
SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models
Structured-NeRF: Hierarchical Scene Graph with Neural Representation
MetaWeather: Few-Shot Weather-Degraded Image Restoration
Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints
PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion
Learning to Unlearn for Robust Machine Unlearning
Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning
MinD-3D: Reconstruct High-quality 3D objects in Human Brain
Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits
Visual Text Generation in the Wild
Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement
E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation
A Unified Image Compression Method for Human Perception and Multiple Vision Tasks
Diffusion for Natural Image Matting
Eliminating Warping Shakes for Unsupervised Online Video Stitching
FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
Facial Affective Behavior Analysis with Instruction Tuning
Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection
Learning Quantized Adaptive Conditions for Diffusion Models
Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation
Discovering Unwritten Visual Classifiers with Large Language Models
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
GenQ: Quantization in Low Data Regimes with Generative Synthetic Data
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception
DATENeRF: Depth-Aware Text-based Editing of NeRFs
Soft Prompt Generation for Domain Generalization
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
On the Approximation Risk of Few-Shot Class-Incremental Learning
Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Fast Encoding and Decoding for Implicit Video Representation
SAIR: Learning Semantic-aware Implicit Representation
Just a Hint: Point-Supervised Camouflaged Object Detection
Rethinking Normalization Layers for Domain Generalizable Person Re-identification
URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
Efficient Cascaded Multiscale Adaptive Network for Image Restoration
ConGeo: Robust Cross-view Geo-localization across Ground View Variations
Learning to Drive via Asymmetric Self-Play
Event-based Mosaicing Bundle Adjustment
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model
MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos
V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation
OmniSat: Self-Supervised Modality Fusion for Earth Observation
WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation
TriNeRFLet: A Wavelet Based Triplane NeRF Representation
Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer
milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing
Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
Toward Tiny and High-quality Facial Makeup with Data Amplify Learning
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Bidirectional Progressive Transformer for Interaction Intention Anticipation
Semantic-guided Robustness Tuning for Few-Shot Transfer Across Extreme Domain Shift
SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow
Domain Reduction Strategy for Non-Line-of-Sight Imaging
Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging
EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning
FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
Vamos: Versatile Action Models for Video Understanding
A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data
CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering
CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
Open-Vocabulary RGB-Thermal Semantic Segmentation
RaFE: Generative Radiance Fields Restoration
denoiSplit: a method for joint microscopy image splitting and unsupervised denoising
UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening
Efficient Neural Video Representation with Temporally Coherent Modulation
Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution
Unsupervised Moving Object Segmentation with Atmospheric Turbulence
Modeling Label Correlations with Latent Context for Multi-Label Recognition
Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos
WindPoly: Polygonal Mesh Reconstruction via Winding Numbers
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
Towards Reliable Advertising Image Generation Using Human Feedback
Distributionally Robust Loss for Long-Tailed Multi-Label Image Classification
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks
TurboEdit: Real-time text-based disentangled real image editing
The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples
Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery
Harmonizing knowledge Transfer in Neural Network with Unified Distillation
MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection
Clean & Compact: Efficient Data-Free Backdoor Defense with Model Compactness
Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation
EraseDraw: Learning to Insert Objects by Erasing Them from Images
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Attention Prompting on Image for Large Vision-Language Models
ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration
Personalized Privacy Protection Mask Against Unauthorized Facial Recognition
Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization
A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image
HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models
A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
CipherDM: Secure Three-Party Inference for Diffusion Model Sampling
How to Train the Teacher Model for Effective Knowledge Distillation
LineFit: A Geometric Approach for Fitting Line Segments in Images
CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization
Global Counterfactual Directions
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
ChEX: Interactive Localization and Region Description in Chest X-rays
MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration
Grounding Image Matching in 3D with MASt3R
COSMU: Complete 3D human shape from monocular unconstrained images
LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation
Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels
Adaptive Human Trajectory Prediction via Latent Corridors
Generalizable Facial Expression Recognition
RS-NeRF: Neural Radiance Fields from Rolling Shutter Images
MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain
Do Generalised Classifiers really work on Human Drawn Sketches?
Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures
Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning
Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection
GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views
Enhanced Motion Forecasting with Visual Relation Reasoning
Multi-scale Cross Distillation for Object Detection in Aerial Images
Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
DSA: Discriminative Scatter Analysis for Early Smoke Segmentation
Long-term Temporal Context Gathering for Neural Video Compression
DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences
Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis
SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation
MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks
Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing
Norface: Improving Facial Expression Analysis by Identity Normalization
Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Bucketed Ranking-based Losses for Efficient Training of Object Detectors
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification
Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM
Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
The Nerfect Match: Exploring NeRF Features for Visual Localization
SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization
Image Manipulation Detection With Implicit Neural Representation and Limited Supervision
Adapting to Shifting Correlations with Unlabeled Data Calibration
SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients
Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model
Image-to-Lidar Relational Distillation for Autonomous Driving Data
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
Learning-based Axial Video Motion Magnification
IGNORE: Information Gap-based False Negative Loss Rejection for Single Positive Multi-Label Learning
Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization
AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset
RegionDrag: Fast Region-Based Image Editing with Diffusion Models
FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
Siamese Vision Transformers are Scalable Audio-visual Learners
Rectify the Regression Bias in Long-Tailed Object Detection
Learning Neural Volumetric Pose Features for Camera Localization
Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
Visual Relationship Transformation
Scene-aware Human Motion Forecasting via Mutual Distance Prediction
Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
Visible and Clear: Finding Tiny Objects in Difference Map
Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs
Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
Temporal-Mapping Photography for Event Cameras
RGBD GS-ICP SLAM
SAVE: Protagonist Diversification with Structure Agnostic Video Editing
Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias
Federated Learning with Local Openset Noisy Labels
Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching
End-to-End Rate-Distortion Optimized 3D Gaussian Representation
Multistain Pretraining for Slide Representation Learning in Pathology
Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning
Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation
GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes
3R-INN: How to be climate friendly while consuming/delivering videos?
ADMap: Anti-disturbance Framework for Vectorized HD Map Construction
GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields
OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
Self-supervised co-salient object detection via feature correspondences at multiple scales
Improving Knowledge Distillation via Regularizing Feature Direction and Norm
DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Reinforcement Learning Meets Visual Odometry
PoseSOR: Human Pose Can Guide Our Attention
Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning
SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models
Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo
Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken
Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data
Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers
HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation
LITA: Language Instructed Temporal-Localization Assistant
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks
BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow
MEVG: Multi-event Video Generation with Text-to-Video Models
Unsupervised Dense Prediction using Differentiable Normalized Cuts
RPBG: Towards Robust Neural Point-based Graphics in the Wild
uCAP: An Unsupervised Prompting Method for Vision-Language Models
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-time Adaptation Framework
Placing Objects in Context via Inpainting for Out-of-distribution Segmentation
Efficient Frequency-Domain Image Deraining with Contrastive Regularization
EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset
Deep Cost Ray Fusion for Sparse Depth Video Completion
Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation
SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning
Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps
Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment
Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification
Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models
SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification
Noise-assisted Prompt Learning for Image Forgery Detection and Localization
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection
Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
An accurate detection is not all you need to combat label noise in web-noisy datasets
Self-Supervised Video Copy Localization with Regional Token Representation
Crowd-SAM: SAM as a smart annotator for object detection in crowded scenes
Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses
Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
Zero-shot Object Counting with Good Exemplars
On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition
Synergy of Sight and Semantics: Visual Intention Understanding with CLIP
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians
Single-Photon 3D Imaging with Equi-Depth Photon Histograms
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
Anytime Continual Learning for Open Vocabulary Classification
Gated Temporal Diffusion for Stochastic Long-term Dense Anticipation
Domain Generalization of 3D Object Detection by Density-Resampling
Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects
Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination
SINDER: Repairing the Singular Defects of DINOv2
Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians
Revisiting Domain-Adaptive Object Detection in Adverse Weather by the Generation and Composition of High-Quality Pseudo-Labels
Open-Set Recognition in the Age of Vision-Language Models
Two-Stage Active Learning for Efficient Temporal Action Segmentation
High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior
ARoFace: Alignment Robustness to Improve Low-quality Face Recognition
Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation
Finding a needle in a haystack: A Black-Box Approach to Invisible Watermark Detection
Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images
CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation
Faceptor: A Generalist Model for Face Perception
Shapefusion: 3D localized human diffusion models
LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images
Training A Secure Model against Data-Free Model Extraction
VeCLIP: Improving CLIP Training via Visual-enriched Captions
Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
Neural Metamorphosis
Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
When Fast Fourier Transform Meets Transformer for Image Restoration
DGD: Dynamic 3D Gaussians Distillation
OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection
Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation
HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation
Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning
Learning Cross-hand Policies of High-DOF Reaching and Grasping
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion
Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model
Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution
Keypoint Promptable Re-Identification
When and How do negative prompts take effect?
Rethinking Features-Fused-Pyramid-Neck for Object Detection
Training A Small Emotional Vision Language Model for Visual Art Comprehension
Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes
Learned HDR Image Compression for Perceptually Optimal Storage and Display
FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
On the Topology Awareness and Generalization Performance of Graph Neural Networks
Accelerating Image Super-Resolution Networks with Pixel-Level Classification
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras
Compensation Sampling for Improved Convergence in Diffusion Models
Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting
SkyScenes: A Synthetic Dataset for Aerial Scene Understanding
RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
KeypointDETR: An End-to-End 3D Keypoint Detector
CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion
Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes
IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with motion complexity map
Implicit Neural Models to Extract Heart Rate from Video
Self-Supervised Any-Point Tracking by Contrastive Random Walks
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach
OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions
Statewide Visual Geolocalization in the Wild
Deblurring 3D Gaussian Splatting
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image
Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt
Decomposition Betters Tracking Everything Everywhere
R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding
Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
Controlling the World by Sleight of Hand
Pseudo-Labelling Should Be Aware of Disguising Channel Activations
Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization