VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video

Yu Liu1,2, Baoxiong Jia2*†, Ruijie Lu2,3, Chuyue Gan1, Huayu Chen1, Junfeng Ni1,2,
Song-Chun Zhu1,2,3, Siyuan Huang2†

* Project Lead    † Corresponding Author

1Tsinghua University    2State Key Laboratory of General Artificial Intelligence (BIGAI)    3Peking University

Reconstruction from Monocular Video

Capturing video with any mobile phone

More Results

Abstract

Building digital twins of articulated objects from monocular video is an essential challenge in computer vision, requiring the simultaneous reconstruction of object geometry, part segmentation, and articulation parameters from limited viewpoints. Monocular video is an attractive input format due to its simplicity and scalability; however, disentangling object geometry from part dynamics is difficult with visual supervision alone, since the joint movement of the camera and the parts makes the estimation ill-posed. While motion priors from pre-trained tracking models can alleviate this issue, how to effectively integrate them for articulation learning remains largely unexplored. To address this problem, we introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. We propose a motion prior guidance pipeline that analyzes 3D tracks, filters out noise, and provides reliable initialization of articulation parameters. We also design a hybrid center-grid part assignment module for the articulation-based deformation field that captures accurate part motion. VideoArtGS achieves state-of-the-art performance in articulation and mesh reconstruction, reducing reconstruction error by about two orders of magnitude compared to existing methods. VideoArtGS enables practical digital-twin creation from monocular video, establishing a new benchmark for video-based articulated object reconstruction.
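To make the role of motion priors concrete, the sketch below shows one standard way to turn a group of 3D tracks into an initial joint estimate: fit a rigid transform between two frames with the Kabsch algorithm, then read a revolute axis off the rotation or a prismatic direction off the translation. This is a minimal illustration of track-based initialization in general; the function names, the 5-degree threshold, and the first-versus-last-frame choice are assumptions, not VideoArtGS's actual pipeline.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): dst ~ R @ src + t, for (N, 3) arrays."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def init_joint_from_tracks(tracks):
    """tracks: (T, N, 3) positions of one part's 3D tracks over T frames.
    Returns a coarse (joint_type, axis, magnitude) guess for initialization."""
    R, t = rigid_transform(tracks[0], tracks[-1])  # first vs. last frame
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle > np.deg2rad(5.0):                    # clear rotation -> revolute
        # rotation axis = eigenvector of R with eigenvalue 1
        w, V = np.linalg.eig(R)
        axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
        return "revolute", axis / np.linalg.norm(axis), angle
    # near-pure translation -> prismatic
    axis = t / (np.linalg.norm(t) + 1e-8)
    return "prismatic", axis, float(np.linalg.norm(t))
```

Such an estimate only needs to be roughly correct: it serves as the initialization of the articulation parameters, which are subsequently refined under the tracking and rendering losses.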

Method Overview

Overview of VideoArtGS. Given video frames, we first use VGGT to estimate depth maps and camera poses, and then use TAPIP3D to obtain 3D tracks. A motion prior guidance pipeline analyzes and groups these tracks, initializing the articulation-based deformation field with motion information and optimizing it with a tracking loss. Finally, we reconstruct the object with 3D Gaussians and the deformation field, jointly optimizing both modules with rendering and tracking losses.
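To illustrate what a hybrid center-grid part assignment might look like, the PyTorch sketch below blends two sources of per-point part logits: squared distances to learnable part centers and logits trilinearly sampled from a dense voxel grid. The class name, grid resolution, and additive fusion of the two branches are illustrative assumptions; the paper's actual module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPartAssignment(nn.Module):
    """Illustrative sketch (not the paper's code) of a hybrid center-grid
    part assignment: soft part weights combine distances to learnable part
    centers with logits sampled from a dense grid over the object volume."""

    def __init__(self, num_parts: int, grid_res: int = 32):
        super().__init__()
        self.centers = nn.Parameter(0.1 * torch.randn(num_parts, 3))
        # per-part logit grid over a normalized [-1, 1]^3 volume
        self.grid = nn.Parameter(
            torch.zeros(1, num_parts, grid_res, grid_res, grid_res))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        """xyz: (N, 3) Gaussian centers, normalized to [-1, 1]^3.
        Returns (N, P) soft part-assignment weights."""
        # center branch: closer centers get larger logits
        center_logits = -torch.cdist(xyz, self.centers) ** 2      # (N, P)
        # grid branch: trilinear interpolation of per-part logits
        g = xyz.view(1, -1, 1, 1, 3)     # grid_sample expects (1, D, H, W, 3)
        grid_logits = F.grid_sample(self.grid, g, align_corners=True)
        grid_logits = grid_logits.view(self.grid.shape[1], -1).T  # (N, P)
        return torch.softmax(center_logits + grid_logits, dim=-1)

# hypothetical usage: weights = HybridPartAssignment(num_parts=3)(gaussian_xyz)
```

Each Gaussian would then be deformed by blending the per-part rigid transforms predicted from the joint parameters at each timestep, with the assignment weights trained jointly under the rendering and tracking losses.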

Rendering Results of Reconstructed Objects



Simulation in Isaac Sim

Left: synthetic objects. Right: real-world objects.





Interactable Meshes of Complex Synthetic Objects

*Drag the mouse to rotate; scroll the wheel to zoom in/out


Interactable Meshes of Complex Real-world Objects

*Drag the mouse to rotate; scroll the wheel to zoom in/out


Comparisons with Existing Methods



Two-part Objects



Multi-part Objects



BibTeX