VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video

Yu Liu1,2, Baoxiong Jia2*†, Ruijie Lu2,3, Chuyue Gan1, Huayu Chen1, Junfeng Ni1,2,
Song-Chun Zhu1,2,3, Siyuan Huang2†

* Project Lead    † Corresponding Author

1Tsinghua University    2State Key Laboratory of General Artificial Intelligence (BIGAI)    3Peking University

Reconstruction from Monocular Video

Capturing video with any mobile phone

More Results

Abstract

Building digital twins of articulated objects from monocular video is an essential challenge in computer vision, requiring the simultaneous reconstruction of object geometry, part segmentation, and articulation parameters from limited viewpoints. Monocular video is an attractive input format due to its simplicity and scalability; however, disentangling object geometry from part dynamics is difficult with visual supervision alone, since the joint movement of the camera and the parts makes the estimation ill-posed. While motion priors from pre-trained tracking models can alleviate this issue, how to effectively integrate them for articulation learning remains largely unexplored. To address this problem, we introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. We propose a motion prior guidance pipeline that analyzes 3D tracks, filters out noise, and provides reliable initialization of articulation parameters. We also design a hybrid center-grid part assignment module for the articulation-based deformation field that captures accurate part motion. VideoArtGS achieves state-of-the-art performance in articulation and mesh reconstruction, reducing reconstruction error by about two orders of magnitude compared to existing methods. VideoArtGS enables practical digital-twin creation from monocular video, establishing a new benchmark for video-based articulated object reconstruction.
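To make the role of motion priors concrete, the sketch below shows one standard way to turn a group of 3D tracks into an initial joint estimate: fit a rigid transform between two frames with the Kabsch algorithm, then read a revolute axis off the rotation or a prismatic direction off the translation. This is a minimal illustration of track-based initialization in general; the function names, the 5-degree threshold, and the first-versus-last-frame choice are assumptions, not VideoArtGS's actual pipeline.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): dst ~ R @ src + t, for (N, 3) arrays."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def init_joint_from_tracks(tracks):
    """tracks: (T, N, 3) positions of one part's 3D tracks over T frames.
    Returns a coarse (joint_type, axis, magnitude) guess for initialization."""
    R, t = rigid_transform(tracks[0], tracks[-1])  # first vs. last frame
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle > np.deg2rad(5.0):                    # clear rotation -> revolute
        # rotation axis = eigenvector of R with eigenvalue 1
        w, V = np.linalg.eig(R)
        axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
        return "revolute", axis / np.linalg.norm(axis), angle
    # near-pure translation -> prismatic
    axis = t / (np.linalg.norm(t) + 1e-8)
    return "prismatic", axis, float(np.linalg.norm(t))
```

Such an estimate only needs to be roughly correct: it serves as the initialization of the articulation parameters, which are subsequently refined under the tracking and rendering losses.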

Method Overview

Overview of VideoArtGS. Given video frames, we first use VGGT to estimate depth maps and camera poses, and then use TAPIP3D to obtain 3D tracks. A motion prior guidance pipeline analyzes and groups these tracks, initializing the articulation-based deformation field with motion information and optimizing it with a tracking loss. Finally, we reconstruct the object with 3D Gaussians and the deformation field, jointly optimizing both modules with rendering and tracking losses.
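To illustrate what a hybrid center-grid part assignment might look like, the PyTorch sketch below blends two sources of per-point part logits: squared distances to learnable part centers and logits trilinearly sampled from a dense voxel grid. The class name, grid resolution, and additive fusion of the two branches are illustrative assumptions; the paper's actual module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPartAssignment(nn.Module):
    """Illustrative sketch (not the paper's code) of a hybrid center-grid
    part assignment: soft part weights combine distances to learnable part
    centers with logits sampled from a dense grid over the object volume."""

    def __init__(self, num_parts: int, grid_res: int = 32):
        super().__init__()
        self.centers = nn.Parameter(0.1 * torch.randn(num_parts, 3))
        # per-part logit grid over a normalized [-1, 1]^3 volume
        self.grid = nn.Parameter(
            torch.zeros(1, num_parts, grid_res, grid_res, grid_res))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        """xyz: (N, 3) Gaussian centers, normalized to [-1, 1]^3.
        Returns (N, P) soft part-assignment weights."""
        # center branch: closer centers get larger logits
        center_logits = -torch.cdist(xyz, self.centers) ** 2      # (N, P)
        # grid branch: trilinear interpolation of per-part logits
        g = xyz.view(1, -1, 1, 1, 3)     # grid_sample expects (1, D, H, W, 3)
        grid_logits = F.grid_sample(self.grid, g, align_corners=True)
        grid_logits = grid_logits.view(self.grid.shape[1], -1).T  # (N, P)
        return torch.softmax(center_logits + grid_logits, dim=-1)

# hypothetical usage: weights = HybridPartAssignment(num_parts=3)(gaussian_xyz)
```

Each Gaussian would then be deformed by blending the per-part rigid transforms predicted from the joint parameters at each timestep, with the assignment weights trained jointly under the rendering and tracking losses.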

Rendering Results of Reconstructed Objects



Simulation in Isaac Sim

Left: synthetic objects. Right: real-world objects.





Interactable Meshes of Complex Synthetic Objects

*Drag the mouse to rotate; scroll the wheel to zoom in/out


Interactable Meshes of Complex Real-world Objects

*Drag the mouse to rotate; scroll the wheel to zoom in/out


Comparisons with Existing Methods



Two-part Objects



Multi-part Objects



BibTeX