LAM: Language Articulated Object Modelers

CVPR 2026

Yipeng Gao, Yunhao Ge, Peilin Cai, Daniel Seita, Laurent Itti

University of Southern California

LAM generates articulated 3D objects directly from text — co-designing geometry and articulation as a single, interpretable code representation, with no visual prior or pre-built 3D assets.

LAM overall framework: text prompt to hierarchical structure to code to articulated object.

Overall framework. From a text prompt, LAM designs a hierarchical structure, then a team of LLM/VLM agents iteratively writes, debugs, and refines code for both geometry and articulation.

Try It: Interactive Articulated Objects

These are real objects generated by LAM, exported to URDF and rendered live in your browser. Drag to rotate, scroll to zoom, and use the joint sliders to articulate each model. Use the arrows to browse examples.

Abstract

We introduce LAM, a system that explores the collaboration between large language models and vision-language models to generate articulated objects from text prompts without a visual prior or pre-built 3D assets. We formulate articulated object generation as a unified code-generation task, in which geometry and articulation are co-designed from scratch.

Given an input text prompt, LAM coordinates a team of specialized modules to procedurally generate code that represents the desired articulated object. It first reasons about the hierarchical structure of parts (links) with a Link Designer, then writes, compiles, and debugs code with Geometry & Articulation Coders, and self-corrects via Geometry & Articulation Checkers. The code serves as a structured, interpretable bridge between individual links and keeps the relationships among them correct.

Experiments demonstrate the power of leveraging code as a generative medium within a collaborative system, showcasing its effectiveness in automatically constructing complex articulated objects.

77.1% joint-prediction success (vs. 40.3% for the best baseline)

84.6% human preference on General classes

91.7% user preference on Open-World classes

LAMBench: 2K text–code–object pairs

Method

Prior work relies on structured inputs (images, videos, graphs, or meshes) and scales poorly to objects with many parts. LAM instead unifies the coupled problems of geometry and articulation in a single, expressive code representation. Because the cost of the code is largely independent of geometric resolution, LAM can describe complex objects with a large, variable number of links, e.g. a keyboard with 20+ keys.
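To illustrate why a code representation copes well with high part counts, here is a minimal Python sketch in which a short loop emits one link per key. The function name `make_keyboard_links`, the dictionary layout, and all dimensions are assumptions made for illustration; they are not LAM's actual output format.

```python
def make_keyboard_links(rows=3, cols=10, key_size=0.018, gap=0.002):
    """Return one base link plus rows * cols key links (hypothetical layout)."""
    pitch = key_size + gap
    width, depth = cols * pitch, rows * pitch
    # Base plate, 1 cm thick, centred at z = 0.
    links = [{
        "name": "base",
        "size": (width, depth, 0.01),
        "position": (width / 2, depth / 2, 0.0),
    }]
    for r in range(rows):
        for c in range(cols):
            links.append({
                "name": f"key_{r}_{c}",
                "size": (key_size, key_size, 0.005),
                # Centre each key on a regular grid, resting on top of the base.
                "position": ((c + 0.5) * pitch, (r + 0.5) * pitch, 0.0075),
            })
    return links


links = make_keyboard_links()
print(len(links))  # 31: one base plus 30 keys
```

Changing `rows` or `cols` changes the number of links without changing the length of the program, which is the property the paragraph above relies on.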

Link Designer

An LLM that decomposes the prompt into a hierarchy of shapes → parts → links and their relationships.
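One way to picture the hierarchy is as nested records. The sketch below assumes that shapes compose parts and parts compose links; the class names `Part` and `Link`, the `validate` helper, and the cabinet example are hypothetical and not taken from LAM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Part:
    name: str
    shapes: List[str]  # primitive shapes, e.g. "box" or "cylinder"


@dataclass
class Link:
    name: str
    parts: List[Part]
    parent: Optional[str] = None  # name of the parent link; None marks the root


def validate(links):
    """Return a list of problems: missing parents or a wrong number of roots."""
    names = {link.name for link in links}
    problems = [
        f"{link.name}: unknown parent {link.parent}"
        for link in links
        if link.parent is not None and link.parent not in names
    ]
    roots = [link.name for link in links if link.parent is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root link, found {len(roots)}")
    return problems


cabinet = [
    Link("body", [Part("carcass", ["box"])]),
    Link("drawer", [Part("tray", ["box"]), Part("handle", ["cylinder"])], parent="body"),
    Link("door", [Part("panel", ["box"]), Part("knob", ["sphere"])], parent="body"),
]
print(validate(cabinet))  # [] when the hierarchy is consistent
```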

Geometry Coder

Translates the link layout into executable code, composing parametric primitives (e.g. Three.js geometries) into each link's mesh and pose.
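As a rough illustration of composing primitives into a link, the sketch below builds a drawer from two boxes and computes the link's axis-aligned bounds after placing it. It is written in Python rather than Three.js, handles translation only (rotation is omitted), and the names `Box` and `link_bounds` and all dimensions are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box:
    size: Vec3    # full extents along x, y, z
    center: Vec3  # box centre in the link's local frame


def link_bounds(boxes, link_position=(0.0, 0.0, 0.0)):
    """Axis-aligned bounds of a link built from boxes, after translating the link."""
    lo = [float("inf")] * 3
    hi = [float("-inf")] * 3
    for box in boxes:
        for i in range(3):
            centre = box.center[i] + link_position[i]
            lo[i] = min(lo[i], centre - box.size[i] / 2)
            hi[i] = max(hi[i], centre + box.size[i] / 2)
    return tuple(lo), tuple(hi)


drawer = [
    Box(size=(0.40, 0.30, 0.10), center=(0.0, 0.0, 0.0)),    # tray
    Box(size=(0.10, 0.02, 0.02), center=(0.0, -0.16, 0.0)),  # handle on the front face
]
lo, hi = link_bounds(drawer, link_position=(0.0, 0.0, 0.5))
print(lo, hi)
```

Bounds like these are one simple input a later check could use, for example to test whether two links overlap.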

Articulation Coder

Generates joint code — type, parent–child hierarchy, position, and motion axis — via a Joint Assembly Solver.
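Since the generated objects are exported to URDF, the four joint attributes listed above map naturally onto a URDF `<joint>` element. The sketch below builds one with Python's standard `xml.etree.ElementTree`; the helper name `urdf_joint`, the example values, and the fixed effort and velocity limits are assumptions, and it does not implement the Joint Assembly Solver.

```python
import xml.etree.ElementTree as ET


def urdf_joint(name, joint_type, parent, child, xyz, axis, lower, upper):
    """Build a URDF <joint> element for a revolute or prismatic joint."""
    joint = ET.Element("joint", name=name, type=joint_type)
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    # Joint position, expressed in the parent link's frame.
    ET.SubElement(joint, "origin", xyz=" ".join(map(str, xyz)), rpy="0 0 0")
    # Motion axis, expressed in the joint frame.
    ET.SubElement(joint, "axis", xyz=" ".join(map(str, axis)))
    # URDF requires a <limit> for revolute and prismatic joints;
    # the effort and velocity values here are placeholders.
    ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                  effort="10", velocity="1")
    return joint


drawer_slide = urdf_joint(
    name="body_to_drawer", joint_type="prismatic",
    parent="body", child="drawer",
    xyz=(0.0, 0.0, 0.5), axis=(0, 1, 0), lower=0.0, upper=0.25,
)
print(ET.tostring(drawer_slide, encoding="unicode"))
```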

Debuggers

Deterministic Python/JS checks that fix syntax and code-level errors before rendering.
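A deterministic check of this kind can be as simple as asking the language's own parser whether the generated code is well formed. The sketch below covers only detection for Python source, using the standard `ast` module; the function name `check_python_syntax` is hypothetical, and the repair step that LAM's debuggers perform is not shown.

```python
import ast


def check_python_syntax(source):
    """Return None if the source parses, otherwise a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


good = "width = 0.4\ndepth = 0.3\n"
bad = "width = (0.4\n"  # unbalanced parenthesis

print(check_python_syntax(good))
print(check_python_syntax(bad))
```

Because the result does not depend on a model call, the same input always yields the same verdict, and the message can be passed back to the coder as feedback.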

Geometry & Articulation Checkers

2D and 3D VLMs render and analyze the design, giving targeted feedback for closed-loop refinement.
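The closed loop between a coder and a checker can be sketched as follows. This is an illustrative outline, not LAM's implementation: `refine`, `toy_coder`, and `toy_checker` are hypothetical names, the stubs stand in for LLM and VLM calls, and the round limit is an assumption.

```python
def refine(prompt, coder, checker, max_rounds=5):
    """Alternate coder and checker until the checker accepts or rounds run out.

    Returns (code, rounds_used, accepted).
    """
    code, feedback = None, None
    for round_idx in range(1, max_rounds + 1):
        code = coder(prompt, code, feedback)
        accepted, feedback = checker(code)
        if accepted:
            return code, round_idx, True
    return code, max_rounds, False


def toy_coder(prompt, previous_code, feedback):
    # Stand-in for an LLM call: starts with a wrong axis, fixes it on feedback.
    if feedback is None:
        return {"joint": "door_hinge", "axis": (1, 0, 0)}
    return {**previous_code, "axis": (0, 0, 1)}


def toy_checker(code):
    # Stand-in for a VLM check on rendered images.
    if code["axis"] == (0, 0, 1):
        return True, None
    return False, "door should swing about the vertical axis"


result = refine("a cabinet with one door", toy_coder, toy_checker)
print(result)
```

Returning the `accepted` flag keeps a run that hit the round limit distinguishable from one that converged.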

LAMBench

A new dataset of 2K text–code–articulated-object pairs for systematic training and validation.

Articulation Builder. The Articulation Coder writes joint code, the Visualizer simulates motion as an image sequence, and the VLM-powered Checker provides corrective feedback — iterating until the motion is physically plausible and functionally correct.
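To make the idea of a motion sequence concrete, the sketch below samples a revolute joint between its limits and moves one reference point with Rodrigues' rotation formula, producing one pose per frame. It assumes a unit-length axis through the origin and at least two frames; the function names are hypothetical, and actual image rendering is out of scope.

```python
import math


def rotate_about_axis(point, axis, angle):
    """Rodrigues' rotation of a 3D point about a unit axis through the origin."""
    ax, ay, az = axis
    px, py, pz = point
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * px + ay * py + az * pz
    cross = (ay * pz - az * py, az * px - ax * pz, ax * py - ay * px)
    return tuple(
        p * c + cr * s + a * dot * (1 - c)
        for p, cr, a in zip(point, cross, axis)
    )


def motion_frames(point, axis, lower, upper, steps=5):
    """Positions of the point at evenly spaced joint angles (steps >= 2)."""
    return [
        rotate_about_axis(point, axis, lower + (upper - lower) * i / (steps - 1))
        for i in range(steps)
    ]


# A door edge 0.4 m from a vertical hinge, swung from closed to 90 degrees open.
frames = motion_frames((0.4, 0.0, 0.0), (0, 0, 1), 0.0, math.pi / 2)
for frame in frames:
    print(tuple(round(v, 3) for v in frame))
```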

Results

Gallery of diverse articulated 3D objects generated by LAM.

A gallery of articulated objects generated by LAM from text prompts — spanning furniture, tools, and complex open-world objects, each with functional joints.

Joint-Prediction Success Rate

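On the masked-URDF reconstruction task, the default LAM* substantially surpasses prior methods (77.1% vs. 40.3% for Articulate Anything on Five Classes), and finetuning the open-source Qwen3-VL-8B on LAMBench improves over its zero-shot counterpart (36.8% to 51.6% on Five Classes, 44.3% to 49.6% on General Classes).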

Method	Five Classes ↑	General Classes ↑
Real2Code	13.5%	—
Articulate Anything	40.3%	48.9%
LAM (Qwen3-VL-8B, zero-shot)	36.8%	44.3%
LAM (Qwen3-VL-8B, finetuned)	51.6%	49.6%
LAM* (default)	77.1%	68.2%

Visual Alignment & Articulation Quality

On the shared in-distribution classes, LAM* achieves the best text–visual alignment (CLIP, BLIP) and the highest GPT-5 pass rate for functionally correct articulation.

Method	CLIP ↑	BLIP ↑	GPT-5 pass ↑
CAGE	27.65	53.92	53.9%
SINGAPO	30.43	56.21	58.8%
Articulate Anything	28.23	56.99	65.3%
LAM (zero-shot)	27.63	55.92	66.1%
LAM (finetuned)	29.55	58.38	69.3%
LAM* (default)	31.94	63.76	77.0%

Shared classes: Storage Furniture, Table, Refrigerator, Dishwasher, Oven, Washer.

Qualitative Comparisons

Generation quality comparisons across classes.

Across classes. SINGAPO fails to produce sensible objects on out-of-distribution classes, and Articulate Anything struggles with the keyboard and scissors, whereas LAM produces collision-free, correctly articulated results.

Open-vocabulary scenarios. Across containers, tools, and complex furniture, LAM tracks part-to-part spatial relations more accurately and identifies movable components more reliably than Articulate Anything.

General & Open-World classes. (a–b) LAM achieves the best CLIP and BLIP scores; (c) both GPT-5 and human participants prefer LAM's objects — up to 91.7% user preference on Open-World classes.

Instruction-Following & Editing

Incremental editing. Over four editing steps that add and remove sub-objects, a one-drawer cabinet is turned into a five-drawer cabinet. Because the object is represented as code, the pipeline is portable and reusable.
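When the object is a program or a structured spec, an edit is just a small transformation of that spec. The sketch below is a hypothetical illustration: the dictionary layout, the helpers `add_drawer` and `remove_top_drawer`, and the edit sequence of four additions are all assumptions and do not reproduce the steps shown in the figure.

```python
import copy


def add_drawer(cabinet):
    """Return a new spec with one more drawer stacked above the existing ones."""
    edited = copy.deepcopy(cabinet)
    index = len(edited["drawers"])
    edited["drawers"].append({
        "name": f"drawer_{index}",
        "joint_type": "prismatic",
        "parent": "body",
        "z_offset": index * edited["drawer_height"],
    })
    return edited


def remove_top_drawer(cabinet):
    """Return a new spec without the topmost drawer, if there is one."""
    edited = copy.deepcopy(cabinet)
    if edited["drawers"]:
        edited["drawers"].pop()
    return edited


one_drawer = add_drawer({"drawer_height": 0.2, "drawers": []})

edited = one_drawer
for _ in range(4):  # four incremental edits
    edited = add_drawer(edited)

print([d["name"] for d in edited["drawers"]])
```

Each helper returns a copy, so earlier versions of the object stay available and any step can be replayed or undone.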

BibTeX

@inproceedings{gao2026lam,
  title     = {LAM: Language Articulated Object Modelers},
  author    = {Gao, Yipeng and Ge, Yunhao and Cai, Peilin and Seita, Daniel and Itti, Laurent},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}