LAM: Language-Augmented Model

University of Southern California

Teaser Image Placeholder

LAM integrates language models with vision systems to enhance multimodal understanding and generation capabilities.

Abstract

This project investigates novel approaches to combining language and vision modalities for improved performance on multimodal tasks. By leveraging pre-trained language models and vision encoders, we develop systems that can better understand and reason about visual content through natural language.

Our approach introduces a language-augmented framework that seamlessly integrates linguistic understanding with visual perception. This enables more robust and generalizable vision-language models capable of handling complex reasoning tasks.

We demonstrate that LAM achieves state-of-the-art performance on various multimodal benchmarks, showing significant improvements in tasks such as visual question answering, image captioning, and cross-modal retrieval.

Method

Our method consists of three key components (a code sketch follows the list below):

  • Language-Vision Fusion: A novel fusion mechanism that effectively combines visual features with linguistic representations.
  • Cross-Modal Attention: Attention mechanisms that enable fine-grained alignment between language and vision modalities.
  • Contextual Reasoning: Enhanced reasoning capabilities through language-guided visual understanding.
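
The project page does not include implementation details, so the following is only a minimal PyTorch sketch of how the fusion and cross-modal attention components could be wired together: visual tokens are projected into the language embedding space, language tokens attend to them, and a small feed-forward block stands in for language-guided reasoning. The module name LanguageVisionFusion, the dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' released implementation.

  # Minimal sketch (assumption: token-level features from a vision encoder and a
  # pre-trained language model; module names and dimensions are illustrative).
  import torch
  import torch.nn as nn

  class LanguageVisionFusion(nn.Module):
      """Cross-modal attention block: language tokens attend to visual tokens."""

      def __init__(self, lang_dim: int = 768, vis_dim: int = 1024,
                   hidden_dim: int = 768, num_heads: int = 8):
          super().__init__()
          # Project both modalities into a shared embedding space.
          self.lang_proj = nn.Linear(lang_dim, hidden_dim)
          self.vis_proj = nn.Linear(vis_dim, hidden_dim)
          # Cross-modal attention: queries from language, keys/values from vision.
          self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
          self.norm1 = nn.LayerNorm(hidden_dim)
          self.norm2 = nn.LayerNorm(hidden_dim)
          # Feed-forward block standing in for language-guided contextual reasoning.
          self.ffn = nn.Sequential(
              nn.Linear(hidden_dim, 4 * hidden_dim),
              nn.GELU(),
              nn.Linear(4 * hidden_dim, hidden_dim),
          )

      def forward(self, lang_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
          # lang_tokens: (batch, num_lang_tokens, lang_dim)
          # vis_tokens:  (batch, num_vis_tokens, vis_dim)
          q = self.lang_proj(lang_tokens)
          kv = self.vis_proj(vis_tokens)
          attended, _ = self.cross_attn(query=q, key=kv, value=kv)
          x = self.norm1(q + attended)            # residual connection + layer norm
          return self.norm2(x + self.ffn(x))      # fused language-vision features

  # Toy usage with random tensors standing in for encoder outputs.
  if __name__ == "__main__":
      fusion = LanguageVisionFusion()
      lang = torch.randn(2, 16, 768)   # e.g., text tokens from a language model
      vis = torch.randn(2, 49, 1024)   # e.g., a 7x7 patch grid from a vision encoder
      print(fusion(lang, vis).shape)   # torch.Size([2, 16, 768])

In this sketch the fused output keeps the language token layout, so it can be fed back into a language model for downstream reasoning; whether LAM uses a residual pre- or post-norm design, or a different fusion direction, is not specified on this page.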

Method Diagram Placeholder

Results

We evaluate our method on multiple benchmark datasets and demonstrate significant improvements over existing approaches across vision-language tasks, including visual question answering, image captioning, and cross-modal retrieval.

BibTeX

@inproceedings{gao2025lam,
  author    = {Gao, Yipeng and others},
  title     = {LAM: Language-Augmented Model},
  booktitle = {Conference},
  year      = {2025},
}