Decomposed Chain-of-Thoughts and Multi-Dimensional Rewards for Video-to-Audio Generation

Top-tier Video Generation Models + PrismAudio

Sora2 + PrismAudio

Veo3 + PrismAudio


Abstract

Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark.
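The CoT-reward correspondence described above pairs each of the four reasoning modules with its own reward signal, and GRPO-style training then compares a group of sampled outputs against each other rather than against a learned critic. As an illustrative sketch only (not the authors' code), the snippet below normalizes each reward dimension within a sampling group and sums the normalized scores into a per-sample advantage; the reward values and the equal-weight sum are assumptions for illustration.

```python
# Illustrative sketch of multi-dimensional, group-relative advantages
# in the spirit of GRPO. Not the authors' implementation; reward values
# and equal weighting across dimensions are assumed.
from statistics import mean, pstdev

def group_advantages(rewards):
    """rewards: one dict per sampled audio in the group, holding the
    four per-dimension scores (higher = better for all four here)."""
    dims = ["semantic", "temporal", "aesthetic", "spatial"]
    adv = [0.0] * len(rewards)
    for d in dims:
        vals = [r[d] for r in rewards]
        mu, sigma = mean(vals), pstdev(vals) or 1.0
        # normalize each dimension within the group, then sum across dims
        for i, v in enumerate(vals):
            adv[i] += (v - mu) / sigma
    return adv

group = [
    {"semantic": 0.42, "temporal": -0.51, "aesthetic": 6.1, "spatial": -4.1},
    {"semantic": 0.47, "temporal": -0.41, "aesthetic": 6.4, "spatial": -3.8},
    {"semantic": 0.31, "temporal": -0.85, "aesthetic": 5.9, "spatial": -7.2},
]
adv = group_advantages(group)
# the sample that scores best on every dimension gets the largest advantage
assert max(range(len(adv)), key=lambda i: adv[i]) == 1
```

Because each dimension is normalized separately before summing, no single reward scale can dominate the others, which is one simple way to avoid the objective entanglement the abstract describes.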


The PrismAudio Framework

PrismAudio illustration

Overview of PrismAudio. Left panel: the process of constructing CoT training data with Gemini 2.5 Pro, then fine-tuning VideoLLaMA2 for decomposed CoT generation. Right panel: the Fast-GRPO multi-dimensional CoT-RL framework for post-training the Audio Foundation Model.
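Fast-GRPO's hybrid ODE-SDE sampling keeps most denoising steps deterministic (cheap, reusable across the group) and injects stochasticity on only a few steps so that GRPO still sees diverse candidates. The toy 1-D sketch below conveys that idea under heavy assumptions: the velocity field, step counts, and noise schedule are illustrative, not the paper's.

```python
# Toy 1-D sketch of hybrid ODE-SDE sampling: deterministic ODE (Euler)
# updates on most steps, stochastic noise injected on a chosen few.
# All specifics (flow field, schedules, which steps are stochastic)
# are assumptions for illustration, not the Fast-GRPO implementation.
import math
import random

def velocity(x, t, target=1.0):
    # toy flow field that transports samples toward `target` as t -> 1
    return (target - x) / max(1.0 - t, 1e-3)

def hybrid_sample(steps=20, sde_steps=frozenset({18, 19}), sigma=0.1, seed=0):
    rng = random.Random(seed)
    x, dt = rng.gauss(0.0, 1.0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        x += velocity(x, t) * dt              # deterministic ODE update
        if k in sde_steps:                    # noise on a few steps only
            x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x
```

With `sde_steps` empty the trajectory is fully deterministic; enabling noise on a handful of late steps yields distinct samples per seed at a fraction of the cost of running every step as an SDE.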


Comparison with Baselines in Multi-Event Scenarios on the AudioCanvas Benchmark

Click any card below to load the full high-fidelity comparison and Decomposed Chain-of-Thought analysis for that scene.

Chopping, Thudding

Machining, Tapping

Ice Cutting, Freezer Shutting

Unlocking, Pouring

Smoothing, Flipping

Plastic Pop Opening

Unrolling and Tearing Tape, Smoothing

Piercing, Rubbing

Exchanging Soap, Pouring


Comparison with Baselines in Single-Event Scenarios on the AudioCanvas Benchmark

Click any card below to load the full high-fidelity comparison and Decomposed Chain-of-Thought analysis for that scene.

Trapped Cat

Accelerating, Revving, Vroom

Field Recording

Playing Double Bass

Baby Laughter

Firing Machine Gun

Carpet Cleaning

Rail Transport

Neigh, Whinny


Comparison with Baselines on VGGSound

Click any card below to load the full high-fidelity comparison and Decomposed Chain-of-Thought analysis for that scene.

Baltimore Oriole Calling

Playing Electric Guitar

Playing Drum Kit

Playing Badminton

Car Passing by

Stream Burbling

People Giggling

Panda Sneezing

People Laughing


Quantitative Results

In-domain Evaluation on VGGSound

Objective and Subjective evaluations on the in-domain VGGSound test set.
Metric groups: Semantic (CLAP↑), Temporal (DeSync↓), Aesthetic Quality (PQ↑, PC↓, CE↑, CU↑), Spatial Accuracy (GCC↓, CRW↓), Distribution (FD↓, KL↓), Subjective (MOS-Q↑, MOS-C↑).

| Method | Params | CLAP↑ | DeSync↓ | PQ↑ | PC↓ | CE↑ | CU↑ | GCC↓ | CRW↓ | FD↓ | KL↓ | MOS-Q↑ | MOS-C↑ | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GT | - | 0.46 | 0.55 | 6.30 | 3.85 | 4.40 | 5.65 | - | - | - | - | 4.58±0.18 | 4.65±0.15 | - |
| Frieren† | 159M | 0.32 | 0.85 | 5.90 | 3.50 | 3.57 | 5.35 | - | - | 1.34 | 2.86 | 3.45±0.75 | 3.51±0.80 | - |
| V2A-Mapper† | 229M | 0.31 | 1.23 | 6.26 | 3.54 | 4.12 | 5.63 | - | - | 0.90 | 2.49 | 3.38±0.82 | 3.44±0.88 | - |
| AudioX | 1.1B | 0.41 | 1.24 | 5.94 | 3.43 | 3.86 | 5.44 | 7.22 | 19.25 | 1.51 | 1.80 | 3.61±0.75 | 3.65±0.72 | 7.52 |
| HunyuanVideo-Foley | 5.31B | 0.42 | 0.55 | 5.85 | 3.26 | 3.92 | 5.26 | - | - | 2.26 | 1.73 | 3.88±0.55 | 3.96±0.52 | 10.63 |
| MMAudio | 1.03B | 0.40 | 0.46 | 5.94 | 3.51 | 3.88 | 5.28 | - | - | 2.17 | 1.32 | 3.95±0.51 | 4.03±0.58 | 1.30 |
| ThinkSound | 1.3B | 0.43 | 0.55 | 6.15 | 3.53 | 3.95 | 5.48 | 4.65 | 13.47 | 1.17 | 1.35 | 4.05±0.55 | 4.18±0.51 | 1.07 |
| PrismAudio w/o CoT-RL | 518M | 0.42 | 0.51 | 6.17 | 3.32 | 3.94 | 5.48 | 4.06 | 10.29 | 1.14 | 1.43 | 4.02±0.48 | 4.11±0.42 | 0.63 |
| PrismAudio (Ours) | 518M | 0.47 | 0.41 | 6.38 | 3.24 | 4.29 | 5.68 | 3.77 | 7.72 | 1.08 | 1.23 | 4.21±0.35 | 4.22±0.29 | 0.63 |

Out-of-Domain Evaluation on AudioCanvas

Objective and Subjective evaluations on the out-of-domain AudioCanvas benchmark.
Metric groups: Semantic (CLAP↑), Temporal (DeSync↓), Aesthetic Quality (PQ↑, PC↓, CE↑, CU↑), Spatial Accuracy (GCC↓, CRW↓), Distribution (FD↓, KL↓), Subjective (MOS-Q↑, MOS-C↑).

| Method | CLAP↑ | DeSync↓ | PQ↑ | PC↓ | CE↑ | CU↑ | GCC↓ | CRW↓ | FD↓ | KL↓ | MOS-Q↑ | MOS-C↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GT | 0.48 | 0.40 | 6.47 | 3.16 | 4.02 | 5.99 | - | - | - | - | 4.65±0.23 | 4.72±0.20 |
| HunyuanVideo-Foley | 0.44 | 0.47 | 6.43 | 3.25 | 4.04 | 5.88 | - | - | 2.04 | 2.07 | 3.75±0.52 | 3.71±0.58 |
| MMAudio | 0.46 | 0.43 | 6.30 | 3.23 | 3.97 | 5.77 | - | - | 3.59 | 1.87 | 3.88±0.45 | 3.87±0.41 |
| ThinkSound | 0.48 | 0.80 | 6.48 | 3.50 | 4.10 | 5.94 | 4.43 | 22.82 | 1.95 | 2.54 | 3.79±0.58 | 3.80±0.54 |
| PrismAudio w/o CoT-RL | 0.42 | 0.44 | 6.45 | 3.22 | 3.81 | 5.87 | 4.11 | 15.30 | 2.10 | 2.17 | 3.91±0.35 | 3.85±0.31 |
| PrismAudio (Ours) | 0.52 | 0.36 | 6.68 | 2.82 | 4.26 | 6.15 | 3.50 | 12.87 | 1.92 | 1.53 | 4.12±0.28 | 4.01±0.25 |

BibTex

If you find our work useful, please cite our paper:
@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
  title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation},
  author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
  year={2025},
  eprint={2511.18833},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2511.18833},
}