Abstract: Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field use transformer-based models to model the ...
Abstract: Recent CLIP-guided 3D generation methods have achieved promising results but struggle to generate faithful 3D shapes that conform to the input text due to the gap between text and image ...
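An audio-text forced alignment tool based on Qwen3-ForcedAligner-0.6B. Given an audio clip and its corresponding text, it outputs a timestamp for each word/character. Supports the CLI and ...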
TL;DR: We propose ReAlign, a plug-and-play reward-guided alignment strategy for text-to-motion generation, which explicitly enhances both semantic consistency and motion realism throughout the ...