Abstract: We study the problem of compositional zero-shot learning for object-attribute recognition. Prior works use visual features extracted with a backbone network, pre-trained for object ...
Abstract: There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual ...