To create coherent images or videos, generative AI diffusion models like Stable Diffusion or FLUX have typically relied on external "teachers"—frozen encoders like CLIP or DINOv2—to provide the ...