Hugging Face cross-modal models combine text, image, and audio data within a single multi-modal ML model, supporting applications such as text-to-image generation and captioning.

https://huggingface.co/models?pipeline_tag=multi-modal