Any-to-Any Arrives on Hugging Face: Meet Qwen 2.5-Omni

Tags: Hugging Face · Qwen 2.5-Omni · Any-to-Any · Multimodal AI · Transformers


If you’ve opened the Hugging Face Tasks page lately, you may have spotted a neon-yellow badge that wasn’t there before: “Any-to-Any.” It marks a new frontier: models that can ingest multiple modalities (text, images, audio, video…) and emit multiple modalities in the same forward pass. The first flagship in that category is Alibaba Cloud’s Qwen 2.5-Omni (7 B and 3 B), already wired into transformers thanks to HF engineer Merve Noyan. Here’s the low-down in proper Murapa style.


What on earth is “Any-to-Any”?

In Hugging Face’s task taxonomy, an Any-to-Any model accepts any mix of input modalities (text, images, audio, video) and can emit any mix of output modalities from the same forward pass, instead of chaining separate captioning, speech-recognition, and text-to-speech models. Qwen 2.5-Omni is the first flagship to wear the badge: it ingests all four input types and replies with text, speech, or both.

Qwen 2.5-Omni at a glance

Params: 7 B (plus a lighter 3 B)
Inputs: text, image, audio, video
Outputs: text and/or 24 kHz speech
Core tricks: Thinker–Talker split transformer and TMRoPE time-aligned positional embeddings

The model card spells out that Omni can “see, hear, talk and write” in real time, courtesy of a Thinker (language) module whose outputs feed a slim Talker (speech) head. (Hugging Face) Alibaba’s own launch post positions it as “deploy-anywhere multimodality” for phones and edge GPUs. (AlibabaCloud) The full source lives on GitHub with reproducible training utilities. (GitHub)
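
In practice, the split means you can run the Thinker without paying for the Talker. A minimal sketch, assuming the disable_talker() helper and return_audio flag documented in the model card’s usage notes (verify both against your installed transformers version):

from transformers import Qwen2_5OmniForConditionalGeneration

# Load the full checkpoint, then drop the speech head if you only need text.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
model.disable_talker()  # frees the Talker's memory; text generation is unaffected

# Later, skip speech synthesis explicitly at generation time:
# text_ids = model.generate(**inputs, return_audio=False)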

Transformers-first support

Qwen2.5-Omni support in transformers exposes a Qwen2_5OmniForConditionalGeneration class, first distributed via the v4.51.3-Qwen2.5-Omni-preview tag and folded into stable releases shortly after; the PR landed the same week the model went public, pushed by Merve Noyan. (Hugging Face)
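
Before running the quickstart, it’s worth sanity-checking the environment. A small sketch: the preview-tag install line is the one the model card pointed at around launch, and accelerate/soundfile are assumed for device_map="auto" and for saving the speech output.

# pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
# pip install accelerate soundfile
import transformers

print(transformers.__version__)

# If this import fails, the installed transformers build predates Qwen2.5-Omni support.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor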


Quickstart: text + video in → text + speech out

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
import soundfile as sf

model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",                       # bf16 checkpoint; required for flash-attention
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional; drop this line if flash-attn isn't installed
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# The model card recommends this exact system prompt whenever speech output is requested.
system_prompt = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "video", "video": "./demo_clip.mp4"},
        {"type": "text", "text": "Describe what's happening and read it aloud."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,  # extract the audio track from the video file
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,            # needed so **inputs unpacks cleanly into generate()
    video_fps=2,
    use_audio_in_video=True,
    return_tensors="pt",
).to(model.device)

# One call returns token ids for the text reply plus a 24 kHz waveform from the Talker.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

That’s all you need: one processor handles tokenization, frame sampling, and audio extraction, and one .generate() call returns both modalities.
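
One knob worth knowing once the basics work: the model ships with two voices, “Chelsie” (the default) and “Ethan”, selectable at generation time. Continuing from the quickstart above, the speaker argument below follows the transformers usage notes; the standalone GitHub release names the same knob spk, so treat the exact keyword as version-dependent.

# Same inputs as above, different voice for the spoken reply.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, speaker="Ethan")
sf.write("reply_ethan.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)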


Why devs should care

- No orchestration glue: one processor and one .generate() call cover tokenization, video frame sampling, audio extraction, and speech synthesis, replacing a separate ASR + LLM + TTS stack.
- Edge-friendly sizes: a 7 B flagship plus a 3 B variant, which Alibaba pitches as “deploy-anywhere multimodality” for phones and edge GPUs.
- Real-time interaction: text and 24 kHz speech come out of the same forward pass, which is exactly what voice assistants and live video Q&A need.

Who else is in the Any-to-Any club?

Janus-Pro-7B (DeepSeek): decoupled visual pathways; strong text-to-image generation. (Hugging Face)
4M-21 XL (EPFL-VILAB): token-based framework spanning tens of modalities. (GitHub)
Chameleon-7B (Meta): early-fusion token model that writes mixed image-text docs. (VentureBeat)

The competition is heating up; Reuters notes that Alibaba’s rapid Qwen updates aim to keep pace with DeepSeek’s aggressive roadmap. (Reuters)


Final thoughts

The Any-to-Any badge is more than another checkbox in HF’s task list—it heralds a shift toward unified, real-time, multimodal systems. Qwen 2.5-Omni shows that a 7 B parameter budget is already enough to listen, watch, speak, and write at once. With transformers support baked in, you can integrate it today—no orchestration glue required.

Happy hacking, and let me know what you build with it!
