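Qwen3-Omni: An Omni-Modal, Multilingual, Real-Time Interactive Large Model

Qwen3-Omni is a natively end-to-end, multilingual omni-modal foundation model. It accepts text, image, audio, and video input and returns real-time streaming responses as text or natural speech. It delivers strong cross-modal understanding and generation and reaches state-of-the-art results on text, image, audio, and audio-visual tasks: of 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and overall SOTA on 22, with performance comparable to Gemini 2.5 Pro and GPT-4o. Qwen3-Omni uses an MoE (Mixture-of-Experts) Thinker–Talker design, combined with AuT pretraining and a multi-codebook design for low latency and efficiency. It supports 119 text languages, 19 speech-input languages, and 10 speech-output languages, offers highly customizable system prompts for fine-grained control of model behavior, and includes Qwen3-Omni-30B-A3B-Captioner, a model dedicated to detailed audio captioning. The model can be deployed and queried through Hugging Face Transformers, vLLM (recommended for high-performance scenarios), or the DashScope API. Its real-time multimodal conversation can be tried through the online demos or a local Web UI, which support choosing the output voice and using the audio track inside a video.

Leading cross-modal performance: text-first pretraining followed by mixed multimodal training gives native multimodal support. The model performs strongly on audio and audio-visual tasks without any regression in unimodal text and image performance. Across 36 audio/video benchmarks it reaches overall SOTA on 22 and open-source SOTA on 32; its speech recognition, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.

Multilingual support: 119 text languages, 19 speech-input languages, and 10 speech-output languages.

• Speech input languages: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.

• Speech output languages: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.

Novel architecture: an MoE-based Thinker–Talker design, with AuT pretraining for strong general representations and a multi-codebook design that minimizes latency.

Real-time audio/video interaction: low-latency streaming with natural turn-taking and fast text or speech responses.

Flexible control: model behavior can be customized through system prompts, giving fine-grained control and easy adaptation to different scenarios.

Detailed audio captioner: Qwen3-Omni-30B-A3B-Captioner is open-sourced as a general-purpose, highly detailed, low-hallucination audio captioning model, filling a gap in the open-source community.

The Qwen3-Omni architecture consists of several key modules that work together to process multimodal data:

• Text Token Streaming Codec Decoder, Codec Embedding, and Text Embedding: handle text-type data, converting text into embedding vectors the model can consume.

• Vision Hidden, Audio Hidden, and Codec Hidden: process image, audio, and codec data respectively, extracting features and turning them into hidden-state representations.

• MTP Module and Pad Hidden: integrate and adapt the hidden states of the different modalities so that the data format is uniform.

• Qwen3-Omni MoE Thinker and Qwen3-Omni MoE Talker: the Thinker extracts hidden-state information from intermediate layers for reasoning and analysis, and the Talker generates text or speech output based on the Thinker's results.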

• Vision Encoder and AuT: the vision encoder processes video data, extracting features along the height, width, and time dimensions; AuT is the audio encoder.

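Qwen3-Omni Usage Guide

1. Model Download

Qwen3-Omni is released as three models with different capabilities; download whichever fits your needs:

Model name	Description
Qwen3-Omni-30B-A3B-Instruct	The Instruct model of Qwen3-Omni-30B-A3B. It contains both the Thinker and the Talker, accepts audio, video, and text input, and outputs audio and text. See the Qwen3-Omni technical report for details.
Qwen3-Omni-30B-A3B-Thinking	The Thinking model of Qwen3-Omni-30B-A3B. It contains only the Thinker, has chain-of-thought reasoning, accepts audio, video, and text input, and outputs text. See the Qwen3-Omni technical report for details.
Qwen3-Omni-30B-A3B-Captioner	A downstream fine-grained audio captioning model fine-tuned from Qwen3-Omni-30B-A3B-Instruct. It produces detailed, low-hallucination captions for arbitrary audio input. It contains the Thinker, accepts audio input, and outputs text. See the model's usage guide or the Hugging Face and ModelScope demos for more.

How to download: when a model is loaded through Hugging Face Transformers or vLLM, the weights are downloaded automatically based on the model name. If the runtime environment is not suited to downloading during execution, download the weights to a local directory beforehand with the commands below:

• Download from ModelScope (recommended for users in mainland China):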

pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

• Download from Hugging Face:

pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner
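
The same download can be scripted from Python. The sketch below is one possible approach, assuming the huggingface_hub package is installed and using its snapshot_download function; the helper names local_dir_for and download are hypothetical and only mirror the directory layout of the CLI commands above.

```python
MODELS = [
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "Qwen/Qwen3-Omni-30B-A3B-Captioner",
]


def local_dir_for(repo_id: str) -> str:
    """Mirror the CLI examples: ./<model name without the 'Qwen/' prefix>."""
    return "./" + repo_id.split("/", 1)[1]


def download(repo_id: str) -> str:
    """Download one model snapshot and return the local path (hypothetical helper)."""
    # Imported lazily so local_dir_for works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id))


# Example (downloads tens of gigabytes, so it is left commented out):
# download(MODELS[0])
```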

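2. Using Transformers

(1) Installing dependencies

The Hugging Face Transformers code for Qwen3-Omni has been merged, but the PyPI package has not been released yet, so Transformers must be installed from source. Creating a fresh Python environment or using the provided Docker image is recommended to avoid environment conflicts:

# If transformers is already installed, uninstall it first or create a new Python environment
# pip uninstall transformers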
pip install git+https://github.com/huggingface/transformers
pip install accelerate

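A utility package is also available for handling various audio and visual inputs (base64, URLs, and interleaved audio and video data). It requires ffmpeg to be installed on the system: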

pip install qwen-omni-utils -U

With Hugging Face Transformers, installing FlashAttention 2 is recommended to reduce GPU memory usage (vLLM already includes FlashAttention 2, so no separate installation is needed):

pip install -U flash-attn --no-build-isolation

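Note: FlashAttention 2 requires compatible hardware and is only available when the model is loaded in torch.float16 or torch.bfloat16. See the official FlashAttention documentation for more information.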
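
The precision constraint can be encoded as a small guard before calling from_pretrained. This is a minimal sketch, not part of the official examples: the helper names are hypothetical, the fallback value "sdpa" is an assumption (it is one of the attention backends Transformers accepts), and the hardware check is not covered here.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(dtype_name: str, has_flash_attn: bool) -> str:
    """Choose a value for attn_implementation (hypothetical helper).

    FlashAttention 2 is only usable in float16 or bfloat16, so any other
    precision falls back to "sdpa" (an assumed fallback).
    """
    half_precision = {"float16", "bfloat16"}
    if has_flash_attn and dtype_name in half_precision:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation("bfloat16", flash_attn_installed()))
```

The returned string could then be passed as attn_implementation when loading the model.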

(2) Code examples

Basic usage example:

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."}
        ],
    },
]

# Whether to use the audio track of input videos
USE_AUDIO_IN_VIDEO = True

# Prepare the inputs for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generate the output text and audio
text_ids, audio = model.generate(**inputs, 
                                 speaker="Ethan", 
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)

if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )

Batch inference example: with return_audio=False, the model can process batched inputs that mix text, image, audio, and video samples:

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

# Image-only conversation
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
        ]
    }
]

# Audio-only conversation
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you hear in this audio?"},
        ]
    }
]

# Text-only conversation (with a system prompt)
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen-Omni."}
        ],
    },
    {
        "role": "user",
        "content": "Who are you?"
    }
]

# Mixed-media conversation
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
        ],
    }
]

# Combine the conversations into one batch
conversations = [conversation1, conversation2, conversation3, conversation4]

# Whether to use the audio track of input videos
USE_AUDIO_IN_VIDEO = True

# Prepare the inputs for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch inference does not support returning audio
text_ids, audio = model.generate(**inputs,
                                 return_audio=False,
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
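
The slice text_ids.sequences[:, inputs["input_ids"].shape[1]:] drops the prompt tokens so that only newly generated tokens are decoded. The idea can be illustrated with plain Python lists; the token ids below are made up, and the sketch assumes every prompt in the batch is padded to the same length.

```python
def strip_prompt(sequences, prompt_len):
    """Keep only the tokens generated after the padded prompt."""
    return [seq[prompt_len:] for seq in sequences]


# Two sequences whose first four tokens are the (padded) prompt.
batch = [
    [101, 7, 8, 9, 42, 43],
    [0, 101, 5, 6, 77, 78],
]
print(strip_prompt(batch, prompt_len=4))  # prints [[42, 43], [77, 78]]
```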

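Controlling audio output:

The model supports both text and audio output. If audio output is not needed, call model.disable_talker() after initializing the model. This saves about 10 GB of GPU memory, but the return_audio argument of generate can then only be set to False: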

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

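Alternatively, control whether audio is returned on each call through the return_audio argument of generate. With return_audio=False the model outputs text only and responds faster: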

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
# ...
text_ids, _ = model.generate(..., return_audio=False)

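Changing the output voice:

Qwen3-Omni supports changing the voice of the output audio. The "Qwen/Qwen3-Omni-30B-A3B-Instruct" checkpoint supports three voices:

Voice	Gender	Description
Ethan	Male	A bright, upbeat voice with infectious energy and a warm, approachable style.
Chelsie	Female	A sweet, soft voice with gentle warmth and high clarity.
Aiden	Male	A mild, relaxed American accent with a friendly, youthful feel.

Specify the voice with the speaker argument of generate. If it is not specified, the Ethan voice is used by default: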

text_ids, audio = model.generate(..., speaker="Ethan")
text_ids, audio = model.generate(..., speaker="Chelsie")
text_ids, audio = model.generate(..., speaker="Aiden")
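
Because speaker is a free-form string, it can help to validate it before calling generate. The sketch below is a hypothetical helper built only from the voice table above; it does not call the model.

```python
SUPPORTED_SPEAKERS = ("Ethan", "Chelsie", "Aiden")
DEFAULT_SPEAKER = "Ethan"


def resolve_speaker(name=None):
    """Return a valid speaker name, falling back to the default voice."""
    if name is None:
        return DEFAULT_SPEAKER
    if name not in SUPPORTED_SPEAKERS:
        raise ValueError(
            f"Unsupported speaker {name!r}; choose one of {SUPPORTED_SPEAKERS}"
        )
    return name


print(resolve_speaker())          # prints Ethan
print(resolve_speaker("Aiden"))   # prints Aiden
```

The result could then be passed as model.generate(..., speaker=resolve_speaker(choice)).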

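For more usage details such as prompt settings, task-specific usage, and resource requirements, see the Usage Tips section and the usage guide.

3. Using vLLM

(1) Installing dependencies

vLLM is recommended for inference and deployment of the Qwen3-Omni series. The relevant code is currently at the pull-request stage, and audio-output inference support for the Instruct model will be released soon. Install vLLM from source with the commands below. Creating a fresh Python environment or using the provided Docker image is recommended to avoid environment conflicts. See the official vLLM documentation for more details on building from source: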

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "Undefined symbol" error with VLLM_USE_PRECOMPILED=1, build from source with "pip install -e . -v" instead

Also install the other required dependencies:

pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

(2) Inference examples

Basic inference example:

import os
import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 is not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}
            ], 
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

Batch inference example:

vLLM enables fast batch inference, which is useful for processing large volumes of data or running benchmarks:

import os
import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

def build_input(processor, messages, use_audio_in_video):
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": use_audio_in_video,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios
    
    return inputs

if __name__ == '__main__':
    # vLLM engine v1 is not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Image-only conversation
    conversation1 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
            ]
        }
    ]

    # Audio-only conversation
    conversation2 = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "text", "text": "What can you hear in this audio?"},
            ]
        }
    ]

    # Text-only conversation (with a system prompt)
    conversation3 = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are Qwen-Omni."}
            ],
        },
        {
            "role": "user",
            "content": "Who are you? Answer in one sentence."
        }
    ]

    # Mixed-media conversation
    conversation4 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/asr_fr.wav"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ],
        }
    ]
    
    USE_AUDIO_IN_VIDEO = True

    # Combine the conversations into one batch
    conversations = [conversation1, conversation2, conversation3, conversation4]
    inputs = [build_input(processor, messages, USE_AUDIO_IN_VIDEO) for messages in conversations]

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    result = [outputs[i].outputs[0].text for i in range(len(outputs))]
    print(result)

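(3) Using vLLM serve

vLLM serve for Qwen3-Omni currently supports only the Thinker (text output), and it does not support the use_audio_in_video argument; pass the video and the audio as separate inputs instead. Start vLLM serve with the commands below: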

# Qwen3-Omni-30B-A3B-Instruct on a single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Instruct on multiple GPUs (4-GPU example)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4
# Qwen3-Omni-30B-A3B-Thinking on a single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Thinking on multiple GPUs (4-GPU example)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4

Once the server is running, call it through the chat completions API (for example with curl):

curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
    ]}
    ]
    }'
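
The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the helper names build_payload and chat are hypothetical, the payload mirrors the curl example above, and the response is assumed to follow the OpenAI-compatible layout (choices[0].message.content) used by the chat completions endpoint.

```python
import json
import urllib.request


def build_payload(image_url, audio_url, question,
                  system="You are a helpful assistant."):
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": question},
            ]},
        ]
    }


def chat(payload, base_url="http://localhost:8901"):
    """POST the payload to a running vLLM server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]


payload = build_payload(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav",
    "What can you see and hear? Answer in one sentence.",
)
# print(chat(payload))  # requires the server started with "vllm serve" above
```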

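For more usage details such as prompt settings, task-specific usage, and resource requirements, see the Usage Tips section and the usage guide.

4. Using the DashScope API

To explore Qwen3-Omni further, try the DashScope API for a faster and more efficient experience. The available APIs and their documentation are listed below:

API description	API documentation (mainland China)	API documentation (international)
Qwen3-Omni-Flash offline API, covering the Instruct and Thinking models	help.aliyun.com/zh/model-studio/qwen-omni	www.alibabacloud.com/help/en/model-studio/qwen-omni
Qwen3-Omni-Flash real-time API, supporting end-to-end real-time interaction	help.aliyun.com/zh/model-studio/realtime	www.alibabacloud.com/help/en/model-studio/realtime
Qwen3-Omni-30B-A3B-Captioner model API	help.aliyun.com/zh/model-studio/qwen3-omni-captioner	www.alibabacloud.com/help/zh/model-studio/qwen3-omni-captioner

5. Usage Tips (recommended reading)

(1) Minimum GPU memory requirements

Model	Precision	15 s video	30 s video	60 s video	120 s video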
Qwen3-Omni-30B-A3B-Instruct	BF16	78.85 GB	88.52 GB	107.74 GB	144.81 GB
Qwen3-Omni-30B-A3B-Thinking	BF16	68.74 GB	77.79 GB	95.76 GB	131.65 GB

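These figures are the theoretical minimum memory requirements for inference with transformers in BF16 precision, measured with attn_implementation="flash_attention_2" enabled. The Instruct model contains both the Thinker and the Talker; the Thinking model contains only the Thinker.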
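
For capacity planning, the table can be turned into a rough lookup. The sketch below copies the measured values and linearly interpolates between them; the interpolation is an assumption made here for illustration, not something the measurements guarantee, and the function name is hypothetical.

```python
MIN_GPU_MEMORY_GB = {
    "Qwen3-Omni-30B-A3B-Instruct": {15: 78.85, 30: 88.52, 60: 107.74, 120: 144.81},
    "Qwen3-Omni-30B-A3B-Thinking": {15: 68.74, 30: 77.79, 60: 95.76, 120: 131.65},
}


def estimate_min_memory_gb(model, video_seconds):
    """Estimate the BF16 minimum memory for a video length (rough sketch).

    Videos shorter than 15 s use the 15 s figure as a conservative value;
    lengths between measured points are linearly interpolated (assumption).
    """
    points = sorted(MIN_GPU_MEMORY_GB[model].items())
    if video_seconds <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if video_seconds <= x1:
            return y0 + (y1 - y0) * (video_seconds - x0) / (x1 - x0)
    raise ValueError("video length is beyond the measured range (120 s)")


print(round(estimate_min_memory_gb("Qwen3-Omni-30B-A3B-Instruct", 45), 2))
```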

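(2) Prompt for audio-visual interaction

When using Qwen3-Omni for audio-visual multimodal interaction (the input consists of a video and its corresponding audio, with the audio serving as the query), the system prompt below is recommended. It helps the model retain strong reasoning ability while better playing interactive roles such as an intelligent assistant. It also makes the text generated by the Thinker more readable, with a natural, conversational tone and without complex formatting that is hard to vocalize, which lets the Talker produce more stable and fluent audio. The user_system_prompt field in the system prompt can be customized with a persona or other specific descriptions as needed: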

user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."
message = {
    "role": "system",
    "content": [
          {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age.\nYou are communicating with the user.\nIn user messages, “I/me/my/we/our” refer to the user and “you/your” refer to the assistant. In your replies, address the user as “you/your” and yourself as “I/me/my”; never mirror the user’s pronouns—always shift perspective. Keep original pronouns only in direct quotes; if a reference is unclear, ask a brief clarifying question.\nInteract with users using short(no more than 50 words), brief, straightforward language, maintaining a natural tone.\nNever use formal phrasing, mechanical expressions, bullet points, overly structured language. \nYour output must consist only of the spoken content you want the user to hear. \nDo not include any descriptions of actions, emotions, sounds, or voice changes. \nDo not use asterisks, brackets, parentheses, or any other symbols to indicate tone or actions. \nYou must answer users' audio or text questions, do not directly describe the video content. \nYou should communicate in the same language strictly as the user unless they request otherwise.\nWhen you are uncertain (e.g., you can't see/hear clearly, don't understand, or the user makes a comment rather than asking a question), use appropriate questions to guide the user to continue the conversation.\nKeep replies concise and conversational, as if talking face-to-face."}
    ]
}

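(3) Best practices for the Thinking model

Qwen3-Omni-30B-A3B-Thinking is mainly intended for understanding and interacting with multimodal inputs such as text, audio, images, and video. For best performance, include an explicit text instruction or task description alongside the multimodal input in every turn. This clarifies the intent and substantially improves the model's reasoning. For example: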

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Analyze this audio, image, and video together."},
        ], 
    }
]
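
A small pre-flight check can catch user turns that carry media but no instruction. This is an illustrative sketch; has_text_instruction is a hypothetical helper that only inspects the message structure used in the examples.

```python
def has_text_instruction(message):
    """True if a chat message contains a non-empty text instruction."""
    content = message["content"]
    if isinstance(content, str):
        return bool(content.strip())
    return any(
        item.get("type") == "text" and item.get("text", "").strip()
        for item in content
    )


with_text = {
    "role": "user",
    "content": [
        {"type": "audio", "audio": "/path/to/audio.wav"},
        {"type": "text", "text": "Analyze this audio."},
    ],
}
media_only = {
    "role": "user",
    "content": [{"type": "video", "video": "/path/to/video.mp4"}],
}
print(has_text_instruction(with_text), has_text_instruction(media_only))
# prints True False
```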

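(4) Using the audio in a video

In multimodal interaction, user-provided videos often carry audio (such as a spoken question or sounds from events in the video), and this information helps the model deliver a better interactive experience. Whether the audio in a video is used can be set as follows:

# Data preprocessing stage
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
# With Transformers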
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", 
                   padding=True, use_audio_in_video=True)
text_ids, audio = model.generate(..., use_audio_in_video=True)

# With vLLM
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = {
    'prompt': text,
    'multi_modal_data': {},
    "mm_processor_kwargs": {
        "use_audio_in_video": True,
    },
}

Note: in multi-turn conversations, use_audio_in_video must be set consistently across all of these steps; otherwise unexpected results may occur.
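
One way to keep the flag consistent is to define it once and derive every stage's keyword arguments from that single value. The helper below is a hypothetical sketch; the stage names are labels chosen here, not library identifiers.

```python
def audio_in_video_kwargs(use_audio_in_video: bool):
    """Return per-stage keyword arguments that all share one flag value."""
    flag = {"use_audio_in_video": use_audio_in_video}
    return {
        "process_mm_info": dict(flag),           # data preprocessing
        "processor": dict(flag),                 # Transformers processor call
        "generate": dict(flag),                  # Transformers generate call
        "vllm_mm_processor_kwargs": dict(flag),  # vLLM "mm_processor_kwargs"
    }


kwargs = audio_in_video_kwargs(True)
# Intended usage, mirroring the snippet above:
#   process_mm_info(messages, **kwargs["process_mm_info"])
#   processor(text=text, ..., **kwargs["processor"])
#   model.generate(..., **kwargs["generate"])
print(kwargs["generate"])  # prints {'use_audio_in_video': True}
```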

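IV. Interacting with the Model

1. Online demos

No local deployment is needed: visit Hugging Face Spaces or ModelScope Studio to try the online web demos, which cover Qwen3-Omni-Realtime, Qwen3-Omni (Instruct and Thinking models), and Qwen3-Omni-30B-A3B-Captioner.

2. Real-time interaction

Qwen3-Omni now supports real-time streaming interaction. Open Qwen Chat and choose the voice/video call option in the chat box to try it.

3. Deploying the local Web UI demo

(1) Installing dependencies

Before deploying, set up the environment following the installation part of the vLLM section, so that both the vLLM and Transformers backends work smoothly. If only the Transformers backend is needed (note that inference will be significantly slower), follow the installation instructions in the Transformers section instead. Using the provided Docker image is still recommended to avoid potential environment issues. In addition, running locally requires ffmpeg to be installed on the system, plus the following dependencies: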

pip install gradio==5.44.1 gradio_client==1.12.1 soundfile==0.13.1

(2) Launching the demo

With the required packages installed, launch the web demo with the commands below. A web server starts and prints a link for browser access. Run python web_demo.py --help and python web_demo_captioner.py --help to see more options:

# Qwen3-Omni-30B-A3B-Instruct with the vLLM backend
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Instruct
# Qwen3-Omni-30B-A3B-Instruct with the Transformers backend
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Instruct --use-transformers --generate-audio
# Qwen3-Omni-30B-A3B-Instruct with the Transformers backend and FlashAttention support
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Instruct --use-transformers --generate-audio --flash-attn2
# Qwen3-Omni-30B-A3B-Thinking with the vLLM backend
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Thinking
# Qwen3-Omni-30B-A3B-Thinking with the Transformers backend
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Thinking --use-transformers
# Qwen3-Omni-30B-A3B-Thinking with the Transformers backend and FlashAttention support
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Thinking --use-transformers --flash-attn2
# Qwen3-Omni-30B-A3B-Captioner with the vLLM backend
python web_demo_captioner.py -c Qwen/Qwen3-Omni-30B-A3B-Captioner
# Qwen3-Omni-30B-A3B-Captioner with the Transformers backend
python web_demo_captioner.py -c Qwen/Qwen3-Omni-30B-A3B-Captioner --use-transformers
# Qwen3-Omni-30B-A3B-Captioner with the Transformers backend and FlashAttention support
python web_demo_captioner.py -c Qwen/Qwen3-Omni-30B-A3B-Captioner --use-transformers --flash-attn2

After the command runs, the terminal prints a link similar to this:

Running on local: http://127.0.0.1:8901/

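When running locally, copy the link into a browser to open the Web UI. When running on a server or inside a Docker container, use the server's actual IP address or set up port forwarding as needed. See the official guide for configuring port forwarding for Docker containers.

V. Docker Deployment

To simplify deployment, a Docker image with a pre-built environment is provided: qwenllm/qwen3-omni. Only the GPU driver and the model files are needed to launch the demos. Before deploying, install the NVIDIA Container Toolkit following its guide so that Docker can access the GPU. Users in mainland China who have trouble reaching Docker Hub can pull the image through a registry mirror.

First, run the following command to pull the image and initialize the container: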

LOCAL_WORKDIR=/path/to/your/workspace
HOST_PORT=8901
CONTAINER_PORT=80
docker run --gpus all --name qwen3-omni \
    -v /var/run/docker.sock:/var/run/docker.sock -p $HOST_PORT:$CONTAINER_PORT \
    --mount type=bind,source=$LOCAL_WORKDIR,target=/data/shared/Qwen3-Omni \
    --shm-size=4gb \
    -it qwenllm/qwen3-omni:3-cu124

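The command drops you into a bash shell inside the container. The local model and data directory (replace /path/to/your/workspace with the actual path) is mounted at /data/shared/Qwen3-Omni inside the container. Host port 8901 is mapped to container port 80, so the services in the container are reachable through port 8901 on the host.

Note: services inside the container must listen on 0.0.0.0 for port forwarding to work. For example:

# Run the following command inside the Docker container
python web_demo.py -c Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 80 --server-name 0.0.0.0

For other ways to launch the web demo, see the section on deploying the local Web UI demo. After exiting the container, re-enter it with: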

docker start qwen3-omni
docker exec -it qwen3-omni bash

To remove the container completely, run:

docker rm -f qwen3-omni

VI. Performance Evaluation

1. Performance by modality

Qwen3-Omni maintains performance on par with same-size unimodal Qwen models in the text and vision modalities, with no regression. Across 36 audio and audio-visual benchmarks, it ranks first among open-source models on 32 and reaches overall SOTA on 22, surpassing well-known closed-source systems such as Gemini 2.5 Pro and GPT-4o.

(1) Text -> Text

Task type	Benchmark	GPT-4o-0327	Qwen3-235B-A22B (non-thinking)	Qwen3-30B-A3B-Instruct-2507	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
General	MMLU-Redux	91.3	89.2	89.3	86.6	86.8
General	GPQA	66.9	62.9	70.4	69.6	69.7
Reasoning	AIME25	26.7	24.7	61.3	65.0	65.9
Reasoning	ZebraLogic	52.6	37.7	90.0	76.0	76.1
Code	MultiPL-E	82.7	79.3	83.8	81.4	81.5
Alignment	IFEval	83.9	83.2	84.7	81.0	81.7
Alignment	Creative Writing v3	84.9	80.4	86.0	80.6	81.8
Alignment	WritingBench	75.5	77.0	85.5	82.6	83.0
Agent	BFCL-v3	66.5	68.0	65.1	64.4	65.0
Multilingual	MultiIF	70.4	70.2	67.9	64.0	64.7
Multilingual	PolyMATH	25.5	27.0	43.1	37.9	39.3

Task type	Benchmark	Gemini-2.5-Flash (thinking)	Qwen3-235B-A22B (thinking)	Qwen3-30B-A3B-Thinking-2507	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General	MMLU-Redux	92.1	92.7	91.4	88.8	89.7
General	GPQA	82.8	71.1	73.4	73.1	73.1
Reasoning	AIME25	72.0	81.5	85.0	73.7	74.0
Reasoning	LiveBench 20241125	74.3	77.1	76.8	71.8	70.3
Code	MultiPL-E	84.5	79.9	81.3	80.6	81.0
Alignment	IFEval	89.8	83.4	88.9	85.1	85.2
Alignment	Arena-Hard v2	56.7	61.5	56.0	55.1	57.8
Alignment	Creative Writing v3	85.0	84.6	84.4	82.5	83.6
Alignment	WritingBench	83.9	80.3	85.0	85.5	85.9
Agent	BFCL-v3	68.6	70.8	72.4	63.2	64.5
Multilingual	MultiIF	74.4	71.9	76.4	72.9	73.2
Multilingual	PolyMATH	49.8	54.7	52.6	47.1	48.7

(2) Audio -> Text

Task	Seed-ASR	Voxtral-Mini	Voxtral-Small	GPT-4o-Transcribe	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
English & Chinese ASR (WER)
Wenetspeech (net | meeting)	4.66 | 5.69	24.30 | 31.53	20.33 | 26.08	15.30
Librispeech (clean | other)	1.58 | 2.84	1.88 | 4.12	1.56 | 3.30	1.39
CV15-en	-	9.47	7.79	10.01	9.89	7.61	6.05	5.94
CV15-zh	-	24.67	19.30	9.84	8.00	5.13	4.31	4.28
Fleurs-en	3.40	3.96	3.77	3.32	2.94	3.77	2.72	2.74
Fleurs-zh	2.69	12.22	7.98	2.44	2.71	2.54	2.20	2.19
Multilingual ASR (WER)
Fleurs-avg (19 languages)	-	15.67	8.09	4.48	5.55	14.04	5.33	5.31
Lyric ASR (WER)
MIR-1K (vocal-only)	6.45	23.33	18.73	11.87	9.85	8.15	5.90	5.85
Opencpop-test	2.98	31.01	16.06	7.93	6.49	2.84	1.54	2.02
Speech-to-text translation (BLEU)
Fleurs (English to other languages)	-	30.35	37.85	-	39.25	29.22	37.50	36.22
Fleurs (other languages to English)	-	27.54	32.81	-	35.41	28.61	31.08	30.71
Fleurs (Chinese to other languages)	-	17.03	22.05	-	26.63	17.97	25.17	25.10
Fleurs (other languages to Chinese)	-	28.75	34.82	-	37.50	27.68	33.13	31.19

Task	GPT-4o-Audio	Gemini-2.5-Flash	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Instruct	Qwen3-Omni-Flash-Thinking
VoiceBench
AlpacaEval	95.6	96.1	94.3	89.9	94.8	96.4	95.4	96.8
CommonEval	89.8	88.3	88.4	76.7	90.8	90.5	91.0	90.9
WildVoice	91.6	92.1	93.4	77.7	91.6	90.5	92.3	90.9
SD-QA	75.5	84.5	90.1	56.4	76.9	78.1	76.8	78.5
MMSU	80.3	66.1	71.1	61.7	68.1	83.0	68.4	84.3
OpenBookQA	89.2	56.9	92.3	80.9	89.7	94.3	91.4	95.0
BBH	84.1	83.9	92.6	66.7	80.4	88.9	80.6	89.6
IFEval	76.0	83.8	85.7	53.5	77.8	80.6	75.2	80.8
AdvBench	98.7	98.9	98.1	99.2	99.3	97.2	99.4	98.9
Overall	86.8	83.4	89.6	73.6	85.5	88.8	85.6	89.5
Audio reasoning
MMAU-v05.15.25	62.5	71.8	77.4	65.5	77.5	75.4	77.6	76.5
MMSU	56.4	70.2	77.7	62.6	69.0	70.2	69.1	71.3

Task	Best specialist model	GPT-4o-Audio	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
RUL-MuchoMusic	47.6 (Audio Flamingo 3)	36.1	49.4	47.3	52.0	52.1
GTZAN (accuracy)	87.9 (CLaMP 3)	76.5	81.0	81.7	93.0	93.1
MTG Genre (Micro F1)	35.8 (MuQ-MuLan)	25.3	32.6	32.5	39.0	39.5
MTG Mood/Theme (Micro F1)	10.9 (MuQ-MuLan)	11.3	14.1	8.9	21.0	21.7
MTG Instrument (Micro F1)	39.8 (MuQ-MuLan)	34.2	33.0	22.6	40.5	40.7
MTG Top50 (Micro F1)	33.2 (MuQ-MuLan)	25.0	26.1	21.6	36.7	36.9
MagnaTagATune (Micro F1)	41.6 (MuQ)	29.2	28.1	30.1	44.3	46.8

(3) Vision -> Text

Dataset	GPT-4o	Gemini-2.0-Flash	Qwen2.5-VL 72B	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
General visual question answering
MMStar	64.7	71.4	70.8	68.5	69.3
HallusionBench	55.0	56.3	55.2	59.7	58.5
MM-MT-Bench	7.7	6.7	7.6	7.4	7.6
Math & STEM
MMMU_val	69.1	71.3	70.2	69.1	69.8
MMMU_pro	51.9	56.1	51.1	57.0	57.6
MathVista_mini	63.8	71.4	74.8	75.9	77.4
MathVision_full	30.4	48.6	38.1	56.3	58.3
Document understanding
AI2D	84.6	86.7	88.7	85.2	86.4
ChartQA_test	86.7	64.6	89.5	86.8	87.1
Counting
CountBench	87.9	91.2	93.6	90.0	90.0
Video understanding
Video-MME	71.9	72.4	73.3	70.5	71.4
LVBench	30.8	57.9	47.3	50.2	51.1
MLVU	64.6	71.0	74.6	75.2	75.5

Dataset	Gemini-2.5-flash-thinking	InternVL-3.5-241B-A28B	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General visual question answering
MMStar	75.5	77.9	74.9	75.5
HallusionBench	61.1	57.3	62.8	63.4
MM-MT-Bench	7.8	–	8.0	8.0
Math & STEM
MMMU_val	76.9	77.7	75.6	75.0
MMMU_pro	65.8	–	60.5	60.8
MathVista_mini	77.6	82.7	80.0	81.2
MathVision_full	62.3	63.9	62.9	63.8
Document understanding
AI2D_test	88.6	87.3	86.1	86.8
ChartQA_test	–	88.0	89.5	89.3
Counting
CountBench	88.6	–	88.6	92.5
Video understanding
Video-MME	79.6	72.9	69.7	69.8
LVBench	64.5	–	49.0	49.5
MLVU	82.1	78.2	72.9	73.9

(4) AudioVisual -> Text

Dataset	Previous open-source SOTA	Gemini-2.5-Flash	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
WorldSense	47.1	50.9	45.4	54.0	54.1

Dataset	Previous open-source SOTA	Gemini-2.5-Flash-Thinking	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
DailyOmni	69.8	72.7	75.8	76.2
VideoHolmes	55.6	49.5	57.3	57.3

(5) Zero-shot Speech Generation

Dataset	Model	Content consistency
SEED (test-zh | test-en)	Seed-TTS_{ICL}
SEED (test-zh | test-en)	Seed-TTS_{RL}
SEED (test-zh | test-en)	MaskGCT
SEED (test-zh | test-en)	E2 TTS
SEED (test-zh | test-en)	F5-TTS
SEED (test-zh | test-en)	Spark TTS
SEED (test-zh | test-en)	CosyVoice 2
SEED (test-zh | test-en)	CosyVoice 3
SEED (test-zh | test-en)	Qwen2.5-Omni-7B
SEED (test-zh | test-en)	Qwen3-Omni-30B-A3B

(6) Multilingual Speech Generation

Language	Content consistency			Speaker similarity
	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs
Chinese	0.716	2.252	16.026	0.772	0.780	0.677
English	1.069	2.164	2.339	0.773	0.756	0.613
German	0.777	1.906	0.572	0.738	0.733	0.614
Italian	1.067	1.543	1.743	0.742	0.699	0.579
Portuguese	1.872	1.877	1.331	0.770	0.805	0.711
Spanish	1.765	1.029	1.084	0.744	0.762	0.615
Japanese	3.631	3.519	10.646	0.763	0.776	0.738
Korean	1.670	1.747	1.865	0.778	0.776	0.700
French	2.505	4.099	5.216	0.689	0.628	0.535
Russian	3.986	4.281	3.878	0.759	0.761	0.676

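(7) Cross-Lingual Speech Generation

Language pair	Qwen3-Omni-30B-A3B	CosyVoice3	CosyVoice2
English to Chinese	5.37	5.09	13.5
Japanese to Chinese	3.32	3.05	48.1
Korean to Chinese	0.99	1.06	7.70
Chinese to English	2.76	2.98	6.47
Japanese to English	3.31	4.20	17.1
Korean to English	3.34	4.19	11.2
Chinese to Japanese	8.29	7.08	13.1
English to Japanese	7.53	6.80	14.9
Korean to Japanese	4.24	3.93	5.86
Chinese to Korean	5.13	14.4	24.8
English to Korean	4.96	5.87	21.9
Japanese to Korean	6.23	7.92	21.5

2. Evaluation settings

(1) Decoding strategy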

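In all evaluation benchmarks, the Instruct models of the Qwen3-Omni series use greedy decoding without sampling; for the Thinking models, the decoding parameters are taken from the generation_config.json file in the checkpoint.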
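
The two decoding setups could be selected as sketched below. This is a hedged illustration: decoding_kwargs is a hypothetical helper, do_sample=False is used as the usual way to request greedy decoding in Transformers, and the keys read from generation_config.json (do_sample, temperature, top_p, top_k) are assumed common fields rather than a confirmed list for these checkpoints.

```python
import json
from pathlib import Path


def decoding_kwargs(model_kind, checkpoint_dir=None):
    """Return generation keyword arguments for evaluation (hypothetical helper)."""
    if model_kind == "instruct":
        # Greedy decoding: no sampling.
        return {"do_sample": False}
    if model_kind == "thinking":
        config_path = Path(checkpoint_dir) / "generation_config.json"
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # Assumed sampling-related keys; any that are absent are skipped.
        wanted = ("do_sample", "temperature", "top_p", "top_k")
        return {key: config[key] for key in wanted if key in config}
    raise ValueError(f"unknown model kind: {model_kind!r}")


print(decoding_kwargs("instruct"))  # prints {'do_sample': False}
```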

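(2) Benchmark-specific formatting

Most evaluation benchmarks come with their own ChatML formatting for embedding the question or prompt. Note that all video data was set to fps=2 during evaluation.

(3) Default prompts

Some benchmark tasks do not include a prompt. For those, the following prompts can be used:

Task type	Prompt
Chinese automatic speech recognition (ASR)	请将这段中文语音转换为纯文本。
ASR for other languages	Transcribe the audio into text.
Speech-to-text translation (S2TT)	Listen to the provided speech and produce a translation in text.
Lyric recognition	Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations.

(4) System prompt

No system prompt is set for any evaluation benchmark.

(5) Input sequence

The question or prompt must be provided as user text. Unless the benchmark specifies otherwise, the text comes after the multimodal data. For example: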

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe the audio, image and video."},
        ],
    },
]
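
The evaluation conventions above (default prompts, no system prompt, text placed after the media) can be combined in one small builder. This is a sketch: the dictionary keys such as "asr_zh" and the function build_eval_messages are hypothetical names chosen here, while the prompt strings are copied from the default prompt table.

```python
DEFAULT_PROMPTS = {
    "asr_zh": "请将这段中文语音转换为纯文本。",
    "asr_other": "Transcribe the audio into text.",
    "s2tt": "Listen to the provided speech and produce a translation in text.",
    "lyrics": (
        "Transcribe the song lyrics into text without any punctuation, "
        "separate lines with line breaks, and output only the lyrics "
        "without additional explanations."
    ),
}


def build_eval_messages(media, task=None, prompt=None):
    """Build a single user turn with the text placed after all media items."""
    text = prompt if prompt is not None else DEFAULT_PROMPTS[task]
    content = [dict(item) for item in media]
    content.append({"type": "text", "text": text})
    # No system message is added, matching the evaluation setup.
    return [{"role": "user", "content": content}]


messages = build_eval_messages(
    [{"type": "audio", "audio": "/path/to/audio.wav"}], task="asr_other"
)
print(messages[0]["content"][-1]["text"])  # prints Transcribe the audio into text.
```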
