Tencent Hunyuan-A13B: Installation and Usage Guide for the MoE Large Language Model

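Hunyuan-A13B is an open-source large language model built on a fine-grained Mixture-of-Experts (MoE) architecture. With 13 billion active parameters out of 80 billion in total, it is well suited to advanced reasoning and general-purpose use in resource-constrained environments. Its core features are a hybrid reasoning mode that supports both fast and slow thinking, native understanding of contexts up to 256K tokens, and strong results on agent tasks. Inference is made efficient through Grouped Query Attention (GQA) and multiple quantization formats. The pre-trained, instruction-tuned, FP8, and INT4-quantized versions have all been open-sourced, and they are competitive across benchmarks in mathematics, science, coding, reasoning, and agent tasks. The release also covers interaction through Hugging Face Transformers, a model training guide, and deployment with TensorRT-LLM, vLLM, and SGLang, including prebuilt Docker images and deployment recipes for the quantized models.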

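Hunyuan-A13B's architecture balances parameter scale against efficiency. The model has 80 billion parameters in total, but only 13 billion of them are active during inference, which substantially lowers compute requirements while keeping performance competitive.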
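As a rough illustration of what the active-versus-total split means in practice, the sketch below computes the share of parameters that is active and a weights-only memory estimate at several precisions. The bytes-per-parameter values are the nominal sizes of each format. The estimate ignores the KV cache, activations, and quantization metadata, so it is a lower bound for intuition rather than a deployment requirement.

```python
# Back-of-the-envelope figures for an 80B-total / 13B-active MoE model.
TOTAL_PARAMS = 80e9
ACTIVE_PARAMS = 13e9

# Nominal storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def active_fraction(active: float, total: float) -> float:
    """Share of the parameters that participates in a forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.2%}")
for fmt, size in BYTES_PER_PARAM.items():
    print(f"{fmt}: about {weight_memory_gb(TOTAL_PARAMS, size):.0f} GB of weights")
```

All 80 billion parameters still have to be stored, so the memory estimate uses the total count; the active count mainly determines compute per token.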

Benchmark Performance

Hunyuan-A13B performs strongly across a range of benchmarks. The table below compares it with other models:

Benchmark	Hunyuan-Large	Qwen2.5-72B	Qwen3-A22B	Hunyuan-A13B
MMLU	88.40	86.10	87.81	88.17
MMLU-Pro	60.20	58.10	68.18	67.23
MMLU-Redux	87.47	83.90	87.40	87.67
BBH	86.30	85.80	88.87	87.56
SuperGPQA	38.90	36.20	44.06	41.32
EvalPlus	75.69	65.93	77.60	78.64
MultiPL-E	59.13	60.50	65.94	69.33
MBPP	72.60	76.00	81.40	83.86
CRUX-I	57.00	57.63	-	70.13
CRUX-O	60.63	66.20	79.00	77.00
MATH	69.80	62.12	71.84	72.35
CMATH	91.30	84.80	-	91.17
GSM8k	92.80	91.50	94.39	91.83
GPQA	25.18	45.90	47.47	49.12

The Hunyuan-A13B-Instruct version stands out in several areas, particularly mathematics, science, and agent tasks:

Topic	Benchmark	OpenAI-o1-1217	DeepSeek R1	Qwen3-A22B	Hunyuan-A13B-Instruct
Mathematics	AIME 2024	74.3	79.8	85.7	87.3
	AIME 2025	79.2	70	81.5	76.8
	MATH	96.4	94.9	94.0	94.3
Science	GPQA-Diamond	78	71.5	71.1	71.2
	OlympiadBench	83.1	82.4	85.7	82.7
Coding	Livecodebench	63.9	65.9	70.7	63.9
	Fullstackbench	64.6	71.6	65.6	67.8
	ArtifactsBench	38.6	44.6	44.6	43
Reasoning	BBH	80.4	83.7	88.9	89.1
	DROP	90.2	92.2	90.3	91.1
	ZebraLogic	81	78.7	80.3	84.7
Instruction Following	IF-Eval	91.8	88.3	83.4	84.7
	SysBench	82.5	77.7	74.2	76.1
Text Creation	LengthCtrl	60.1	55.9	53.3	55.4
	InsCtrl	74.8	69	73.7	71.9
NLU	ComplexNLU	64.7	64.5	59.8	61.2
	Word-Task	67.1	76.3	56.4	62.9
Agent	BFCL v3	67.8	56.9	70.8	78.3
	τ-Bench	60.4	43.8	44.6	54.7
	ComplexFuncBench	47.6	41.1	40.6	61.2
	C3-Bench	58.8	55.3	51.7	63.5

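Using Transformers

The model uses slow-thinking (chain-of-thought, CoT) reasoning by default. There are two ways to disable it:

1. Pass "enable_thinking=False" when calling apply_chat_template.

2. Prefix the prompt with "/no_think" to force the model to skip CoT reasoning; prefixing it with "/think" forces CoT reasoning on.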
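The second option can also be applied programmatically. The sketch below is a hypothetical helper (`set_thinking_mode` is not part of transformers or of the Hunyuan release) that prefixes the most recent user message with the control tag. Separating the tag from the prompt with a single space is an assumption; the guide only states that the tag goes before the prompt.

```python
from copy import deepcopy


def set_thinking_mode(messages, think):
    """Return a copy of `messages` whose last user turn starts with /think or /no_think.

    Hypothetical helper: the single space after the tag is an assumption.
    """
    tag = "/think" if think else "/no_think"
    updated = deepcopy(messages)  # leave the caller's list untouched
    for message in reversed(updated):
        if message["role"] == "user":
            message["content"] = f"{tag} {message['content']}"
            break
    return updated


messages = [{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}]
print(set_thinking_mode(messages, think=False)[0]["content"])
```

The returned list can be passed to apply_chat_template in place of the original messages.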

The following example loads the model with the transformers library and runs it:

from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", enable_thinking=True)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)

output_text = tokenizer.decode(outputs[0])

# Slow-thinking output wraps the reasoning in <think>...</think>
# and the final answer in <answer>...</answer>.
think_pattern = r'<think>(.*?)</think>'
think_matches = re.findall(think_pattern, output_text, re.DOTALL)

answer_pattern = r'<answer>(.*?)</answer>'
answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)

# Fall back to an empty string if a tag pair is missing
# (for example, when the output was truncated).
think_content = think_matches[0].strip() if think_matches else ""
answer_content = answer_matches[0].strip() if answer_matches else ""
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")

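Quantization and Compression

The FP8 and INT4 quantized models of Hunyuan-A13B are produced with Tencent's in-house AngelSlim compression tool. The tool is expected to be open-sourced in early July and will support one-click quantization and compression of large models. In the meantime, the quantized models can be downloaded directly for deployment testing.

FP8 Quantization

FP8 static quantization uses a small amount of calibration data to determine the quantization scales in advance, then converts the model weights and activations to FP8. This improves inference efficiency and lowers the barrier to deployment. The Hunyuan-A13B-Instruct-FP8 quantized model can be used directly.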
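To make "determining the scale in advance" concrete, here is a minimal sketch of static per-tensor scale calibration. It is not the actual compression tool's code. It assumes the E4M3 variant of FP8, whose largest finite value is 448 (the guide does not say which FP8 format is used), and it stops at rescaling and clipping, leaving out the final cast to an 8-bit float type.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (the FP8 variant is an assumption)


def calibrate_scale(calibration_batches):
    """Static per-tensor scale: map the largest magnitude seen in calibration to E4M3_MAX."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX if amax > 0 else 1.0


def scale_and_clip(values, scale):
    """Rescale into the FP8 range and clip outliers; the FP8 cast itself is omitted."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]


# The scale is computed once from calibration data and then reused for every input.
scale = calibrate_scale([[0.5, -2.0], [1.0, 4.0]])
print(scale_and_clip([4.0, 8.0, -2.0], scale))
```

Because the scale is fixed ahead of time, values larger than anything seen during calibration (8.0 in the example) are clipped at inference time, which is why representative calibration data matters.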

Benchmark results for the FP8 quantized model:

Benchmark	Hunyuan-A13B-Instruct	Hunyuan-A13B-Instruct-FP8
AIME 2024	87.3	86.7
GSM8k	94.39	94.01
BBH	89.1	88.34
DROP	91.1	91.1

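INT4 Quantization

INT4 quantization uses the GPTQ algorithm to implement W4A16 (4-bit weights, 16-bit activations). It processes the model weights layer by layer and uses a small amount of calibration data to minimize the reconstruction error of the quantized weights. No retraining is required, and inference efficiency improves. The Hunyuan-A13B-Instruct-GPTQ-Int4 quantized model can be used directly.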
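For intuition about what 4-bit weights look like, the sketch below applies plain round-to-nearest symmetric quantization to one group of weights. This is deliberately not GPTQ: GPTQ additionally uses the calibration data to adjust the remaining weights so that each rounding error is compensated, whereas this baseline only shows the storage format (integer codes plus one scale per group). The symmetric [-8, 7] code range and the grouping are assumptions made for illustration.

```python
def quantize_group_int4(weights):
    """Round-to-nearest symmetric 4-bit quantization of one weight group (not GPTQ)."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Signed 4-bit codes live in [-8, 7].
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights; activations stay in 16-bit (the 'A16' part)."""
    return [c * scale for c in codes]


codes, scale = quantize_group_int4([3.5, -1.5, 0.6, 0.0])
print(codes, scale)
print(dequantize_group(codes, scale))
```

The gap between the original and dequantized weights is the reconstruction error that a calibration-based method such as GPTQ tries to minimize.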

Benchmark results for the INT4 quantized model:

Benchmark	Hunyuan-A13B-Instruct	Hunyuan-A13B-Instruct-GPTQ-Int4
OlympiadBench	82.7	84.0
AIME 2024	87.3	86.7
GSM8k	94.39	94.24
BBH	89.1	87.91
DROP	91.1	91.05

Model Deployment

The model can be deployed with TensorRT-LLM, vLLM, or SGLang, each of which can expose an OpenAI-compatible API endpoint.

TensorRT-LLM Deployment

Pull the Docker image:

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm

Run the container:

docker run --privileged --user root --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm

Prepare the configuration file (passed to the server below as /path/to/extra-llm-api-config.yml):

use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
print_iter_log: true

Start the API server:

trtllm-serve \
  /path/to/HunYuan-moe-A13B \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 16384 \
  --tp_size 2 \
  --kv_cache_free_gpu_memory_fraction 0.6 \
  --trust_remote_code \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml

vLLM Deployment

Pull the Docker image:

docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm

Start the server with the model downloaded from Hugging Face:

docker run  --privileged --user root  --net=host --ipc=host \
        -v ~/.cache:/root/.cache/ \
        --gpus=all -it --entrypoint python  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
         -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
         --tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code
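Once the vLLM server is up, it can be queried like any OpenAI-compatible endpoint. The sketch below uses only the Python standard library. It assumes the server started by the command above is reachable at http://localhost:8000 and serves the model under the name tencent/Hunyuan-A13B-Instruct. `build_chat_request` and `chat` are illustrative helper names, and `max_tokens=512` is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000",
                       model="tencent/Hunyuan-A13B-Instruct",
                       max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat("Write a short summary of the benefits of regular exercise"))
```

If the server was started with a local path as the --model value, pass that same value as `model`.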

Start the server with the model downloaded from ModelScope:

docker run  --privileged --user root  --net=host --ipc=host \
        -v ~/.cache/modelscope:/root/.cache/modelscope \
        --gpus=all -it --entrypoint python   hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
         -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 --port 8000 \
         --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust_remote_code

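Quantized Model Deployment

INT8 Quantized Model Deployment

Set the environment variable. Note that for INT8 it points to the BF16 model: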

export MODEL_PATH=PATH_TO_BF16_MODEL

Start the service:

sh run_server_int8.sh

INT4 Quantized Model Deployment

Set the environment variable:

export MODEL_PATH=PATH_TO_INT4_MODEL

Start the service:

sh run_server_int4.sh

FP8 Quantized Model Deployment

Set the environment variable:

export MODEL_PATH=PATH_TO_FP8_MODEL

Start the service:

sh run_server_fp8.sh

SGLang Deployment

Pull the Docker image:

docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang

Start the API server:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
    -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000

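Hunyuan-A13B gives developers an efficient and flexible large language model, and it provides a solid foundation for academic research, low-cost AI solution development, and exploratory applications. Give this open-source model a try.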
