DINOv3: a family of vision models that produce high-quality, general-purpose image features without task-specific fine-tuning

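DINOv3 is a family of vision foundation models that produce high-quality dense features. Cosine similarity maps make this visible: they show how the output feature of the patch marked with a red cross relates to the features of all other patches. The models perform strongly across a wide range of vision tasks and, in many settings, outperform the specialized state of the art without any fine-tuning.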
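To make the cosine similarity map concrete, here is a minimal sketch using NumPy. It assumes the patch features have already been extracted and converted to an array of shape (H * W, D) in row-major order. The function name, the 14x14 grid, and the 384-dimensional random stand-in features are illustrative assumptions, not part of the DINOv3 code.

```python
import numpy as np

def cosine_similarity_map(patch_features, grid_hw, ref_rc):
    """Cosine similarity between one reference patch and every patch.

    patch_features: array of shape (H * W, D), one row per patch (row-major).
    grid_hw: (H, W) size of the patch grid.
    ref_rc: (row, col) of the reference patch, i.e. the "red cross".
    """
    h, w = grid_hw
    feats = np.asarray(patch_features, dtype=np.float64)
    if feats.shape[0] != h * w:
        raise ValueError("number of patches does not match the grid size")
    # L2-normalize each row so that dot products become cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-12, None)
    ref = feats[ref_rc[0] * w + ref_rc[1]]
    return (feats @ ref).reshape(h, w)

# Random stand-in features; real ones would come from a DINOv3 backbone.
rng = np.random.default_rng(0)
features = rng.normal(size=(14 * 14, 384))
sim_map = cosine_similarity_map(features, (14, 14), (7, 7))
print(sim_map.shape)
```

The resulting 2D array can be shown with `plt.imshow` to obtain a heat map; the reference patch always has similarity 1 with itself.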

ViT models pretrained on a web dataset (LVD-1689M)

Model	Parameters	Pretraining dataset	Download
ViT-S/16 distilled	21M	LVD-1689M	[link]
ViT-S+/16 distilled	29M	LVD-1689M	[link]
ViT-B/16 distilled	86M	LVD-1689M	[link]
ViT-L/16 distilled	300M	LVD-1689M	[link]
ViT-H+/16 distilled	840M	LVD-1689M	[link]
ViT-7B/16	6,716M	LVD-1689M	[link]

ConvNeXt models pretrained on a web dataset (LVD-1689M)

Model	Parameters	Pretraining dataset	Download
ConvNeXt Tiny	29M	LVD-1689M	[link]
ConvNeXt Small	50M	LVD-1689M	[link]
ConvNeXt Base	89M	LVD-1689M	[link]
ConvNeXt Large	198M	LVD-1689M	[link]

ViT models pretrained on a satellite dataset (SAT-493M)

Model	Parameters	Pretraining dataset	Download
ViT-L/16 distilled	300M	SAT-493M	[link]
ViT-7B/16	6,716M	SAT-493M	[link]

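Loading pretrained backbones

Via PyTorch Hub

First install PyTorch by following the official instructions (it is the only dependency needed to load the models). Installing a CUDA-enabled build of PyTorch is strongly recommended. The loading code is: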

import torch

REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/DINOV3/REPO/WAS/CLONED>

# DINOv3 ViT models pretrained on web images
dinov3_vits16 = torch.hub.load(REPO_DIR, 'dinov3_vits16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vits16plus = torch.hub.load(REPO_DIR, 'dinov3_vits16plus', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vitb16 = torch.hub.load(REPO_DIR, 'dinov3_vitb16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vitl16 = torch.hub.load(REPO_DIR, 'dinov3_vitl16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vith16plus = torch.hub.load(REPO_DIR, 'dinov3_vith16plus', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vit7b16 = torch.hub.load(REPO_DIR, 'dinov3_vit7b16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

# DINOv3 ConvNeXt models pretrained on web images
dinov3_convnext_tiny = torch.hub.load(REPO_DIR, 'dinov3_convnext_tiny', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_small = torch.hub.load(REPO_DIR, 'dinov3_convnext_small', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_base = torch.hub.load(REPO_DIR, 'dinov3_convnext_base', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_large = torch.hub.load(REPO_DIR, 'dinov3_convnext_large', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

# DINOv3 ViT models pretrained on satellite imagery
# (same hub entries, loaded with the SAT-493M checkpoints; these assignments
# overwrite the web-pretrained variables of the same name above)
dinov3_vitl16 = torch.hub.load(REPO_DIR, 'dinov3_vitl16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vit7b16 = torch.hub.load(REPO_DIR, 'dinov3_vit7b16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
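When choosing among these hub entries, it can help to pick the largest backbone that fits a parameter budget. The sketch below is a hypothetical helper, not part of the repository. It pairs each hub entry name with the parameter count from the tables above, assuming the entries and table rows correspond in the order listed.

```python
# Parameter counts in millions, copied from the LVD-1689M tables above.
# The pairing of hub entry names with table rows is an assumption.
WEB_BACKBONES = {
    "dinov3_vits16": 21,
    "dinov3_vits16plus": 29,
    "dinov3_vitb16": 86,
    "dinov3_vitl16": 300,
    "dinov3_vith16plus": 840,
    "dinov3_vit7b16": 6716,
    "dinov3_convnext_tiny": 29,
    "dinov3_convnext_small": 50,
    "dinov3_convnext_base": 89,
    "dinov3_convnext_large": 198,
}

def largest_backbone_under(budget_millions, prefix="dinov3_vit"):
    """Return the largest hub entry with the given prefix within the budget."""
    candidates = {
        name: params
        for name, params in WEB_BACKBONES.items()
        if name.startswith(prefix) and params <= budget_millions
    }
    if not candidates:
        raise ValueError(f"no backbone with prefix {prefix!r} fits {budget_millions}M")
    return max(candidates, key=candidates.get)

vit_choice = largest_backbone_under(100)
convnext_choice = largest_backbone_under(100, prefix="dinov3_convnext")
print(vit_choice, convnext_choice)
```

The returned name can then be passed as the second argument of `torch.hub.load`, as in the loading code above.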

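Via Hugging Face Transformers

All backbones are available in the DINOv3 collection on the Hugging Face Hub and are supported by the Hugging Face Transformers library. Below are short examples of obtaining an image embedding with the [Pipeline] and the [AutoModel] classes:

Using the Pipeline class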

from transformers import pipeline
from transformers.image_utils import load_image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = load_image(url)

feature_extractor = pipeline(
    model="facebook/dinov3-convnext-tiny-pretrain-lvd1689m",
    task="image-feature-extraction",
)
features = feature_extractor(image)

Using the AutoModel class

import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

pretrained_model_name = "facebook/dinov3-convnext-tiny-pretrain-lvd1689m"

processor = AutoImageProcessor.from_pretrained(pretrained_model_name)
model = AutoModel.from_pretrained(
    pretrained_model_name,
    device_map="auto",
)

inputs = processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output

print("Pooled output shape:", pooled_output.shape)

In the code above, `model` and `pretrained_model_name` can be any of the following:
• facebook/dinov3-vits16-pretrain-lvd1689m
• facebook/dinov3-vits16plus-pretrain-lvd1689m
• facebook/dinov3-vitb16-pretrain-lvd1689m
• facebook/dinov3-vitl16-pretrain-lvd1689m
• facebook/dinov3-vith16plus-pretrain-lvd1689m
• facebook/dinov3-vit7b16-pretrain-lvd1689m
• facebook/dinov3-convnext-base-pretrain-lvd1689m
• facebook/dinov3-convnext-large-pretrain-lvd1689m
• facebook/dinov3-convnext-small-pretrain-lvd1689m
• facebook/dinov3-convnext-tiny-pretrain-lvd1689m
• facebook/dinov3-vitl16-pretrain-sat493m
• facebook/dinov3-vit7b16-pretrain-sat493m

Image transforms

For models using the LVD-1689M weights (pretrained on web images)

Use the standard ImageNet evaluation transform:

import torchvision
from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])
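For the ViT backbones, the "/16" in the model names denotes a 16x16 pixel patch size, so the `resize_size` argument determines how many patch tokens the model sees. The helper below is a small illustrative sketch (the function name is hypothetical). It assumes a square input whose side is a multiple of the patch size and does not count any extra tokens the model may add.

```python
def patch_grid(resize_size: int = 224, patch_size: int = 16):
    """Return (grid side, number of patches) for a square input image."""
    if resize_size % patch_size != 0:
        raise ValueError("resize_size is assumed to be a multiple of patch_size")
    side = resize_size // patch_size
    return side, side * side

# Sizes used in this document: the default transform and the head examples.
for size in (224, 896, 1024):
    side, n_patches = patch_grid(size)
    print(f"{size}x{size} input: {side}x{side} grid, {n_patches} patches")
```

Doubling the input side quadruples the number of patches, which is worth keeping in mind when choosing `resize_size` for dense tasks.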

For models using the SAT-493M weights (pretrained on satellite imagery)

Use the following transform:

import torchvision
from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.430, 0.411, 0.296),
        std=(0.213, 0.156, 0.143),
    )
    return transforms.Compose([to_tensor, resize, normalize])
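Because the two weight families use different normalization statistics, mixing them up is an easy mistake. The sketch below shows one way to select the statistics from the checkpoint name, assuming the Hugging Face naming pattern listed earlier, where the identifier ends in `lvd1689m` or `sat493m`. Both helper names are hypothetical.

```python
# Mean and std per pretraining dataset, copied from the two transforms above.
NORM_STATS = {
    "lvd1689m": ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    "sat493m": ((0.430, 0.411, 0.296), (0.213, 0.156, 0.143)),
}

def norm_stats_for(model_id: str):
    """Pick (mean, std) from the dataset suffix of a model identifier."""
    dataset = model_id.rsplit("-", 1)[-1]
    if dataset not in NORM_STATS:
        raise ValueError(f"unknown pretraining dataset suffix: {dataset!r}")
    return NORM_STATS[dataset]

def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std per channel, as transforms.Normalize does."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

mean, std = norm_stats_for("facebook/dinov3-vitl16-pretrain-sat493m")
# A pixel equal to the dataset mean maps to zero in every channel.
print(normalize_pixel((0.430, 0.411, 0.296), mean, std))
```

The selected `mean` and `std` can be passed to `transforms.Normalize` in place of the hard-coded values.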

Pretrained heads

Image classification

Backbone	Pretraining dataset	Head dataset	Download
ViT-7B/16	LVD-1689M	ImageNet	[link]

The full classifier model can be loaded via PyTorch Hub:

import torch

# Load the DINOv3 classifier model
dinov3_vit7b16_lc = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_lc', source="local", weights=<CLASSIFIER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

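Depther head trained on the SYNTHMIX dataset

Backbone	Pretraining dataset	Head dataset	Download
ViT-7B/16	LVD-1689M	SYNTHMIX	[link]

To load the depth estimation model:

depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

Full example of running the depther on an image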

from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 1024
img = get_img()

transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        depths = depther(batch_img)

plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")

plt.subplot(122)
plt.imshow(depths[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")

Detector head trained on the COCO2017 dataset

Backbone	Pretraining dataset	Head dataset	Download
ViT-7B/16	LVD-1689M	COCO2017	[link]

To load the detector model:

detector = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_de', source="local", weights=<DETECTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

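Segmentor head trained on the ADE20K dataset

Backbone	Pretraining dataset	Head dataset	Download
ViT-7B/16	LVD-1689M	ADE20K	[link]

To load the segmentor model:

segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

Full example of running the segmentor on an image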

import sys
sys.path.append(REPO_DIR)

from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps
from functools import partial
from dinov3.eval.segmentation.inference import make_inference

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 896
img = get_img()

transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        pred_vit7b = segmentor(batch_img)  # raw predictions
        # actual segmentation map
        segmentation_map_vit7b = make_inference(
            batch_img,
            segmentor,
            inference_mode="slide",
            decoder_head_type="m2f",
            rescale_to=(img.size[-1], img.size[-2]),
            n_output_channels=150,
            crop_size=(img_size, img_size),
            stride=(img_size, img_size),
            output_activation=partial(torch.nn.functional.softmax, dim=1),
        ).argmax(dim=1, keepdim=True)
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")

plt.subplot(122)
plt.imshow(segmentation_map_vit7b[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")
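Beyond plotting, the argmax map can be summarized numerically. The sketch below computes the fraction of pixels assigned to each class. It assumes the map has been converted to an integer NumPy array (for example with `segmentation_map_vit7b[0, 0].cpu().numpy()`) and uses the 150 output channels from the example above. The function name and the toy map are illustrative only.

```python
import numpy as np

def class_pixel_fractions(segmentation_map, n_classes=150):
    """Fraction of pixels assigned to each class index."""
    labels = np.asarray(segmentation_map).reshape(-1)
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

# Toy 2x3 map standing in for a real prediction.
toy_map = np.array([[0, 0, 3], [3, 3, 149]])
fractions = class_pixel_fractions(toy_map)
top = np.argsort(fractions)[::-1][:3]
for idx in top:
    print(f"class {idx}: {fractions[idx]:.2f}")
```

Mapping class indices to human-readable names requires the label list of the head dataset, which is not covered here.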

Zero-shot tasks with `dino.txt`

Backbone	Download
ViT-L/16 distilled	[link], vocabulary, vocabulary license

The full dino.txt model can be loaded via PyTorch Hub:

import torch

# Load the DINOv3 dino.txt model
dinov3_vitl16_dinotxt_tet1280d20h24l, tokenizer = torch.hub.load(REPO_DIR, 'dinov3_vitl16_dinotxt_tet1280d20h24l', source='local', weights=<DINOTXT/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
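Once an image embedding and one text embedding per class prompt are available, zero-shot classification typically reduces to comparing them by cosine similarity. The sketch below illustrates only that final step with NumPy and hand-written vectors. It does not call the dino.txt model or tokenizer, and the function name and temperature value are assumptions.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and N text prompts.

    image_emb: array of shape (D,).
    text_embs: array of shape (N, D), one row per class prompt.
    temperature: hypothetical scaling factor; smaller values sharpen the output.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Three toy "class prompt" embeddings and an image embedding close to the second.
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])
probs = zero_shot_probs(image_emb, text_embs)
best = int(probs.argmax())
print(best)
```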

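Installation

The training and evaluation code requires PyTorch version >= 2.7.1 as well as a few other third-party packages. The code has only been tested with the specified versions and expects a Linux environment. To set up the dependencies required for training and evaluation, follow the steps below:

micromamba (recommended): clone the repository, then create and activate a `dinov3` conda environment using the provided environment definition: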

micromamba env create -f conda.yaml
micromamba activate dinov3

Data preparation

ImageNet-1k

The root directory of the dataset should hold the following contents:

<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[..]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt

The provided dataset implementation also expects a few additional metadata files to be present under the extra directory:

<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy

These metadata files can be generated (once) with the following Python code:

from dinov3.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()

Note that the root and extra directories do not have to be distinct directories.
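Before launching a long job, it can be useful to check that the expected top-level entries exist. The following sketch uses only the standard library and the file names listed above. The helper `missing_entries` is hypothetical and not part of the dinov3 package, and it checks only for presence, not for contents.

```python
import tempfile
from pathlib import Path

EXPECTED_ROOT = ("test", "train", "val", "labels.txt")
EXPECTED_EXTRA = (
    "class-ids-TRAIN.npy",
    "class-ids-VAL.npy",
    "class-names-TRAIN.npy",
    "class-names-VAL.npy",
    "entries-TEST.npy",
    "entries-TRAIN.npy",
    "entries-VAL.npy",
)

def missing_entries(root, extra):
    """Return the expected paths that do not exist under root and extra."""
    root, extra = Path(root), Path(extra)
    missing = [str(root / name) for name in EXPECTED_ROOT if not (root / name).exists()]
    missing += [str(extra / name) for name in EXPECTED_EXTRA if not (extra / name).exists()]
    return missing

# Demonstration on a temporary directory that is only partially populated;
# root and extra are deliberately the same directory here.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "train").mkdir()
    (demo_root / "labels.txt").write_text("")
    demo_missing = missing_entries(demo_root, demo_root)

print(len(demo_missing), "entries missing")
```

An empty list means every expected top-level entry is in place.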

ImageNet-22k

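Adapt the dataset class to match your local setup.

Note: to run the training and evaluation commands below, the dinov3 package must be on the Python module search path. Simply prefix each command with PYTHONPATH=.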

Training

Fast setup: training DINOv3 ViT-L/16 on ImageNet-1k

Run DINOv3 pretraining on 4 H100-80GB nodes (32 GPUs in total) in a SLURM cluster environment with submitit:

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/train/train.py \
  --nodes 4 \
  --config-file dinov3/configs/train/vitl_im1k_lin834.yaml \
  --output-dir <PATH/TO/OUTPUT/DIR> \
  train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

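Training takes approximately 14 hours, and the resulting checkpoint should reach 82.0% accuracy on k-NN evaluation and 83.5% on linear evaluation. The training code saves the teacher weights to the eval folder every 12,500 iterations for later evaluation.

Full DINOv3 setup: training DINOv3 ViT-7B/16

DINOv3 ViT-7B/16 is trained on a private dataset. Training involves 3 stages:

1. Pretraining

Launch DINOv3 ViT-7B/16 pretraining on 32 nodes (256 GPUs in total) in a SLURM cluster environment with submitit: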

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/train/train.py \
  --nodes 32 \
  --config-file dinov3/configs/train/dinov3_vit7b16_pretrain.yaml \
  --output-dir <PATH/TO/OUTPUT/DIR> \
  train.dataset_path=<DATASET>:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

2. Gram anchoring

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/train/train.py \
  --nodes 32 \
  --config-file dinov3/configs/train/dinov3_vit7b16_gram_anchor.yaml \
  --output-dir <PATH/TO/OUTPUT/DIR> \
  train.dataset_path=<DATASET>:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
  gram.ckpt=<PATH/TO/GRAM_TEACHER_FROM_PREVIOUS_STEP>

3. High-resolution adaptation

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/train/train.py \
  --nodes 32 \
  --config-file dinov3/configs/train/dinov3_vit7b16_high_res_adapt.yaml \
  --output-dir <PATH/TO/OUTPUT/DIR> \
  train.dataset_path=<DATASET>:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
  gram.ckpt=<PATH/TO/TEACHER_FROM_GRAM> \
  student.resume_from_teacher_chkpt=<PATH/TO/TEACHER_FROM_GRAM>

Multi-distillation

Test setup:

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/train/train.py \
  --nodes 1 \
  --config-file dinov3/configs/train/multi_distillation_test.yaml \
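  --output-dir <PATH/TO/OUTPUT/DIR> \
  --multi-distillation \
  train.dataset_path=<DATASET>:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Evaluation

The training code regularly saves the teacher weights. To evaluate the model, run the following commands on a single node:

Logistic regression classification on ImageNet-1k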

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/eval/log_regression.py \
  model.config_file=<PATH/TO/OUTPUT/DIR>/config.yaml \
  model.pretrained_weights=<PATH/TO/OUTPUT/DIR>/teacher_checkpoint.pth \
  output_dir=<PATH/TO/OUTPUT/DIR> \
  train.dataset=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
  eval.test_dataset=ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

k-NN classification on ImageNet-1k

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/eval/knn.py \
  model.config_file=<PATH/TO/OUTPUT/DIR>/config.yaml \
  model.pretrained_weights=<PATH/TO/OUTPUT/DIR>/teacher_checkpoint.pth \
  output_dir=<PATH/TO/OUTPUT/DIR> \
  train.dataset=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
  eval.test_dataset=ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
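As background on what a k-NN evaluation measures, the sketch below classifies query features by a plain majority vote among their k most cosine-similar training features. It is a simplified illustration with toy 2D features and does not reproduce the implementation in dinov3/eval/knn.py, which may weight votes differently. The function name is hypothetical.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Majority-vote k-NN on L2-normalized features (cosine similarity)."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T                      # shape (num_queries, num_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar
    votes = train_labels[nearest]               # shape (num_queries, k)
    n_classes = int(train_labels.max()) + 1
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])

# Two toy clusters standing in for frozen backbone features.
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                        [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])
predictions = knn_predict(train_feats, train_labels, query_feats)
print(predictions)
```

Because the backbone stays frozen and no classifier is trained, k-NN accuracy is a direct probe of how well the features cluster by class.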

Linear classification with data augmentation on ImageNet-1k

PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/eval/linear.py \
  model.config_file=<PATH/TO/OUTPUT/DIR>/config.yaml \
  model.pretrained_weights=<PATH/TO/OUTPUT/DIR>/teacher_checkpoint.pth \
  output_dir=<PATH/TO/OUTPUT/DIR> \
  train.dataset=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
  train.val_dataset=ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Text alignment on DINOv3 using dino.txt

Text alignment can be done following the method of dino.txt, also known as DINOv2 Meets Text:

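# An example config for text alignment is at dinov3/eval/text/configs/dinov3_vitl_text.yaml
PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/eval/text/train_dinotxt.py \
  --nodes 4 \
  trainer_config_file="<PATH/TO/DINOv3/TEXT/CONFIG>" \
  output-dir=<PATH/TO/OUTPUT/DIR>

The command above trains the text alignment model on 4 nodes with 8 GPUs each (32 GPUs in total). Note that the text alignment model in the DINOv3 paper was trained on a private dataset. The example config dinov3/eval/text/configs/dinov3_vitl_text.yaml uses the CocoCaptions dataset for illustration only; adapt the provided CocoCaptions dataset class to your needs. The dataset is available at the corresponding link.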
