Performance
A. MLLM Evaluation Results
To evaluate MLCD within multimodal large language models (MLLMs), we replaced the CLIP vision tower in LLaVA-NeXT with MLCD and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and LLaVA-NeXT-Data for structured fine-tuning. With the same ViT-L-14-336px architecture, MLCD scores higher than CLIP on 15 of the 17 benchmarks reported below (CLIP is slightly ahead on POPE and TextVQA-Val), which indicates that MLCD is an effective vision tower for MLLMs.
| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|---|---|---|---|---|---|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
|  | √ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
|  | √ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| GQA | 64.17 | 63.31 |
| ScienceQA-Img | 78.09 | 76.35 |
| InfoVQA-Val | 43.48 | 38.88 |
| MMBenchCN-Dev | 74.83 | 72.51 |
| MMBenchEN-Dev | 76.37 | 74.57 |
| SeedBench | 68.20 | 66.80 |
| SeedBench-Img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| POPE | 88.69 | 88.83 |
| ChartQA | 67.84 | 66.52 |
| DocVQA-Val | 76.46 | 75.21 |
| TextVQA-Val | 61.69 | 62.47 |
| OCRBench | 531 | 525 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
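The per-benchmark gaps in the MLCD vs. CLIP comparison above can be tallied with a few lines of Python. This is only a convenience sketch: the scores are copied from the table, while the variable names are illustrative. The benchmarks use different scales (OCRBench and MME are not percentages), so the raw differences should not be compared across rows.

```python
# (MLCD, CLIP) scores copied from the comparison table above.
# Both columns use ViT_L_14_336px with Qwen2.5-7B.
scores = {
    "AI2D": (76.98, 73.15),
    "GQA": (64.17, 63.31),
    "ScienceQA-Img": (78.09, 76.35),
    "InfoVQA-Val": (43.48, 38.88),
    "MMBenchCN-Dev": (74.83, 72.51),
    "MMBenchEN-Dev": (76.37, 74.57),
    "SeedBench": (68.20, 66.80),
    "SeedBench-Img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "POPE": (88.69, 88.83),
    "ChartQA": (67.84, 66.52),
    "DocVQA-Val": (76.46, 75.21),
    "TextVQA-Val": (61.69, 62.47),
    "OCRBench": (531, 525),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
}

# Difference MLCD - CLIP per benchmark, rounded to the table's precision.
deltas = {name: round(mlcd - clip, 2) for name, (mlcd, clip) in scores.items()}
mlcd_wins = [name for name, delta in deltas.items() if delta > 0]
clip_wins = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:<16} {delta:+.2f}")
print(f"MLCD ahead on {len(mlcd_wins)} of {len(scores)} benchmarks")
print(f"CLIP ahead on: {', '.join(clip_wins)}")
```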
B. Linear Probe Evaluation Results
A linear probe freezes the pre-trained model’s weights and trains a linear classifier on top to assess how well the model’s representations generalize to different tasks. The tables below report ImageNet linear probe accuracy for several MLCD models, followed by a comparison of CLIP and MLCD on the ViT_L_14_336px architecture across various datasets.
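The linear probe protocol described above can be illustrated with a minimal, dependency-free sketch. It is not the evaluation code behind the reported numbers: `frozen_encoder` is a hypothetical stand-in for a frozen vision tower, the data are synthetic 2D points, and the head is a plain logistic-regression classifier trained with SGD. The point it shows is that features are computed once by a fixed encoder and only the linear head is updated.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical stand-in for a frozen backbone: a fixed map, never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_linear_probe(features, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression head on frozen features with plain SGD."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for feat, label in zip(features, labels):
            logit = sum(w * f for w, f in zip(weights, feat)) + bias
            error = sigmoid(logit) - label  # gradient of log loss w.r.t. logit
            weights = [w - lr * error * f for w, f in zip(weights, feat)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, features, labels):
    """Fraction of samples whose predicted class matches the label."""
    hits = 0
    for feat, label in zip(features, labels):
        logit = sum(w * f for w, f in zip(weights, feat)) + bias
        hits += int((logit > 0) == bool(label))
    return hits / len(labels)


def make_toy_split(rng, n):
    """Synthetic data: two Gaussian blobs centred at (-1, -1) and (+1, +1)."""
    points, labels = [], []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else -1.0
        points.append((rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)))
        labels.append(label)
    return points, labels


rng = random.Random(0)
train_x, train_y = make_toy_split(rng, 200)
test_x, test_y = make_toy_split(rng, 100)

# Features are extracted once; only the linear head receives gradient updates.
train_feats = [frozen_encoder(x) for x in train_x]
test_feats = [frozen_encoder(x) for x in test_x]

weights, bias = train_linear_probe(train_feats, train_y)
probe_accuracy = accuracy(weights, bias, test_feats, test_y)
print(f"linear probe accuracy: {probe_accuracy:.2%}")
```

In a real probe, `frozen_encoder` would be replaced by image embeddings from the pre-trained model, and the head would be a multi-class classifier over the dataset's labels.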
The results of the ImageNet linear probe are as follows:
| Model Name | ImageNet Linear Probe | Hugging Face |
|---|---|---|
| MLCD-ViT-B-32-224px | 79.1 | |
| MLCD-ViT-L-14-336px | 86.3 | |
| MLCD-ViT-bigG-14-224px | 87.1 | |
| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| ImageNet | 86.10 | 85.40 |
Convert a PyTorch checkpoint to Hugging Face format
python convert_vit_bigG_14_rope2d_to_hf.py \
--pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
--checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
--image_size 336