Performance

A. MLLMs Evaluation Results

To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP vision tower in LLaVA-NeXT with MLCD and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and LLaVA-NeXT-Data for structured fine-tuning. On the same ViT-L-14-336px architecture, MLCD scores higher than CLIP on 15 of the 17 benchmarks reported below (all except POPE and TextVQA-Val), which indicates that it is an effective vision tower for MLLMs.

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|---|---|---|---|---|---|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| HF:MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| HF:MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| HF:MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| GQA | 64.17 | 63.31 |
| ScienceQA-Img | 78.09 | 76.35 |
| InfoVQA-Val | 43.48 | 38.88 |
| MMBenchCN-Dev | 74.83 | 72.51 |
| MMBenchEN-Dev | 76.37 | 74.57 |
| SeedBench | 68.20 | 66.80 |
| SeedBench-Img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| POPE | 88.69 | 88.83 |
| ChartQA | 67.84 | 66.52 |
| DocVQA-Val | 76.46 | 75.21 |
| TextVQA-Val | 61.69 | 62.47 |
| OCRBench | 531 | 525 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |

B. Linear Probe Evaluation Results

A linear probe freezes the pre-trained model’s weights and trains a linear classifier on top, which measures how well the model’s representations generalize to different tasks. The tables below report ImageNet linear probe accuracy for the released MLCD models, followed by a per-dataset comparison of CLIP and MLCD on the ViT_L_14_336px architecture.
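
The linear probe idea can be illustrated with a minimal, self-contained sketch. This is not the evaluation code behind the reported numbers: `frozen_encoder` is a hypothetical stand-in for a pre-trained vision tower, the data is synthetic, and the choice of logistic regression trained by full-batch gradient descent is an assumption made for illustration. The point it shows is that only the linear classifier's weights are updated, while the feature extractor stays fixed.

```python
import math
import random

random.seed(0)  # make the toy example reproducible


def frozen_encoder(x):
    """Hypothetical stand-in for a frozen pre-trained encoder.

    It is a fixed feature map with no trainable parameters; nothing in the
    training loop below modifies it.
    """
    x1, x2 = x
    return [x1, x2, x1 * x2, 1.0]  # last entry acts as a bias feature


def make_dataset(n):
    """Synthetic two-class data: label 1 when the point lies above x2 = x1."""
    data = []
    for _ in range(n):
        x = (random.uniform(-1, 1), random.uniform(-1, 1))
        data.append((x, 1 if x[1] > x[0] else 0))
    return data


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe on pre-computed frozen features."""
    w = [0.0] * len(features[0])
    n = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for f, y in zip(features, labels):
            p = sigmoid(score(w, f))
            for j, fj in enumerate(f):
                grad[j] += (p - y) * fj / n
        # Only the probe weights are updated; the encoder is untouched.
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def accuracy(w, features, labels):
    correct = sum((score(w, f) > 0) == bool(y) for f, y in zip(features, labels))
    return correct / len(labels)


train, test = make_dataset(200), make_dataset(100)
train_feats = [frozen_encoder(x) for x, _ in train]
test_feats = [frozen_encoder(x) for x, _ in test]

probe_weights = train_linear_probe(train_feats, [y for _, y in train])
probe_accuracy = accuracy(probe_weights, test_feats, [y for _, y in test])
print(f"linear probe accuracy: {probe_accuracy:.2%}")
```

In a real evaluation, the features would come from the frozen vision model applied to each dataset's images, and the probe's held-out accuracy is the number reported per dataset.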

The results of the ImageNet linear probe are as follows:

| Model Name | ImageNet Linear Probe | Hugging Face |
|---|---|---|
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| ImageNet | 86.10 | 85.40 |
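
To summarize the per-dataset comparison, the short script below computes the accuracy gain of MLCD over CLIP for each dataset, using the values from the table above. It is only a convenience sketch for reading the table; the variable names are illustrative, and the unweighted mean over datasets is an assumption about how one might aggregate the results, not a metric reported by the authors.

```python
# Linear probe accuracies (%) from the per-dataset table: (MLCD, CLIP).
results = {
    "Food101": (96.21, 95.90),
    "CIFAR-10": (99.36, 97.90),
    "CIFAR-100": (93.69, 87.40),
    "Birdsnap": (88.18, 79.90),
    "SUN397": (87.96, 82.20),
    "Stanford Cars": (95.16, 91.50),
    "FGVC Aircraft": (86.38, 71.60),
    "Describable Textures Dataset": (86.70, 83.00),
    "Oxford-IIIT Pets": (96.27, 95.10),
    "Caltech-101": (97.92, 96.00),
    "Flowers102": (99.58, 99.20),
    "ImageNet": (86.10, 85.40),
}

# Gain of MLCD over CLIP in percentage points, rounded to two decimals.
gains = {name: round(mlcd - clip, 2) for name, (mlcd, clip) in results.items()}

mean_gain = sum(gains.values()) / len(gains)  # unweighted mean over datasets
best_dataset = max(gains, key=gains.get)      # dataset with the largest gain

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} +{gain:.2f}")
print(f"mean gain: {mean_gain:.2f} points; largest on {best_dataset}")
```

MLCD is ahead on all 12 datasets, with an unweighted mean gain of about 4.03 points; the largest gap is on FGVC Aircraft (14.78 points) and the smallest on Food101 (0.31 points).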

Convert PyTorch Checkpoint to Hugging Face Format

python convert_vit_bigG_14_rope2d_to_hf.py \
--pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
--checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
--image_size 336