[![Arxiv](https://img.shields.io/badge/arXiv-2407.17331-red)](https://arxiv.org/abs/2407.17331) [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-label-cluster-discrimination-for-visual/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=multi-label-cluster-discrimination-for-visual)


### Performance


#### A. MLLMs Evaluation Results
To evaluate MLCD within multimodal large language models (MLLMs), we replaced the CLIP vision tower in LLaVA-NeXT with MLCD and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and LLaVA-NeXT-Data for structured fine-tuning. In the first table, MLCD (ViT-bigG-14-448px) achieves the best score on ChartQA, DocVQA, InfoVQA, and OCRBench, while DFN5B leads on MMMU. In the second table, which compares MLCD and CLIP on the same ViT-L-14-336px architecture, MLCD scores higher on 15 of 17 benchmarks; CLIP leads only on POPE and TextVQA-Val.


| Vision Tower                                                                                  | RoPE2D | ChartQA   | DocVQA    | InfoVQA   | OCRBench   | MMMU      |
| :-------------------------------------------------------------------------------------------- | :----: | :-------- | :-------- | :-------- | :--------- | :-------- |
| CLIP (ViT-L-14-336px)                                                                         |   ×    | 66.52     | 75.21     | 38.88     | 525.00     | 44.20     |
| SigLIP (ViT-SO400M-384px)                                                                     |   ×    | 69.28     | 76.71     | 41.38     | 554.00     | 46.78     |
| DFN5B (ViT-H-14-378px)                                                                        |   ×    | 64.36     | 70.87     | 38.59     | 473.00     | **48.00** |
| **[HF:MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)**   |   ×    | 67.84     | 76.46     | 43.48     | 531.00     | 44.30     |
| **[HF:MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)** |   √    | 71.07     | 79.63     | 44.38     | 572.00     | 46.78     |
| **[HF:MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)** |   √    | **73.80** | **83.34** | **46.59** | **582.00** | 46.00     |


| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :-------------- | :-------------------------------------------------------------------------------------- | :-------------------- |
| LLM             | Qwen2.5-7B                                                                              | Qwen2.5-7B            |
| AI2D            | **76.98**                                                                               | 73.15                 |
| GQA             | **64.17**                                                                               | 63.31                 |
| ScienceQA-Img   | **78.09**                                                                               | 76.35                 |
| InfoVQA-Val     | **43.48**                                                                               | 38.88                 |
| MMBenchCN-Dev   | **74.83**                                                                               | 72.51                 |
| MMBenchEN-Dev   | **76.37**                                                                               | 74.57                 |
| SeedBench       | **68.20**                                                                               | 66.80                 |
| SeedBench-Img   | **73.75**                                                                               | 72.72                 |
| MMStar          | **50.98**                                                                               | 48.98                 |
| MMMU            | **44.30**                                                                               | 44.20                 |
| POPE            | 88.69                                                                                   | **88.83**             |
| ChartQA         | **67.84**                                                                               | 66.52                 |
| DocVQA-Val      | **76.46**                                                                               | 75.21                 |
| TextVQA-Val     | 61.69                                                                                   | **62.47**             |
| OCRBench        | **531**                                                                                 | 525                   |
| MME(cognition)  | **432**                                                                                 | 384                   |
| MME(perception) | **1598**                                                                                | 1512                  |


#### B. Linear Probe Evaluation Results
A linear probe freezes the pre-trained model's weights and trains a linear classifier on top of the frozen features, which measures how well the learned representations generalize to different tasks. The tables below report ImageNet linear probe accuracy for several MLCD models, followed by a comparison of MLCD and CLIP on the ViT_L_14_336px architecture across various datasets.
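
The linear probe protocol can be illustrated with a minimal sketch. This is not the evaluation code behind the reported numbers: `frozen_encoder` is a hypothetical stand-in for a frozen vision tower, the data is synthetic, and the classifier is a plain softmax regression trained with gradient descent in NumPy. All function names and hyperparameters here are assumptions chosen for illustration.

```python
import numpy as np

def frozen_encoder(images, weights):
    """Hypothetical stand-in for a frozen vision tower (fixed projection + tanh)."""
    return np.tanh(images @ weights)

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax classifier on fixed features with full-batch gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * features.T @ grad                  # only the probe is updated
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)

# Synthetic "images": two Gaussian blobs in 32 dimensions (illustrative only).
per_class = 200
images = np.concatenate([
    rng.normal(-1.0, 1.0, size=(per_class, 32)),
    rng.normal(+1.0, 1.0, size=(per_class, 32)),
])
labels = np.concatenate([np.zeros(per_class, dtype=int), np.ones(per_class, dtype=int)])

# Shuffle and split into train / held-out sets.
order = rng.permutation(len(labels))
train_idx, test_idx = order[:300], order[300:]

# The encoder weights are created once and never touched by training.
encoder_weights = rng.normal(size=(32, 16)) / np.sqrt(32)
encoder_snapshot = encoder_weights.copy()

train_feats = frozen_encoder(images[train_idx], encoder_weights)
test_feats = frozen_encoder(images[test_idx], encoder_weights)

W, b = train_linear_probe(train_feats, labels[train_idx], num_classes=2)
test_acc = probe_accuracy(test_feats, labels[test_idx], W, b)
print(f"held-out linear probe accuracy: {test_acc:.3f}")
```

Because only `W` and `b` receive gradient updates, the held-out accuracy reflects the quality of the frozen features and not any adaptation of the encoder itself.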


The results of the ImageNet linear probe are as follows:

| Model Name             | ImageNet Linear Probe | Hugging Face                                                                               |
| :--------------------- | :-------------------: | :----------------------------------------------------------------------------------------- |
| MLCD-ViT-B-32-224px    |         79.1          | [HF:MLCD-ViT-B-32-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224)    |
| MLCD-ViT-L-14-336px    |         86.3          | [HF:MLCD-ViT-L-14-336px](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)   |
| MLCD-ViT-bigG-14-224px |         87.1          | [HF:MLCD-ViT-bigG-14-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-224) |


| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :--------------------------- | :-------------------- | :-------------------- |
| Food101                      | **96.21**             | 95.90                 |
| CIFAR-10                     | **99.36**             | 97.90                 |
| CIFAR-100                    | **93.69**             | 87.40                 |
| Birdsnap                     | **88.18**             | 79.90                 |
| SUN397                       | **87.96**             | 82.20                 |
| Stanford Cars                | **95.16**             | 91.50                 |
| FGVC Aircraft                | **86.38**             | 71.60                 |
| Describable Textures Dataset | **86.70**             | 83.00                 |
| Oxford-IIIT Pets             | **96.27**             | 95.10                 |
| Caltech-101                  | **97.92**             | 96.00                 |
| Flowers102                   | **99.58**             | 99.20                 |
| ImageNet                     | **86.10**             | 85.40                 |


### Convert PyTorch Checkpoint to Hugging Face Format

```shell
python convert_vit_bigG_14_rope2d_to_hf.py \
    --pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
    --checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
    --image_size 336
```
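
After the conversion, a quick sanity check of the output folder can catch an incomplete dump. The sketch below assumes the conversion script saves the model through Hugging Face `save_pretrained`, which writes `config.json` plus a weights file (or a shard index for large models); the script's actual output layout is not confirmed here, and `missing_hf_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: weights are saved via `save_pretrained`, either as a single file
# or as shards accompanied by an index file.
WEIGHT_FILES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
)

def missing_hf_files(folder):
    """Return descriptions of expected files that are absent from `folder`."""
    folder = Path(folder)
    missing = []
    if not (folder / "config.json").is_file():
        missing.append("config.json")
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append("weights file (one of: " + ", ".join(WEIGHT_FILES) + ")")
    return missing

if __name__ == "__main__":
    # Folder name taken from --pytorch_dump_folder_path in the command above.
    problems = missing_hf_files("mlcd-vit-bigG-patch14-336")
    print("OK" if not problems else f"Missing: {problems}")
```

An empty list means the folder contains the minimum a Hugging Face loader would look for; it does not verify that the weights themselves are correct.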