[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unicom-universal-and-compact-representation/image-retrieval-on-google-landmarks-dataset)](https://paperswithcode.com/sota/image-retrieval-on-google-landmarks-dataset?p=unicom-universal-and-compact-representation)

The unicom model was pre-trained on [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/). A model trained on LAION-2B will be released in the future.

## Usage
First, install PyTorch 2.0 (or later) and torchvision, along with a few small dependencies, then clone this repo and import `unicom` from inside it.
On a CUDA GPU machine, the following will do the trick:

```shell
pip install torch torchvision
pip install tqdm timm

git clone https://github.com/deepglint/unicom
cd unicom
python
>>> import unicom
>>> unicom.available_models()
['ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']
>>> unicom.load('ViT-B/32')
  1%|▍                                      | 4.53M/385M [00:27<50:34, 132kiB/s]
```


### API

The unicom module provides the following methods:

#### `unicom.available_models()`

Returns the names of the available unicom models.

#### `unicom.load(name)`

Returns the model and the torchvision transform it requires. `name` must be one of the names returned by `unicom.available_models()`. The model is downloaded when necessary.
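
The two calls above can be combined into a small guard that checks a name before loading it. This is a minimal sketch, not part of the package: `load_unicom` and its `unicom_module` parameter are hypothetical names introduced here, and the `(model, transform)` return order is an assumption based on the description of `unicom.load`.

```python
def load_unicom(name, unicom_module=None):
    """Validate `name` against available_models(), then load it.

    `unicom_module` is a hypothetical hook so the helper can be exercised
    with a stand-in object; by default the real package is imported.
    """
    if unicom_module is None:
        import unicom as unicom_module  # run from inside the cloned repo

    names = unicom_module.available_models()
    if name not in names:
        raise ValueError(f"unknown model {name!r}; choose one of {names}")

    # Assumption: load() returns the pair in (model, transform) order.
    model, transform = unicom_module.load(name)
    return model, transform


# Typical call, from inside the cloned repo:
#   model, transform = load_unicom("ViT-B/32")
```

Failing early on an unknown name avoids starting a download for a model that does not exist.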

## Results and Evaluation

### Transfer-Learning Results on ImageNet1K

| Dataset    | ViT-B/32@384px | ViT-B/16@384px | ViT-L/14@518px |
| ---------- | -------------- | -------------- | -------------- |
| ImageNet1K | 83.6           | 85.9           | 88.3           |

### KNN Results on ImageNet1K

| Dataset    | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
| ---------- | -------- | -------- | -------- | -------------- |
| ImageNet1K | 74.5     | 78.8     | 81.2     | 81.6           |


### Supervised Image Retrieval Results

| Dataset     | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
| ----------- | -------- | -------- | -------- | -------------- |
| SOP         | 87.1     | 88.8     | 89.9     | 91.2           |
| In-Shop     | 94.8     | 95.5     | 96.0     | 96.7           |
| iNaturalist | 72.8     | 82.5     | 85.4     | 88.9           |

### Zero-Shot Image Retrieval Results

| Dataset     | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
| ----------- | -------- | -------- | -------- | -------------- |
| CUB         | 83.7     | 86.5     | 88.5     | 89.2           |
| Cars        | 95.9     | 96.8     | 96.9     | 97.3           |
| SOP         | 70.0     | 70.4     | 72.7     | 74.5           |
| In-Shop     | 72.8     | 74.6     | 83.6     | 86.7           |
| iNaturalist | 64.6     | 73.6     | 77.1     | 81.0           |


### Evaluating Image Retrieval
Zero-shot evaluation on the CUB dataset with a single GPU:

```shell
torchrun retrieval.py --eval --dataset cub --model_name ViT-B/32
```

Zero-shot evaluation on the CUB dataset with 8 GPUs:

```shell
torchrun --nproc_per_node 8 retrieval.py --eval --dataset cub --model_name ViT-B/32
```
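
The retrieval tables above do not name their metric. Assuming it is Recall@1, a common choice for these benchmarks, the idea can be sketched in plain Python as follows. This is an illustration only, not the code in `retrieval.py`: `recall_at_1` is a hypothetical helper, and it assumes queries and gallery are the same set, which does not hold for every dataset.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_1(embeddings, labels):
    """Fraction of items whose nearest *other* item shares their label."""
    feats = [l2_normalize(v) for v in embeddings]
    hits = 0
    for i, query in enumerate(feats):
        best_j, best_sim = None, -math.inf
        for j, cand in enumerate(feats):
            if j == i:
                continue  # a query never retrieves itself
            sim = sum(a * b for a, b in zip(query, cand))
            if sim > best_sim:
                best_j, best_sim = j, sim
        if labels[best_j] == labels[i]:
            hits += 1
    return hits / len(feats)


# Toy embeddings: two well-separated classes.
toy_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(recall_at_1(toy_feats, toy_labels))
```

In practice the embeddings would come from the loaded model, and the pairwise similarities would be computed as a matrix product on the GPU rather than in nested loops.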

### Evaluating KNN
```shell
torchrun --nproc_per_node 8 knn.py --train-dataset /imagenet/train/ --val-dataset /imagenet/val/ --num-workers 4 --model-name ViT-B/32
```
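
For readers unfamiliar with KNN evaluation, the sketch below shows the basic idea: each validation embedding is labeled by a vote among its nearest training embeddings. It is a simplified illustration under assumptions, not the logic of `knn.py`. The helper names, the unweighted majority vote, and the default `k=3` are all choices made here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_predict(train_feats, train_labels, query, k=3):
    """Return the majority label among the k most similar training items."""
    sims = sorted(
        ((cosine(query, f), lbl) for f, lbl in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(lbl for _, lbl in sims[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=3):
    """Fraction of validation items whose predicted label is correct."""
    correct = sum(
        knn_predict(train_feats, train_labels, q, k) == y
        for q, y in zip(val_feats, val_labels)
    )
    return correct / len(val_labels)


# Toy data standing in for frozen-model embeddings.
train_x = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 1.0]]
train_y = ["cat", "cat", "dog", "dog"]
print(knn_accuracy(train_x, train_y, [[1.0, 0.1], [0.05, 1.0]], ["cat", "dog"]))
```

Because the model stays frozen and no classifier is trained, this kind of evaluation measures the quality of the embeddings directly.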

## Visualization of Zero-Shot Retrieval

#### 1. **Food-101**
![image](../_static/images/vis_food101.jpg)
#### 2. **Describable Textures Dataset**
![image](../_static/images/vis_dtd.jpg)


## GoogleLandmark

### GoogleLandmark Dataset Performance


| Model                 | Public | Private | Google Drive |
| :-------------------- | ------ | ------- | ------------ |
| UNICOM-ViT-B/16@512px | 32.4   | 35.7    | [Click Me](https://drive.google.com/file/d/1Vddx3ITUfscXopwcVQGOVESAmcp6M_8t/view?usp=sharing) |
| UNICOM-ViT-L/14@512px | 33.1   | 36.4    | [Click Me](https://drive.google.com/file/d/1XCIGmEi6LxGclXuNw3wS_XZlkNSlSQW7/view?usp=sharing) |


### Training Instructions

Training the ViT-L/14 model on the GoogleLandmark dataset requires an NVIDIA A100 GPU with 80GB of memory and PyTorch 2.0 or later. Follow these steps:

1. **Download the dataset.** Obtain the GoogleLandmark dataset and store it in a directory accessible to your training environment.
2. **Create the rec package.** Convert the dataset into a format suitable for training with the following commands, replacing `GLDv2_PATH` with the actual path to your dataset:

```shell
python convert_google_landmark2dali.py GLDv2_PATH/train_clean.csv train.lst
python -m mxnet.tools.im2rec --quality 100 --num-thread 32 --resize 672 train.lst GLDv2_PATH
```

The first command generates a list file (`train.lst`) from the CSV file that describes the dataset.
The second command packs the images into RecordIO format at JPEG quality 100, resizing the shorter edge to 672 pixels and using 32 threads for efficiency.
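
To make the list-file step concrete, the sketch below writes the tab-separated `index<TAB>label<TAB>path` lines that `im2rec` reads. It is not the repo's `convert_google_landmark2dali.py`, whose behavior may differ. The `landmark_id` and `images` column names, the `train/<c0>/<c1>/<c2>/<id>.jpg` directory layout, and the remapping of landmark ids to contiguous labels are all assumptions, and `csv_to_lst` is a hypothetical name.

```python
import csv


def csv_to_lst(csv_path, lst_path):
    """Write one `index<TAB>label<TAB>relative_path` line per image.

    Returns the number of lines written.
    """
    with open(csv_path, newline="") as src:
        rows = list(csv.DictReader(src))

    # Assumption: map raw landmark ids to contiguous class labels 0..N-1.
    landmark_ids = sorted({int(r["landmark_id"]) for r in rows})
    label_of = {lid: i for i, lid in enumerate(landmark_ids)}

    count = 0
    with open(lst_path, "w") as dst:
        for row in rows:
            label = label_of[int(row["landmark_id"])]
            # Assumption: `images` holds space-separated image ids.
            for image_id in row["images"].split():
                # Assumed layout: train/<c0>/<c1>/<c2>/<image_id>.jpg
                rel = (
                    f"train/{image_id[0]}/{image_id[1]}/{image_id[2]}"
                    f"/{image_id}.jpg"
                )
                dst.write(f"{count}\t{label}\t{rel}\n")
                count += 1
    return count
```

The paths are written relative to the dataset root, which matches passing `GLDv2_PATH` as the root argument to `im2rec`.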

After preparing the dataset, start training with the following command:

```shell
torchrun --nproc_per_node 8 finetune_GLDv2.py
```

## Citation

```bibtex
@inproceedings{anxiang_2024_mlcd,
  title={Multi-label Cluster Discrimination for Visual Representation Learning},
  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
  booktitle={ECCV},
  year={2024}
}
@inproceedings{anxiang_2023_unicom,
  title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
  author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
  booktitle={ICLR},
  year={2023}
}
@inproceedings{anxiang_2022_partialfc,
  title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
  author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
  booktitle={CVPR},
  year={2022}
}
@inproceedings{deng_2019_arcface,
  title={Arcface: Additive angular margin loss for deep face recognition},
  author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  booktitle={CVPR},
  year={2019}
}