
The unicom model was pre-trained on LAION-400M. A model trained on LAION-2B will be released in the future.

Usage

First, install PyTorch 2.0 (or later) and torchvision, plus a few small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

pip install torch torchvision
pip install tqdm timm


git clone https://github.com/deepglint/unicom
cd unicom
python
>>> import unicom
>>> unicom.available_models()
['ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']
>>> unicom.load('ViT-B/32')
  1%|                                      | 4.53M/385M [00:27<50:34, 132kiB/s]

API

The unicom module provides the following methods:

unicom.available_models()

Returns the names of the available unicom models.

unicom.load(name)

Returns the model identified by name, together with the TorchVision transform that the model requires. The name must be one of those returned by unicom.available_models(). The model weights are downloaded if necessary.

Results and Evaluation

Result Transfer-Learning on ImageNet1K

| Dataset | ViT-B/32@384px | ViT-B/16@384px | ViT-L/14@518px |
|---|---|---|---|
| ImageNet1K | 83.6 | 85.9 | 88.3 |

Result KNN on ImageNet1K

| Dataset | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
|---|---|---|---|---|
| ImageNet1K | 74.5 | 78.8 | 81.2 | 81.6 |

Result of Supervised Image Retrieval

| Dataset | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
|---|---|---|---|---|
| SOP | 87.1 | 88.8 | 89.9 | 91.2 |
| In-Shop | 94.8 | 95.5 | 96.0 | 96.7 |
| iNaturalist | 72.8 | 82.5 | 85.4 | 88.9 |

Result of Zero-Shot Image Retrieval

| Dataset | ViT-B/32 | ViT-B/16 | ViT-L/14 | ViT-L/14@336px |
|---|---|---|---|---|
| CUB | 83.7 | 86.5 | 88.5 | 89.2 |
| Cars | 95.9 | 96.8 | 96.9 | 97.3 |
| SOP | 70.0 | 70.4 | 72.7 | 74.5 |
| In-Shop | 72.8 | 74.6 | 83.6 | 86.7 |
| iNaturalist | 64.6 | 73.6 | 77.1 | 81.0 |

Eval Image Retrieval

Zero-shot evaluation on the CUB dataset with a single GPU:

torchrun retrieval.py --eval --dataset cub --model_name ViT-B/32

Zero-shot evaluation on the CUB dataset with 8 GPUs:

torchrun --nproc_per_node 8 retrieval.py --eval --dataset cub --model_name ViT-B/32
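The tables above do not name the retrieval metric. Benchmarks such as CUB, Cars, SOP and In-Shop are commonly scored with Recall@K, so the following is a minimal sketch of that metric under that assumption. It is not the code in retrieval.py: the function names and the toy 2-D embeddings are hypothetical, and plain Python lists stand in for the features a unicom model would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries whose k nearest neighbours (self excluded)
    contain at least one item with the same label."""
    feats = [l2_normalize(e) for e in embeddings]
    hits = 0
    for i, query in enumerate(feats):
        # Cosine similarity to every other item; the query itself is skipped.
        scored = [
            (sum(a * b for a, b in zip(query, other)), j)
            for j, other in enumerate(feats)
            if j != i
        ]
        scored.sort(reverse=True)
        top_labels = [labels[j] for _, j in scored[:k]]
        hits += labels[i] in top_labels
    return hits / len(feats)


# Hypothetical toy data: two "bird" items and two "car" items.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.9, 0.8]]
toy_labels = ["bird", "bird", "car", "car"]

print(recall_at_k(toy_embeddings, toy_labels, k=1))  # 0.75
```

In the toy data the last "car" item sits closer to the "bird" items, so one of the four queries misses at K=1.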

Eval KNN

torchrun --nproc_per_node 8 knn.py --train-dataset /imagenet/train/ --val-dataset /imagenet/val/ --num-workers 4 --model-name ViT-B/32
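As a rough illustration of what a KNN evaluation measures, the sketch below classifies each validation feature by majority vote among its k most similar training features. This is an assumption-laden toy, not the implementation in knn.py: the function names, the unweighted vote, the value of k and the 2-D features are all hypothetical, and the real script may weight votes or choose k differently.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=3):
    """Majority vote over the k training features most similar to the query."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(query, train_feats[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(val_feats, val_labels, train_feats, train_labels, k=3):
    """Share of validation features whose predicted label is correct."""
    correct = sum(
        knn_predict(q, train_feats, train_labels, k) == y
        for q, y in zip(val_feats, val_labels)
    )
    return correct / len(val_labels)


# Hypothetical toy features standing in for model embeddings.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
val_feats = [[0.95, 0.05], [0.05, 0.95]]
val_labels = ["cat", "dog"]

print(knn_accuracy(val_feats, val_labels, train_feats, train_labels, k=3))
```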

Vis ZeroShot Retrieval

1. Food-101

[Image: zero-shot retrieval visualization on Food-101]

2. Describable Textures Dataset

[Image: zero-shot retrieval visualization on the Describable Textures Dataset]

GoogleLandmark

GoogleLandmark Dataset Performance

| Model | Public | Private | Google Drive |
|---|---|---|---|
| UNICOM-ViT-B/16@512px | 32.4 | 35.7 | Click Me |
| UNICOM-ViT-L/14@512px | 33.1 | 36.4 | Click Me |

Training Instructions

To train the ViT-L/14 model on the GoogleLandmark dataset, you need access to an NVIDIA A100 GPU with 80GB of memory and PyTorch 2.0 or later. Follow these steps:

1. Download the Dataset: Obtain the GoogleLandmark dataset and ensure it is stored in a directory accessible to your training environment.
2. Create the Rec Package: Use the following commands to convert the dataset into a format suitable for training. Replace GLDv2_PATH with the actual path to your dataset:

python convert_google_landmark2dali.py GLDv2_PATH/train_clean.csv train.lst
python -m mxnet.tools.im2rec --quality 100 --num-thread 32 --resize 672 train.lst GLDv2_PATH

The first command generates a list file (train.lst) from the CSV file that describes the dataset. The second command packs the listed images into RecordIO format, using 32 threads, encoding at quality 100 and resizing each image so that its shorter edge is 672 pixels.
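To make the list-file step more concrete, here is a minimal sketch of how a CSV could be turned into the tab-separated lines that im2rec reads (index, label, relative image path). It is not the code in convert_google_landmark2dali.py. It assumes the CSV has the columns landmark_id and images (space-separated image ids), that images live under train/<c0>/<c1>/<c2>/<image_id>.jpg, and that raw landmark ids are remapped to contiguous class indices; the function name and the sample rows are hypothetical.

```python
import csv
import io


def csv_to_lst_lines(csv_text):
    """Build im2rec list lines of the form index<TAB>label<TAB>path."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Assumption: remap raw landmark ids to contiguous class indices 0..C-1.
    landmark_ids = sorted({int(row["landmark_id"]) for row in rows})
    class_index = {lid: i for i, lid in enumerate(landmark_ids)}

    lines = []
    for row in rows:
        label = class_index[int(row["landmark_id"])]
        for image_id in row["images"].split():
            # Assumed directory layout: first three characters of the id
            # select nested subdirectories.
            path = f"train/{image_id[0]}/{image_id[1]}/{image_id[2]}/{image_id}.jpg"
            lines.append(f"{len(lines)}\t{label}\t{path}")
    return lines


# Hypothetical sample with two landmarks and three images.
sample_csv = "landmark_id,images\n7,abc123 abd456\n42,f00ba4\n"
lst_lines = csv_to_lst_lines(sample_csv)
for line in lst_lines:
    print(line)
```

The paths are relative, which matches how the second command above receives GLDv2_PATH as the image root.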

After preparing the dataset, you can start training the model with the following command:

torchrun --nproc_per_node 8 finetune_GLDv2.py

Citation

@inproceedings{anxiang_2024_mlcd,
  title={Multi-label Cluster Discrimination for Visual Representation Learning},
  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
  booktitle={ECCV},
  year={2024}
}
@inproceedings{anxiang_2023_unicom,
  title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
  author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
  booktitle={ICLR},
  year={2023}
}
@inproceedings{anxiang_2022_partialfc,
  title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
  author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
  booktitle={CVPR},
  year={2022}
}
@inproceedings{deng_2019_arcface,
  title={Arcface: Additive angular margin loss for deep face recognition},
  author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  booktitle={CVPR},
  year={2019}
}