Unofficial code for the paper "Masked Frequency Modeling for Self-Supervised Visual Pre-Training" (https://arxiv.org/pdf/2206.07706.pdf)
Below are experiments with ResNet-50. Although a better result than the paper's is reached, the from-scratch baseline here is also much higher than the one reported in the paper.
| | top-1 acc | pretrain | finetune |
|---|---|---|---|
| paper scratch | 78.1 | - | - |
| paper mfm pretrain | 78.5 | - | - |
| scratch | 78.542 | - | link |
| supervised pretrain | 78.942 | - | link |
| mfm pretrain | 78.826 | link | link |
Note: "supervised pretrain" means finetuning from the torchvision ResNet-50 weights (obtained by setting `pretrained=True`). Notably, supervised pretraining turns out better than the proposed mfm pretraining here.
- pytorch 1.13.1
- torchvision 0.14.1
- dali 1.21.0
- cuda 11.6
- V100 GPU (32G) x 8
- driver: 470.82.01
Prepare the ImageNet train and val sets in the same way as the pytorch official classification example, then symlink them into this repo's folder:
```shell
$ mkdir -p imagenet
$ ln -s /path/to/imagenet/train ./imagenet/train
$ ln -s /path/to/imagenet/val ./imagenet/val
```
The pretraining and finetuning commands are here.
Here are some points that affect the results:
- **finetune `--val-resize-size`**

  When evaluating the model after finetuning, we always resize the short side of the image to a fixed value before the center-crop operation. I find that the choice of this fixed short-side size can sometimes change the accuracy by a noticeable margin. Taking "supervised pretrain" as an example:

  | val-resize-size | 234 | 235 | 236 |
  |---|---|---|---|
  | top-1 acc | 78.856 | 78.942 | 78.794 |
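The short-side-resize-then-center-crop geometry that `--val-resize-size` controls can be sketched in plain Python (this mirrors torchvision's `Resize`/`CenterCrop` semantics; `resize_then_center_crop` is a hypothetical helper for illustration, not code from this repo):

```python
# Sketch of eval-time preprocessing geometry, assuming torchvision-style
# semantics: resize the short side to `resize_size`, then center-crop 224x224.
# A larger resize_size keeps slightly more context around the final crop.

def resize_then_center_crop(w, h, resize_size, crop_size=224):
    """Return the crop box (left, top, right, bottom) in resized coordinates."""
    # Resize so the short side equals resize_size, preserving aspect ratio.
    if w <= h:
        new_w, new_h = resize_size, round(h * resize_size / w)
    else:
        new_w, new_h = round(w * resize_size / h), resize_size
    # Center-crop a crop_size x crop_size window from the resized image.
    left = (new_w - crop_size) // 2
    top = (new_h - crop_size) // 2
    return left, top, left + crop_size, top + crop_size

# A landscape 500x375 image with --val-resize-size 235:
print(resize_then_center_crop(500, 375, 235))  # -> (44, 5, 268, 229)
```

Changing 235 to 234 or 236 shifts this box by only a pixel or two, which is why it is surprising that the final accuracy moves as much as it does.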
- **finetune with BCE loss is important**

  This shows up when finetuning from scratch with CE (cross entropy) loss versus BCE (binary cross entropy) loss:

  | loss | CE | BCE |
  |---|---|---|
  | top-1 acc | 78.542 | 78.952 |
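The difference between the two losses can be sketched stdlib-only on a single 3-class example (assuming BCE here means per-class sigmoids against a one-hot target, as `torch.nn.BCEWithLogitsLoss` computes; the helper names are hypothetical):

```python
import math

# CE applies a softmax over classes; BCE treats each class as an
# independent sigmoid against a 0/1 target, which changes the gradient
# each logit receives during finetuning.

def ce_loss(logits, target):
    # Cross entropy: log-sum-exp(logits) - logits[target]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def bce_loss(logits, target):
    # Mean over classes of binary cross entropy vs. a one-hot target.
    total = 0.0
    for i, l in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-l))
        y = 1.0 if i == target else 0.0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

logits, target = [2.0, 0.5, -1.0], 0
print(round(ce_loss(logits, target), 4))   # ~0.2413
print(round(bce_loss(logits, target), 4))  # ~0.4714
```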
- **pretrain random crop area**

  We usually crop a region with a certain area ratio from the original image; the default range of this ratio is 0.08-1.0 with torchvision's `RandomResizedCrop`. Different self-supervised learning methods tend to prefer different random area ratios. For example, MAE uses 0.2-1.0, MAE3d uses 0.5-1.0, and SimMIM uses 0.67-1.0. Here I find a smaller lower bound is better:

  | random area ratio | 0.67-1.0 | 0.2-1.0 | 0.1-1.0 |
  |---|---|---|---|
  | top-1 acc | 78.770 | 78.826 | 78.842 |

  Though 0.1-1.0 is better than 0.2-1.0 here, I still use the latter, since with 0.1-1.0 the finetuning eval result is more affected by `--val-resize-size`:

  | val-resize-size | 234 | 235 | 236 |
  |---|---|---|---|
  | 0.2-1.0 | 78.816 | 78.826 | 78.796 |
  | 0.1-1.0 | 78.730 | 78.842 | 78.738 |
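How the area-ratio bound feeds the crop can be sketched with a stdlib-only version of `RandomResizedCrop`'s parameter sampling (this follows torchvision's rejection-sampling scheme in spirit; `sample_crop` is a hypothetical helper, not code from this repo or torchvision):

```python
import math
import random

# Each call samples a target area in [scale[0], scale[1]] * original_area
# and a log-uniform aspect ratio, then derives crop width/height,
# retrying when the crop does not fit. scale=(0.2, 1.0) is the lower
# bound this repo settles on for pretraining.

def sample_crop(w, h, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    area = w * h
    for _ in range(10):  # retry like torchvision, then fall back
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        cw = round(math.sqrt(target_area * aspect))
        ch = round(math.sqrt(target_area / aspect))
        if 0 < cw <= w and 0 < ch <= h:
            return cw, ch
    return min(w, h), min(w, h)  # central square fallback

rng = random.Random(0)
crops = [sample_crop(224, 224, scale=(0.2, 1.0), rng=rng) for _ in range(3)]
print(crops)
```

Lowering the bound from 0.67 to 0.2 means the smallest crops cover only a fifth of the image, giving the pretraining task more aggressive scale augmentation.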
- **model variance**

  I pretrained the model 4 times (2 runs on 8 V100 GPUs and 2 on 8 P40 GPUs) with identical configurations, then finetuned each pretrained model 3 times (with 8 P40s). The results, listed below, vary by a big margin, so the good results above may partly come down to luck; I cannot yet claim to have certainly reproduced the results in the paper. The overall mean/std across all 12 runs is 78.621/0.08.

  | pretrain | finetune | acc1 (235) | mean/std |
  |---|---|---|---|
  | round 1 | round 1 | 78.654 | 78.644/0.024 |
  | round 1 | round 2 | 78.610 | |
  | round 1 | round 3 | 78.668 | |
  | round 2 | round 1 | 78.646 | 78.642/0.122 |
  | round 2 | round 2 | 78.790 | |
  | round 2 | round 3 | 78.490 | |
  | round 3 | round 1 | 78.516 | 78.612/0.073 |
  | round 3 | round 2 | 78.626 | |
  | round 3 | round 3 | 78.694 | |
  | round 4 | round 1 | 78.608 | 78.584/0.080 |
  | round 4 | round 2 | 78.668 | |
  | round 4 | round 3 | 78.476 | |
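The per-pretrain mean/std column can be reproduced from the raw accuracies with the stdlib (the std values appear to be the population std, truncated rather than rounded to three decimals):

```python
import statistics

# Raw finetune accuracies per pretrain run, copied from the table above.
runs = {
    "round 1": [78.654, 78.610, 78.668],
    "round 2": [78.646, 78.790, 78.490],
    "round 3": [78.516, 78.626, 78.694],
    "round 4": [78.608, 78.668, 78.476],
}

for name, accs in runs.items():
    # pstdev = population std; e.g. round 1 gives ~0.0247, shown as 0.024.
    print(name, round(statistics.mean(accs), 3), round(statistics.pstdev(accs), 4))

all_accs = [a for accs in runs.values() for a in accs]
print("overall", round(statistics.mean(all_accs), 3))
```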