code for paper "Masked Frequency Modeling for Self-Supervised Visual Pre-Training" (https://arxiv.org/pdf/2206.07706.pdf)

CoinCheung/MFM

MFM

Unofficial code for the paper "Masked Frequency Modeling for Self-Supervised Visual Pre-Training" (https://arxiv.org/pdf/2206.07706.pdf)

Below are experiments with resnet50. Although a better result than the paper's is achieved, the from-scratch baseline here is also much higher than the one reported in the paper.

| | top-1 acc | pretrain | finetune |
|---|---|---|---|
| paper scratch | 78.1 | - | - |
| paper mfm pretrain | 78.5 | - | - |
| scratch | 78.542 | - | link |
| supervised pretrain | 78.942 | - | link |
| mfm pretrain | 78.826 | link | link |

Note: "supervised pretrain" means finetuning from the torchvision resnet weights (by setting pretrained=True). In these runs, supervised pretraining gives a better result than the proposed mfm pretraining.

Platform

  • pytorch 1.13.1
  • torchvision 0.14.1
  • dali 1.21.0
  • cuda 11.6
  • V100 GPU(32G) x 8
  • driver: 470.82.01

Dataset

Prepare imagenet (including the val set) in the same way as the pytorch official classification example, and then link the train and val folders into the folder of this repo:

    $ mkdir -p imagenet
    $ ln -s /path/to/imagenet/train ./imagenet/train
    $ ln -s /path/to/imagenet/val ./imagenet/val
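
Before launching a long run, it can help to confirm that the links resolve and that each split contains class sub-folders. The sketch below is only a sanity check, not part of this repo: the helper name `count_class_dirs` is hypothetical, and it assumes the one-folder-per-class layout used by the pytorch official classification example.

```python
from pathlib import Path


def count_class_dirs(root="imagenet"):
    """Return {split: number of class sub-folders} for the train and val splits.

    Assumes each split holds one sub-folder per class (hypothetical helper,
    not part of this repo).
    """
    counts = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        # is_dir() follows symlinks, so a dangling link is reported as missing
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        counts[split] = sum(1 for p in split_dir.iterdir() if p.is_dir())
    return counts


# Example call from the repo root:
#   print(count_class_dirs("imagenet"))
```

For the full ImageNet-1k dataset, both counts should be 1000.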

Train

The pretraining and finetuning commands are here.

More ablations

Here are some points that affect the results:

  1. finetune --val-resize-size
    When evaluating the model after finetuning, the short side of each image is always resized to a fixed value before the center crop. I find that this fixed short-side size sometimes affects the acc by a noticeable margin. Taking "supervised pretrain" as an example:

    | val-resize-size | 234 | 235 | 236 |
    |---|---|---|---|
    | top-1 acc | 78.856 | 78.942 | 78.794 |
  2. finetune with bce loss is important
    This can be seen by finetuning from scratch with CE (cross entropy) loss and with BCE (binary cross entropy) loss. The results are:

    | loss | CE | BCE |
    |---|---|---|
    | top-1 acc | 78.542 | 78.952 |
  3. pretrain random crop area
    During pretraining, a part of the image covering a certain ratio of the original area is randomly cropped. With torchvision RandomResizedCrop, the default range of this area ratio is 0.08-1.0. Different self-supervised learning methods tend to prefer different random area ratios. For example, MAE uses 0.2-1.0, MAE3d uses 0.5-1.0, and SimMIM uses 0.67-1.0. Here I find that a smaller lower bound (0.2-1.0 rather than 0.67-1.0) is better:

    | random area ratio | 0.67-1.0 | 0.2-1.0 | 0.1-1.0 |
    |---|---|---|---|
    | top-1 acc | 78.770 | 78.826 | 78.842 |

    Although 0.1-1.0 scores slightly higher than 0.2-1.0 here, I still use the latter because, with 0.1-1.0, the finetuning eval result is more affected by val-resize-size:

    | val-resize-size | 234 | 235 | 236 |
    |---|---|---|---|
    | 0.2-1.0 | 78.816 | 78.826 | 78.796 |
    | 0.1-1.0 | 78.730 | 78.842 | 78.738 |
  4. model variance
    Here I pretrain the model 4 times (2 on 8 V100 GPUs and 2 on 8 P40 GPUs) with an identical configuration. Then I finetune each pretrained model 3 times (on 8 P40 GPUs). Results are listed below. The results vary by a large margin, so the good results above may simply be due to luck. Hence, I cannot yet say for certain that I have reproduced the results in the paper.

    | pretrain | finetune | acc1(235) | mean/std (per pretrain) | mean/std (all runs) |
    |---|---|---|---|---|
    | round 1 | round 1 | 78.654 | 78.644/0.024 | 78.621/0.08 |
    | | round 2 | 78.61 | | |
    | | round 3 | 78.668 | | |
    | round 2 | round 1 | 78.646 | 78.642/0.122 | |
    | | round 2 | 78.79 | | |
    | | round 3 | 78.49 | | |
    | round 3 | round 1 | 78.516 | 78.612/0.073 | |
    | | round 2 | 78.626 | | |
    | | round 3 | 78.694 | | |
    | round 4 | round 1 | 78.608 | 78.584/0.080 | |
    | | round 2 | 78.668 | | |
    | | round 3 | 78.476 | | |
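
The mean/std figures in the model variance results can be recomputed from the individual acc1(235) values. The sketch below is one way to do that, using only the Python standard library. It assumes the population standard deviation (`statistics.pstdev`): the README does not say which estimator was used, but this choice appears to agree with the listed figures to within the last printed digit. The dictionary keys and the `summarize` helper are names made up for illustration.

```python
from statistics import mean, pstdev

# top-1 acc at val-resize-size 235, copied from the model variance results
acc = {
    "pretrain round 1": [78.654, 78.61, 78.668],
    "pretrain round 2": [78.646, 78.79, 78.49],
    "pretrain round 3": [78.516, 78.626, 78.694],
    "pretrain round 4": [78.608, 78.668, 78.476],
}


def summarize(values):
    """Return (mean, population std) of a list of accuracies.

    Population std is an assumption; the estimator is not stated in the README.
    """
    return mean(values), pstdev(values)


# statistics over the 3 finetuning runs of each pretrained model
per_round = {name: summarize(vals) for name, vals in acc.items()}

# statistics over all 12 finetuning runs pooled together
overall = summarize([a for vals in acc.values() for a in vals])

for name, (m, s) in per_round.items():
    print(f"{name}: {m:.3f}/{s:.3f}")
print(f"all runs: {overall[0]:.3f}/{overall[1]:.3f}")
```

Pooling all 12 runs treats variance from pretraining and variance from finetuning together. The per-pretrain rows show how much of the spread comes from finetuning alone.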