coyo-vit's Issues

OverflowError: Python int too large to convert to C long

When trying to train the model following the fine-tuning instructions, I got this error:
"{path_to_my_local_folder}\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_datasets\vision_language\wit\wit.py", line 25, in
csv.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long
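This error typically occurs on Windows, where a C long is 32 bits even under 64-bit Python, so passing sys.maxsize to csv.field_size_limit overflows. A minimal workaround (a sketch of a local patch, not an official tensorflow_datasets fix) is to lower the limit until the C layer accepts it, either by editing the call in wit.py or running something like this before the import:

import csv
import sys

# On Windows, C long is 32-bit even under 64-bit Python, so
# sys.maxsize (2**63 - 1) overflows csv.field_size_limit.
# Halve the limit until the call succeeds.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit //= 2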

The tutorial code doesn't work

Thank you for sharing the pretrained model.

I tried running the code in the tutorial after adding the path of the ImageNet validation dataset and the vit-l/16 checkpoint (downloaded from the Hugging Face page).

I placed the downloaded checkpoint in ./outputs/checkpoint, as specified in the trainer.yaml file, but I got the error message Failed to find any matching files for ./outputs/checkpoint (it appears at the bottom of the log below). So I think something went wrong with the checkpoint.

Would you please help me with this issue?

Thank you in advance.


Here is the trainer.yaml I edited.

hydra:
  run:
    dir: ./outputs/checkpoint


defaults:
  - trainer: vit_b16_i1k

runtime:
  strategy: 'gpu' # one of ['cpu', 'tpu', 'gpu', 'gpu_multinode', 'gpu_multinode_async']
  use_mixed_precision: true

experiment:
  mode: eval  # 'train', 'train_eval', 'eval'
  debug: false
  save_dir: ${hydra:run.dir}
  comment: ???

Here is the bash script I ran.

python3 -m trainer trainer=vit_l16_i1k_downstream \
experiment.debug=false \
experiment.mode='eval'

And here is the error message.

~:$ source test.sh
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.13.0 and strictly below 2.16.0 (nightly versions are not supported). 
 The versions of TensorFlow you are currently using is 2.10.1 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
  warnings.warn(
/home/masaru-sasaki/work_space/coyo-vit/trainer.py:323: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="configs", config_name="trainer")
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'trainer': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-12-19 22:23:12,639][__main__][INFO] - Training with the following config:
trainer:
  dataset:
    train:
      cache: true
      supervised_key: label
      builder:
      - tfds_name: imagenet2012:5.0.0
        tfds_data_dir:
          your dir: null
        tfds_split: train
      dtype: bfloat16
      image_size: 384
      mixup_alpha: 0.0
      cutmix_alpha: 0.0
      preprocess:
      - type: InceptionCrop
        params:
          size: 384
      - type: random_hflip
      - type: normalize
        params:
          mean: 127.5
          std: 127.5
    validation:
      cache: true
      supervised_key: label
      builder:
      - tfds_name: imagenet2012:5.0.0
        tfds_data_dir: /mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/
        tfds_split: validation
      dtype: bfloat16
      image_size: 384
      mixup_alpha: 0.0
      cutmix_alpha: 0.0
      preprocess:
      - type: resize
        params:
          size:
          - 384
          - 384
      - type: normalize
        params:
          mean: 127.5
          std: 127.5
  backbone:
    backbone_name: vit-l/16
    backbone_params:
      image_size: 384
      representation_size: 0
      attention_dropout_rate: 0.0
      dropout_rate: 0.0
      channels: 3
    dropout_rate: 0.0
    cls_kernel_init:
      type: zeros
    cls_bias_init:
      type: zeros
    pretrained: null
  loss:
    class_name: CategoricalCrossentropy
    config:
      from_logits: true
      label_smoothing: 0.0
    l2_weight_decay: 0.0
  learning_rate:
    schedule_name: vit/cosine
    init_lr: 0.0
    base_lr: 0.06
    end_learning_rate: 0
    warmup_steps: 500
  optimizer:
    class_name: SGD
    config:
      momentum: 0.9
      global_clipnorm: 1.0
    moving_average_decay: 0.0
  metrics:
    metrics_list:
    - class_name: TopKCategoricalAccuracy
      config:
        k: 1
        name: top1_acc
    - class_name: TopKCategoricalAccuracy
      config:
        k: 5
        name: top5_acc
    - class_name: CategoricalAccuracy
  global_batch_size: 512
  local_batch_size: null
  epochs: 8
runtime:
  strategy: gpu
  use_mixed_precision: true
experiment:
  mode: eval
  debug: false
  save_dir: ${hydra:run.dir}
  comment: ???

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
[2023-12-19 22:23:15,087][tensorflow][INFO] - Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,090][tensorflow][INFO] - Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,091][__main__][INFO] - strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7efe2a178310>
[2023-12-19 22:23:15,092][__main__][INFO] - num_workers: 4
[2023-12-19 22:23:15,092][__main__][INFO] - local_batch_size: 128, global_batch_size: 512
[2023-12-19 22:23:15,092][root][INFO] - evaluate checkpoint: ./outputs/checkpoint
[2023-12-19 22:23:15,093][__main__][INFO] - Build dataset (is_training=False)
[2023-12-19 22:23:15,093][__main__][INFO] -    [{'tfds_name': 'imagenet2012:5.0.0', 'tfds_data_dir': '/mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/', 'tfds_split': 'validation'}]
[2023-12-19 22:23:15,093][root][INFO] - use TFDS: imagenet2012:5.0.0[validation]
[2023-12-19 22:23:15,636][absl][INFO] - Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
[2023-12-19 22:23:16,232][absl][INFO] - Load dataset info from /tmp/tmp8_aju2t8tfds
[2023-12-19 22:23:16,237][absl][INFO] - Field info.description from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.release_notes from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.citation from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.splits from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.module_name from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,239][root][INFO] - stacking dataset imagenet2012:5.0.0[validation] -> updated info: {'num_examples': 50000, 'num_shards': 64, 'num_classes': 1000}
[2023-12-19 22:23:16,575][__main__][INFO] - Build backbone (name=vit-l/16)
Model: "vision_transformer"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 pos_drop (Dropout)          multiple                  0         
                                                                 
 embedding (Conv2D)          multiple                  787456    
                                                                 
 encoderblock_0 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_1 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_2 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_3 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_4 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_5 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_6 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_7 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_8 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_9 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_10 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_11 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_12 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_13 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_14 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_15 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_16 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_17 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_18 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_19 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_20 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_21 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_22 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_23 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoder_nrom (LayerNormaliz  multiple                 2048      
 ation)                                                          
                                                                 
 extract_token (Lambda)      multiple                  0         
                                                                 
 pre_logits (Identity)       multiple                  0         
                                                                 
=================================================================
Total params: 303,690,752
Trainable params: 303,690,752
Non-trainable params: 0
_________________________________________________________________
[2023-12-19 22:23:23,693][__main__][INFO] - Compile the model...
[2023-12-19 22:23:23,694][__main__][INFO] - optimizer: <class 'keras.optimizers.optimizer_v2.gradient_descent.SGD'>
[2023-12-19 22:23:23,694][__main__][INFO] -     name: SGD
[2023-12-19 22:23:23,694][__main__][INFO] -     global_clipnorm: 1.0
[2023-12-19 22:23:23,694][__main__][INFO] -     learning_rate: 0.01
[2023-12-19 22:23:23,694][__main__][INFO] -     decay: 0.0
[2023-12-19 22:23:23,694][__main__][INFO] -     momentum: 0.9
[2023-12-19 22:23:23,694][__main__][INFO] -     nesterov: False
[2023-12-19 22:23:23,694][__main__][INFO] - Build loss: <class 'keras.losses.CategoricalCrossentropy'>
[2023-12-19 22:23:23,694][__main__][INFO] - Build metrics...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,700][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,705][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,709][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,710][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,715][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,716][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,720][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,721][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,725][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,726][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,736][__main__][INFO] - Build callbacks...
Error executing job with overrides: ['trainer=vit_l16_i1k_downstream', 'experiment.debug=false', 'experiment.mode=eval']
Traceback (most recent call last):
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 92, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2563, in restore
    status = self.read(save_path, options=options)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2441, in read
    result = self._saver.restore(save_path=save_path, options=options)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 1448, in restore
    reader = py_checkpoint_reader.NewCheckpointReader(save_path)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 96, in NewCheckpointReader
    error_translator(e)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 340, in train_main
    trainer.eval(config.experiment.save_dir)
  File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 311, in eval
    checkpoint.restore(ckpt)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2567, in restore
    raise errors_impl.NotFoundError(
tensorflow.python.framework.errors_impl.NotFoundError: Error when restoring from checkpoint or SavedModel at ./outputs/checkpoint: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint
Please double-check that the path is correct. You may be missing the checkpoint suffix (e.g. the '-1' in 'path/to/ckpt-1').

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
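As the final NotFoundError hints, tf.train.Checkpoint.restore() expects a checkpoint prefix (e.g. ./outputs/checkpoint/ckpt-1, with a matching ckpt-1.index file on disk), not the directory itself, and per the traceback trainer.eval() passes experiment.save_dir (here ${hydra:run.dir}, i.e. ./outputs/checkpoint) straight to it. A quick way to check whether anything restorable exists under that directory (a hypothetical diagnostic snippet, not part of coyo-vit):

import tensorflow as tf

# tf.train.latest_checkpoint() reads the 'checkpoint' state file in the
# directory and returns the newest restorable prefix, or None if no
# TF-format checkpoint (ckpt-N.index / ckpt-N.data-*) is found there.
ckpt_prefix = tf.train.latest_checkpoint("./outputs/checkpoint")
if ckpt_prefix is None:
    print("No TF checkpoint found under ./outputs/checkpoint; "
          "check that the downloaded files include an .index file.")
else:
    print("Restore from:", ckpt_prefix)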
