coyo-vit's Issues

OverflowError: Python int too large to convert to C long

When trying to train the model following the fine-tuning instructions, I got this error:
"{path_to_my_local_folder}\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_datasets\vision_language\wit\wit.py", line 25, in
csv.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long
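This error typically occurs on Windows, where a C long is 32 bits even under 64-bit Python, so passing sys.maxsize to csv.field_size_limit overflows. A minimal workaround (a sketch of a local patch, not an official tensorflow_datasets fix) is to lower the limit until the C layer accepts it, either by editing the call in wit.py or running something like this before the import:

import csv
import sys

# On Windows, C long is 32-bit even under 64-bit Python, so
# sys.maxsize (2**63 - 1) overflows csv.field_size_limit.
# Halve the limit until the call succeeds.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit //= 2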

The tutorial code doesn't work

Thank you for sharing the pretrained model.

I tried running the code in the tutorial after adding the path of the ImageNet validation dataset and the vit-l/16 checkpoint (downloaded from the Hugging Face page).

I placed the downloaded checkpoint in ./outputs/checkpoint, as specified in the trainer.yaml file, but I got the error message Failed to find any matching files for ./outputs/checkpoint (it appears at the bottom of the log below). So I think something went wrong with the checkpoint.

Would you please help me with this issue?

Thank you in advance.


Here is the trainer.yaml I edited.

hydra:
  run:
    dir: ./outputs/checkpoint


defaults:
  - trainer: vit_b16_i1k

runtime:
  strategy: 'gpu' # one of ['cpu', 'tpu', 'gpu', 'gpu_multinode', 'gpu_multinode_async']
  use_mixed_precision: true

experiment:
  mode: eval  # 'train', 'train_eval', 'eval'
  debug: false
  save_dir: ${hydra:run.dir}
  comment: ???

Here is the bash script I ran.

python3 -m trainer trainer=vit_l16_i1k_downstream \
experiment.debug=false \
experiment.mode='eval'

And here is the error message.

~:$ source test.sh
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.13.0 and strictly below 2.16.0 (nightly versions are not supported). 
 The versions of TensorFlow you are currently using is 2.10.1 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
  warnings.warn(
/home/masaru-sasaki/work_space/coyo-vit/trainer.py:323: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="configs", config_name="trainer")
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'trainer': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-12-19 22:23:12,639][__main__][INFO] - Training with the following config:
trainer:
  dataset:
    train:
      cache: true
      supervised_key: label
      builder:
      - tfds_name: imagenet2012:5.0.0
        tfds_data_dir:
          your dir: null
        tfds_split: train
      dtype: bfloat16
      image_size: 384
      mixup_alpha: 0.0
      cutmix_alpha: 0.0
      preprocess:
      - type: InceptionCrop
        params:
          size: 384
      - type: random_hflip
      - type: normalize
        params:
          mean: 127.5
          std: 127.5
    validation:
      cache: true
      supervised_key: label
      builder:
      - tfds_name: imagenet2012:5.0.0
        tfds_data_dir: /mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/
        tfds_split: validation
      dtype: bfloat16
      image_size: 384
      mixup_alpha: 0.0
      cutmix_alpha: 0.0
      preprocess:
      - type: resize
        params:
          size:
          - 384
          - 384
      - type: normalize
        params:
          mean: 127.5
          std: 127.5
  backbone:
    backbone_name: vit-l/16
    backbone_params:
      image_size: 384
      representation_size: 0
      attention_dropout_rate: 0.0
      dropout_rate: 0.0
      channels: 3
    dropout_rate: 0.0
    cls_kernel_init:
      type: zeros
    cls_bias_init:
      type: zeros
    pretrained: null
  loss:
    class_name: CategoricalCrossentropy
    config:
      from_logits: true
      label_smoothing: 0.0
    l2_weight_decay: 0.0
  learning_rate:
    schedule_name: vit/cosine
    init_lr: 0.0
    base_lr: 0.06
    end_learning_rate: 0
    warmup_steps: 500
  optimizer:
    class_name: SGD
    config:
      momentum: 0.9
      global_clipnorm: 1.0
    moving_average_decay: 0.0
  metrics:
    metrics_list:
    - class_name: TopKCategoricalAccuracy
      config:
        k: 1
        name: top1_acc
    - class_name: TopKCategoricalAccuracy
      config:
        k: 5
        name: top5_acc
    - class_name: CategoricalAccuracy
  global_batch_size: 512
  local_batch_size: null
  epochs: 8
runtime:
  strategy: gpu
  use_mixed_precision: true
experiment:
  mode: eval
  debug: false
  save_dir: ${hydra:run.dir}
  comment: ???

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
[2023-12-19 22:23:15,087][tensorflow][INFO] - Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,090][tensorflow][INFO] - Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,091][__main__][INFO] - strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7efe2a178310>
[2023-12-19 22:23:15,092][__main__][INFO] - num_workers: 4
[2023-12-19 22:23:15,092][__main__][INFO] - local_batch_size: 128, global_batch_size: 512
[2023-12-19 22:23:15,092][root][INFO] - evaluate checkpoint: ./outputs/checkpoint
[2023-12-19 22:23:15,093][__main__][INFO] - Build dataset (is_training=False)
[2023-12-19 22:23:15,093][__main__][INFO] -    [{'tfds_name': 'imagenet2012:5.0.0', 'tfds_data_dir': '/mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/', 'tfds_split': 'validation'}]
[2023-12-19 22:23:15,093][root][INFO] - use TFDS: imagenet2012:5.0.0[validation]
[2023-12-19 22:23:15,636][absl][INFO] - Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
[2023-12-19 22:23:16,232][absl][INFO] - Load dataset info from /tmp/tmp8_aju2t8tfds
[2023-12-19 22:23:16,237][absl][INFO] - Field info.description from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.release_notes from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.citation from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.splits from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.module_name from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,239][root][INFO] - stacking dataset imagenet2012:5.0.0[validation] -> updated info: {'num_examples': 50000, 'num_shards': 64, 'num_classes': 1000}
[2023-12-19 22:23:16,575][__main__][INFO] - Build backbone (name=vit-l/16)
Model: "vision_transformer"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 pos_drop (Dropout)          multiple                  0         
                                                                 
 embedding (Conv2D)          multiple                  787456    
                                                                 
 encoderblock_0 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_1 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_2 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_3 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_4 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_5 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_6 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_7 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_8 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_9 (Transformer  multiple                 12596224  
 Block)                                                          
                                                                 
 encoderblock_10 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_11 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_12 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_13 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_14 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_15 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_16 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_17 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_18 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_19 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_20 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_21 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_22 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoderblock_23 (Transforme  multiple                 12596224  
 rBlock)                                                         
                                                                 
 encoder_nrom (LayerNormaliz  multiple                 2048      
 ation)                                                          
                                                                 
 extract_token (Lambda)      multiple                  0         
                                                                 
 pre_logits (Identity)       multiple                  0         
                                                                 
=================================================================
Total params: 303,690,752
Trainable params: 303,690,752
Non-trainable params: 0
_________________________________________________________________
[2023-12-19 22:23:23,693][__main__][INFO] - Compile the model...
[2023-12-19 22:23:23,694][__main__][INFO] - optimizer: <class 'keras.optimizers.optimizer_v2.gradient_descent.SGD'>
[2023-12-19 22:23:23,694][__main__][INFO] -     name: SGD
[2023-12-19 22:23:23,694][__main__][INFO] -     global_clipnorm: 1.0
[2023-12-19 22:23:23,694][__main__][INFO] -     learning_rate: 0.01
[2023-12-19 22:23:23,694][__main__][INFO] -     decay: 0.0
[2023-12-19 22:23:23,694][__main__][INFO] -     momentum: 0.9
[2023-12-19 22:23:23,694][__main__][INFO] -     nesterov: False
[2023-12-19 22:23:23,694][__main__][INFO] - Build loss: <class 'keras.losses.CategoricalCrossentropy'>
[2023-12-19 22:23:23,694][__main__][INFO] - Build metrics...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,700][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,705][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,709][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,710][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,715][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,716][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,720][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,721][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,725][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,726][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,736][__main__][INFO] - Build callbacks...
Error executing job with overrides: ['trainer=vit_l16_i1k_downstream', 'experiment.debug=false', 'experiment.mode=eval']
Traceback (most recent call last):
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 92, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2563, in restore
    status = self.read(save_path, options=options)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2441, in read
    result = self._saver.restore(save_path=save_path, options=options)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 1448, in restore
    reader = py_checkpoint_reader.NewCheckpointReader(save_path)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 96, in NewCheckpointReader
    error_translator(e)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 340, in train_main
    trainer.eval(config.experiment.save_dir)
  File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 311, in eval
    checkpoint.restore(ckpt)
  File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2567, in restore
    raise errors_impl.NotFoundError(
tensorflow.python.framework.errors_impl.NotFoundError: Error when restoring from checkpoint or SavedModel at ./outputs/checkpoint: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint
Please double-check that the path is correct. You may be missing the checkpoint suffix (e.g. the '-1' in 'path/to/ckpt-1').

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
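As the final NotFoundError hints, tf.train.Checkpoint.restore() expects a checkpoint prefix (e.g. ./outputs/checkpoint/ckpt-1, with a matching ckpt-1.index file on disk), not the directory itself, and per the traceback trainer.eval() passes experiment.save_dir (here ${hydra:run.dir}, i.e. ./outputs/checkpoint) straight to it. A quick way to check whether anything restorable exists under that directory (a hypothetical diagnostic snippet, not part of coyo-vit):

import tensorflow as tf

# tf.train.latest_checkpoint() reads the 'checkpoint' state file in the
# directory and returns the newest restorable prefix, or None if no
# TF-format checkpoint (ckpt-N.index / ckpt-N.data-*) is found there.
ckpt_prefix = tf.train.latest_checkpoint("./outputs/checkpoint")
if ckpt_prefix is None:
    print("No TF checkpoint found under ./outputs/checkpoint; "
          "check that the downloaded files include an .index file.")
else:
    print("Restore from:", ckpt_prefix)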
