I have followed the instructions from the README and set up a TPU v3-8 machine, as the logs below confirm:
SSH key found in project metadata; not updating instance.
SSH: Attempting to connect to worker 0...
2022-05-10 10:30:25.858388: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319919: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319952: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0510 10:30:27.335715 140289404775488 xla_bridge.py:263] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0510 10:30:27.336199 140289404775488 xla_bridge.py:263] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter TPU Host
I0510 10:30:30.058175 140289404775488 train.py:65] Hello from process 0 holding 8/8 devices and writing to workdir gs://big_vision_exp/big_vision/workdir/05-10_1030.
I0510 10:30:30.568850 140289404775488 train.py:95] NOTE: Global batch size 1024 on 1 hosts results in 1024 local batch size. With 8 dev per host (8 dev total), that's a 128 per-device batch size.
I0510 10:30:30.570343 140289404775488 train.py:95] NOTE: Initializing train dataset...
I0510 10:30:31.039579 140289404775488 dataset_info.py:522] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I0510 10:30:31.303886 140289404775488 dataset_info.py:439] Load dataset info from /tmp/tmpggpl8znitfds
I0510 10:30:31.308489 140289404775488 dataset_info.py:492] Field info.description from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308714 140289404775488 dataset_info.py:492] Field info.release_notes from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308900 140289404775488 dataset_info.py:492] Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308959 140289404775488 dataset_info.py:492] Field info.module_name from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.309248 140289404775488 logging_logger.py:44] Constructing tf.data.Dataset imagenet2012 for split _EvenSplit(split='train[:99%]', index=0, count=1, drop_remainder=False), from gs://imagenet-1k/tensorflow_datasets/imagenet2012/5.1.0
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spsayakpaul/big_vision/train.py", line 372, in <module>
    app.run(main)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/spsayakpaul/big_vision/train.py", line 122, in main
    train_ds = input_pipeline.make_for_train(
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 69, in make_for_train
    data, _ = get_dataset_tfds(dataset=dataset, split=split,
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 53, in get_dataset_tfds
    return builder.as_dataset(
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 81, in decorator
    return function(*args, **kwargs)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 565, in as_dataset
    raise AssertionError(
AssertionError: Dataset imagenet2012: could not find data in gs://imagenet-1k/tensorflow_datasets. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
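For reference, the batch-size note in the log (global 1024 on 1 host, 8 devices, 128 per device) is plain arithmetic; a minimal sketch of how I read it (variable names are mine, not from the big_vision source):

```python
# Reproduce the batch-size split reported in the train.py log:
# global batch 1024, 1 host, 8 TPU cores per host.
global_batch = 1024
num_hosts = 1
devices_per_host = 8

local_batch = global_batch // num_hosts             # batch handled by this host
per_device_batch = local_batch // devices_per_host  # batch per TPU core

print(local_batch, per_device_batch)  # 1024 128
```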
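The AssertionError at the end says the prepared TFDS files for imagenet2012 are missing under gs://imagenet-1k/tensorflow_datasets, and suggests running dataset_builder.download_and_prepare() (or tfds.load(..., download=True)) first. TFDS expects a <data_dir>/<name>/<version> layout, so a quick way to check is to look for that directory. Here is a hedged, stdlib-only sketch of that layout check, using a hypothetical local path in place of the GCS bucket (for GCS you would list the bucket, e.g. with gsutil, instead of os.path):

```python
import os

def dataset_prepared(data_dir: str, name: str, version: str) -> bool:
    """Return True if the TFDS layout <data_dir>/<name>/<version> exists.

    This only illustrates the expected on-disk layout; it does not work for
    gs:// paths, which need a GCS listing instead.
    """
    return os.path.isdir(os.path.join(data_dir, name, version))

# Hypothetical local data_dir, used purely for illustration.
print(dataset_prepared("/tmp/tfds", "imagenet2012", "5.1.0"))
```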