captain-pool / gsoc
Repository for Google Summer of Code 2019 https://summerofcode.withgoogle.com/projects/#4662790671826944
License: MIT License
The quality of the image keeps degrading with each prediction call (...)
This happens for both the SavedModel exports of ESRGAN and Compressed ESRGAN.
Create unit tests for model, custom layers and loss functions
Set up checkpointing for saving intermediate steps during training.
Implement network interpolation for producing the final result.
Interpolation is to be done on the parameters between the Relativistic Average Generator and the PSNR-based generator.
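The interpolation step described above can be sketched as a per-parameter linear blend of the two trained checkpoints. This is a minimal sketch with plain NumPy dicts standing in for the two sets of generator weights; the value alpha=0.8 is illustrative, not necessarily what this repo uses:

```python
import numpy as np

def interpolate_weights(psnr_weights, gan_weights, alpha=0.8):
    """theta_interp = (1 - alpha) * theta_PSNR + alpha * theta_GAN, per parameter."""
    return {
        name: (1.0 - alpha) * psnr_weights[name] + alpha * gan_weights[name]
        for name in psnr_weights
    }

# Toy "checkpoints": one conv kernel each.
psnr_ckpt = {"conv1/kernel": np.zeros((3, 3))}
gan_ckpt = {"conv1/kernel": np.ones((3, 3))}
blended = interpolate_weights(psnr_ckpt, gan_ckpt, alpha=0.8)
```

Sliding alpha toward 0 favors the PSNR-oriented weights (sharper metrics), toward 1 the adversarially trained weights (better perceptual quality), without any retraining.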
Are you scaling the loss when using distribute strategy, or are you using the same setup as training using estimators?
Documentation Needed for:
Create Discriminator for the Model
Set up training for the PSNR-oriented model using L1 loss.
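The pixel-wise objective of the PSNR-oriented phase is a plain L1 (mean absolute error) between the ground-truth high-resolution image and the generator output. A minimal NumPy sketch of the loss itself:

```python
import numpy as np

def l1_loss(hr, sr):
    """Mean absolute error between the HR target and the super-resolved output."""
    return np.mean(np.abs(hr - sr))

hr = np.array([[1.0, 2.0], [3.0, 4.0]])
sr = np.array([[1.5, 2.0], [2.0, 4.0]])
loss = l1_loss(hr, sr)  # (0.5 + 0.0 + 1.0 + 0.0) / 4 = 0.375
```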
Currently the system uses two training files for the two phases of training. This can cause unwanted issues in the future and makes the model difficult to track.
The following updates are requested:
Is this code structure of having a different file per training phase followed elsewhere? Instead you could split your original model file into one module each for G, D, and the overall model, and add each training phase as a helper on the main model class - this way it's easier to track the behavior of the model across its entire lifetime in the pipeline. Originally posted by @srjoglekar246 in #28
The idea is to do the following: combine phase_1 and phase_2, convert the class-based approach to trainer functions, and add docstrings for functions, classes, and .py files.
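A hypothetical sketch of the requested structure (all names here are illustrative, not the repo's actual API): one shared model definition, with one trainer function per phase, so the second phase starts from the first phase's weights and the model's lifetime is visible in a single place:

```python
class Generator:
    """Shared model definition (stands in for the RRDB generator)."""
    def __init__(self):
        self.weights = [0.0]

def train_phase_1(model, steps):
    """PSNR-oriented warm-up (L1 loss) - toy update rule for illustration."""
    for _ in range(steps):
        model.weights = [w + 0.1 for w in model.weights]
    return model

def train_phase_2(model, steps):
    """Adversarial fine-tuning, starting from the phase-1 weights."""
    for _ in range(steps):
        model.weights = [w * 0.99 for w in model.weights]
    return model

# One pipeline, two phases, a single model object throughout.
model = train_phase_2(train_phase_1(Generator(), steps=2), steps=1)
```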
Build a residual model of lesser number of layers compared to ESRGAN to act as the student model.
Set up a loader for checkpoints from GCS to initialize the teacher Generator and Discriminator from ESRGAN.
Create the final exporter for the SavedModel from the trained model.
When running bash train.sh, the following error occurs. How can it be solved?
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 643, in load
dbuilder = _fetch_builder(
^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 498, in _fetch_builder
return builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 225, in builder
raise not_found_error
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 202, in builder
cls = builder_cls(str(name))
^^^^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 124, in builder_cls
cls = registered.imported_builder_cls(str(ds_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cld/.conda/envs/ml-tf/lib/python3.11/site-packages/tensorflow_datasets/core/registered.py", line 296, in imported_builder_cls
raise DatasetNotFoundError(f'Dataset {name} not found.')
tensorflow_datasets.core.registered.DatasetNotFoundError: Dataset image_label_folder not found.
Set up an automation script for training the model.
Create Bazel BUILD files for the trainer.
Dear captain-pool,
Thank you for releasing this!
I've been trying to train my own copy of E2_ESRGAN and have run into a few problems. These made me suspect I'm on a newer version of TensorFlow with a different API, so I hoped you could update the README to list which version of TensorFlow you used to make this work.
In particular, I'm running tf.__version__ '2.2.0'. When I set both steps to false in stats.yaml
and run python3 main.py --data_dir data_dir/ --log_dir log_dir/ --model_dir model_dir/ --phases "phase1_phase2",
I first got the error TypeError: tf__experimental_run_v2() missing 1 required positional argument: 'kwargs';
after I resolved that, I got the error TypeError: Variable is unhashable. Instead, use tensor.ref() as the key.;
and after I resolved that, I got the error TypeError: tf__reduce() got multiple values for argument 'axis'.
If you agree these errors arose because I'm on a different version of TF to you, please could you update the README to say the version you used? Thanks!
When the call function of the generator is decorated with tf.function, it raises an issue saying the model is trying to create variables on first call.
Write Bazel BUILD file.
Tensorflow version: tensorflow==2.0.0b0
Tensorflow Datasets Version: tfds-nightly==1.0.2.dev201906090105
Tensorflow Hub Version: tf-hub-nightly==0.5.0.dev201905270046
Code Raises
End of sequence [[node input_pipeline_task0/while/IteratorGetNext (defined at image_retraining_tpu.py:139) ]]
for all values of max_steps in TPUEstimator.train(...)
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=8
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=4
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=100
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=500
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=1000
GSOC/E1_TPU_Sample/image_retraining_tpu.py
Lines 135 to 139 in 513a0ec
Error starts from Line 230 of output.log
output.log
Add README.md for the project.
Set up checkpointing for the student network.
Set up training of the model (initialized with weights from the first phase) using the combined loss.
Set up the trainer for the MSE loss.
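The combined objective mentioned above is, in spirit, a weighted sum of the pixel-wise (MSE) term and the adversarial term. A toy sketch; the weight lam is illustrative, not the value used in this repo:

```python
def combined_loss(mse, adversarial, lam=1e-3):
    """Joint objective: pixel term plus a small adversarial term.

    `lam` balances fidelity (MSE) against the adversarial signal; its
    value here is a placeholder, not taken from this codebase.
    """
    return mse + lam * adversarial

loss = combined_loss(mse=0.5, adversarial=2.0, lam=1e-3)  # 0.5 + 0.002
```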
model_trainable_variables doesn't exist.
Tensorflow Version: 1.14
OS Version: Elementary OS Loki
Built from Source: No
$ python3 export.py
import tensorflow as tf
import tensorflow_hub as hub

module = hub.Module("onnx/shufflenet/1")
preds = module(tf.random_normal(shape=[1, 3, 224, 224], dtype=tf.float32))
with tf.Session() as sess:
    # hub.Module variables and tables must be initialized in TF1 sessions
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(preds))
Error Thrown
PyFunc:0 not Found
No Error
CC: @srjoglekar246
Automatically adjust the depth parameter d of the student network to match the accuracy of the teacher network.
For a huge model like this, multi-GPU training is the way to go.
Reference:
https://www.tensorflow.org/beta/tutorials/distribute/training_loops
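The linked tutorial covers custom training loops with tf.distribute.MirroredStrategy. Conceptually, synchronous data parallelism means each replica computes gradients on its own shard and the gradients are averaged (all-reduced) before the shared weights are updated; a NumPy sketch of that arithmetic, not of the TF API itself:

```python
import numpy as np

def replica_grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)^2 with respect to w, for one replica's shard."""
    return (w * x - y) * x

w = 1.0
shards = [(2.0, 4.0), (1.0, 3.0)]  # one (x, y) example per replica

# Each replica computes its local gradient; MirroredStrategy would all-reduce these.
grads = [replica_grad(w, x, y) for x, y in shards]
avg_grad = np.mean(grads)

# A single synchronized update of the shared weights.
w -= 0.1 * avg_grad
```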
Setup Trainer for Joint MSE and Adversarial Loss
Augment images using a function that can be mapped onto the dataset every iteration, using tf.data.Dataset.map(...), to produce new images.
Augmentation steps should include, but not be limited to:
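The shape of such a mapped augmentation function can be sketched as below. In the actual pipeline this would use tf.image ops and be applied via tf.data.Dataset.map; here a NumPy stand-in (random left-right flip only, as one illustrative step) keeps the sketch self-contained:

```python
import numpy as np

def augment(image, rng):
    """Randomly flip the image left-right; one of several augmentations
    that could be applied per-iteration via Dataset.map."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # reverse the width axis (H, W, C layout)
    return image

rng = np.random.default_rng(0)
img = np.arange(12, dtype=np.float32).reshape(2, 2, 3)
out = augment(img, rng)  # same shape and pixel values, possibly mirrored
```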
Hello, I found a performance issue in the definition of call in E3_Distill_ESRGAN/libs/models/student_rrdb.py: tf.math.add_n will be created repeatedly during program execution, resulting in reduced efficiency. I think it should be created before the loop in train_model_random.
Even after employing a video cache in a parallel setting, the player is too slow and has a lot of freeze frames.
Setup file and Directory structure for Initial Commit
Player crashes suddenly when TFLite inference is requested instead of SavedModel inference.
$ python3 player.py --file video.mp4 --tflite compressed_esrgan.tflite
Traceback (most recent call last):
File "player.py", line 208, in <module>
player.run()
File "player.py", line 172, in run
self.fetch_video()
File "player.py", line 125, in fetch_video
video = self.video_second()
File "player.py", line 115, in video_second
frames = pool.map(resolution_fn, frames)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "player.py", line 70, in tflite_super_resolve
self.interpreter.invoke()
File "/home/rick/tf2.0/env/lib/python3.5/site-packages/tensorflow/lite/python/interpreter.py", line 303, in invoke
self._ensure_safe()
File "/home/rick/tf2.0/env/lib/python3.5/site-packages/tensorflow/lite/python/interpreter.py", line 123, in _ensure_safe
data access.""")
RuntimeError: There is at least 1 reference to internal data
in the interpreter in the form of a numpy array or slice. Be sure to
only hold the function returned from tensor() if you are using raw
data access.