TensorFlow2-tutorial

Installation

git clone https://github.com/lambdal/TensorFlow2-tutorial.git
cd TensorFlow2-tutorial
virtualenv venv-tf2
. venv-tf2/bin/activate
pip install tf-nightly-gpu-2.0-preview==2.0.0.dev20190526

Tutorials Summary

See each tutorial's README for details.

01 Basic Image Classification

A tutorial on image classification with ResNet.

Data pipeline with TensorFlow Dataset API
Model pipeline with Keras (TensorFlow 2's official high-level API)
Multi-GPU training with a distribution strategy
Customized training with callbacks (TensorBoard, customized learning rate schedule)

02 Transfer Learning

This tutorial explains how to do transfer learning with TensorFlow 2. We will cover:

Handling a customized dataset
Restoring a backbone with Keras's applications API
Restoring a backbone from disk

03 Checkpoint

This tutorial explains how to use checkpoints to save and restore a model during training.

Use tf.keras.callbacks.ModelCheckpoint to save checkpoints
Resume training from a pre-saved checkpoint

04 Early Stopping

This tutorial explains how to implement early stopping in TensorFlow 2.

Use the tf.keras.callbacks.EarlyStopping callback to achieve early stopping.
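
The patience logic that an early-stopping callback applies can be sketched in plain Python. This is an illustration under assumptions, not the Keras implementation: the function name early_stopping_epoch and the sample validation losses are hypothetical. In TensorFlow 2 the comparable behaviour is configured through arguments such as monitor and patience on tf.keras.callbacks.EarlyStopping.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the monitored loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the loss kept improving often enough; never stopped


# Hypothetical validation losses, one value per epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
stop_epoch = early_stopping_epoch(losses, patience=3)
no_stop = early_stopping_epoch(losses, patience=5)
```

With a patience of 3, the three epochs after the best loss of 0.7 bring no improvement, so the sketch stops at epoch index 5 and never sees the later 0.69. A larger patience would let training continue to that point.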

05 Distributed Training Across Multi-Nodes

This tutorial explains how to do distributed training across multiple nodes:

Code boilerplate for multi-node distributed training
Run code across multiple machines
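
Multi-worker training in TensorFlow 2 is commonly configured through the TF_CONFIG environment variable, which describes the cluster and the role of the current machine. The sketch below builds that value in plain Python; it assumes the tutorial relies on TF_CONFIG (this summary does not say so), and the helper name make_tf_config and the host addresses are hypothetical.

```python
import json
import os


def make_tf_config(worker_hosts, task_index):
    """Build a TF_CONFIG JSON string for one worker in a cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must refer to one of worker_hosts")
    return json.dumps({
        # The same cluster description is used on every machine.
        "cluster": {"worker": list(worker_hosts)},
        # The task entry differs per machine: it says which worker this is.
        "task": {"type": "worker", "index": task_index},
    })


# Hypothetical addresses for two machines.
workers = ["10.0.0.1:12345", "10.0.0.2:12345"]

# On the first machine (index 0). The second machine would use index 1.
os.environ["TF_CONFIG"] = make_tf_config(workers, task_index=0)
```

The variable should be set before the distribution strategy is created, since the strategy reads the cluster layout when it is constructed.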

tensorflow2-tutorial's Issues

Maintenance of repo

Hi,

Thanks for the tutorials. They are really great!
I would like to contribute to the repo, probably with new tutorials, and I already have a pull request for a new tutorial about training on large datasets.

Do you maintain the repo, and are you open to pull requests? If so, are there any guidelines for preparing them?

Best,
Ilker

tensorflow2 version required no longer available through pip

pip install tf-nightly-gpu-2.0-preview==2.0.0.dev20190526

ERROR: Could not find a version that satisfies the requirement tf-nightly-gpu-2.0-preview==2.0.0.dev20190526 (from versions: none)
ERROR: No matching distribution found for tf-nightly-gpu-2.0-preview==2.0.0.dev20190526

Resnet56 Training Results, compared to another Tensorflow 2 Model

I'm trying to learn TF2+Keras, and came across your great examples. I'm working my way through the image classification example. I'm comparing this to what I think is the official TF 2.0 Keras-based approach(?), here, which I'll call MOVIR (after the first initials in the path of the python file, within tensorflow/models).

When I run MOVIR (vanilla, no command line arguments), I get the following training & validation results (Val Acc=93.22%):

390/390 - 19s - loss: 0.1213 - categorical_accuracy: 0.9995
78/78 - 2s - loss: 0.4276 - categorical_accuracy: 0.9322

I've modified your example resnet_cifar.py, setting the number of GPUs to 1 and changing the number of epochs and the LR schedule to match MOVIR:

182, [(0.1, 91), (0.01, 136), (0.001, 182)]
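
One way to read the setting above is as a piecewise-constant schedule over 182 epochs. The sketch below assumes each pair is (multiplier, start epoch) applied to a base learning rate of 0.1; neither the pair semantics nor the base rate is stated in this issue, and the function name learning_rate_at is hypothetical.

```python
NUM_EPOCHS = 182
BASE_LR = 0.1  # assumed base learning rate, not given in the issue
LR_SCHEDULE = [(0.1, 91), (0.01, 136), (0.001, 182)]  # assumed (multiplier, start_epoch)


def learning_rate_at(epoch, base_lr=BASE_LR, schedule=LR_SCHEDULE):
    """Return the learning rate for a 0-based epoch index."""
    lr = base_lr
    for multiplier, start_epoch in schedule:
        if epoch >= start_epoch:
            lr = base_lr * multiplier
        else:
            break  # schedule is sorted by start_epoch, so later entries cannot apply
    return lr


# Learning rate at the start of training and at each boundary.
for epoch in (0, 90, 91, 136, NUM_EPOCHS - 1):
    print(epoch, learning_rate_at(epoch))
```

Under these assumptions the rate stays at the base value until epoch 91, drops by a factor of ten there, and drops by another factor of ten at epoch 136. The last entry starts at epoch 182, so it never takes effect in a 182-epoch run.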

The final results I get using the modified resnet_cifar.py are (Val Acc=78.32%):

Epoch 182/182
390/390 [==============================] - 30s 78ms/step - loss: 0.1410 - sparse_categorical_accuracy: 1.0000 - val_loss: 1.1042 - val_sparse_categorical_accuracy: 0.7832
78/78 [==============================] - 1s 17ms/step - loss: 1.1042 - sparse_categorical_accuracy: 0.7832

Both are using Resnet56 and BS=128.
[I'm using Windows10, TitanXP, Python 3.7, tf-nightly-gpu-2.1.0.dev20191028.]

Questions

Do you get similar results? If so, any idea on why the val acc results are low?
Thought: I don't see how MOVIR does augmentation, so I don't see how they reach a relatively decent result.

Additional Information
I noticed that the Resnet56 model that MOVIR uses is slightly different than the one that you use. So, I thought I'd try switching them. Here is the summary (LL = Lambda Labs). So, I think the performance difference is not due to the slightly different models.

Training File   Resnet56 Model   Final Validation Accuracy   ~s/Epoch
LL              LL               78.32%                      31
LL              MOVIR            78.52%                      27
MOVIR           LL               93.46%                      21
MOVIR           MOVIR            93.22%                      19

Doubt in augmentation

The augmentation part mentions that train_dataset.map(augmentation) will provide an inflated training dataset. I am unsure about this: doesn't map just change the original inputs rather than add to the dataset size? Could you please confirm?
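
The distinction raised here can be illustrated with a plain-Python analogy of a per-element map. This is a sketch, not tf.data code: the augment function, the toy 1-D "images", and the seed are all hypothetical. It shows that a one-to-one map yields exactly one output per input, so the number of samples per epoch stays the same, while random augmentation can still present a different variant of each sample on each pass.

```python
import random


def augment(sample, rng):
    """Hypothetical augmentation: randomly mirror a 1-D 'image'."""
    return sample[::-1] if rng.random() < 0.5 else sample


def one_epoch(dataset, rng):
    # Per-element map: one (possibly modified) output for each input.
    return [augment(sample, rng) for sample in dataset]


dataset = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
rng = random.Random(0)

epoch_a = one_epoch(dataset, rng)
epoch_b = one_epoch(dataset, rng)

# The epoch length equals the original dataset size in both passes.
size_unchanged = len(epoch_a) == len(dataset) == len(epoch_b)
```

In this reading, any "inflation" comes from the variety seen across epochs rather than from a larger dataset within a single epoch.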

Distributed training - Parameter Server Strategy

Dear Chuan Li,

Thank you very much for sharing the distributed-training example for TensorFlow's MultiWorkerMirroredStrategy. Do you think it would be possible to use ParameterServerStrategy with the rest of your code unchanged?

Thank you!