
yellowfin's Introduction

YellowFin

YellowFin is an auto-tuning optimizer based on momentum SGD that requires no manual specification of learning rate and momentum. It measures the objective landscape on the fly and tunes the momentum as well as the learning rate using a local quadratic approximation.

The implementation here can be a drop-in replacement for any optimizer in TensorFlow. After from yellowfin import YFOptimizer, it supports both minimize and apply_gradients like any TensorFlow optimizer; a minimal usage sketch is shown below. We also provide an interface to manually set the learning rate schedule at every iteration for finer control (see the Detailed guidelines section).
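
A minimal usage sketch, assuming a scalar loss tensor built from your model (a toy objective is used here so the snippet is self-contained). The tf.gradients call in the second option is illustrative; depending on the version, YFOptimizer may not provide its own compute_gradients method.

import tensorflow as tf
from yellowfin import YFOptimizer

# Toy objective standing in for your model's loss.
w = tf.Variable(1.0)
loss = tf.square(w)

optimizer = YFOptimizer()

# Option 1: single-call API, like any TensorFlow optimizer.
train_op = optimizer.minimize(loss)

# Option 2: compute gradients yourself, then hand them to apply_gradients.
# (In practice you would pick one of the two options.)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
train_op = optimizer.apply_gradients(zip(grads, tvars))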

For more technical details, please refer to our paper YellowFin and the Art of Momentum Tuning.

For more usage details, please refer to the inline documentation of tuner_utils/yellowfin.py. Example usage can be found here for CIFAR and PTB.

YellowFin is under active development. Many members of the community have kindly submitted issues and pull requests. We are incorporating fixes and smoothing things out. As a result the repository code is in flux. Please make sure you use the latest version and submit any issues you might have!

Updates

[2017.08.06] Switched to logarithmic smoothing to accelerate adaptation to curvature range trends.

[2017.08.06] Added a feature to correct estimation bias from sparse gradients.

[2017.08.11] Added multi-GPU training support with a better standardized code structure.

[2017.08.16] Replaced the numpy root solver with a closed-form solution using Vieta's substitution for the cubic equation. This resolves the stability issue of the numpy root solver.

[2017.10.29] Major fix for stability. We added eps to protect fractions in our code, as well as an adaptive clipping feature to properly deal with exploding gradients (manual clipping is still supported, as described in the detailed guidelines below).

Setup instructions for experiments

Please clone the master branch and follow the instructions to run YellowFin on a ResNet for CIFAR10, a Bottleneck ResNet on CIFAR100 for image recognition, an LSTM on Penn Treebank for language modeling, a Char Rnn LSTM on TinyShakespeare, and an LSTM on the Wall Street Journal dataset for constituency parsing. The CIFAR and PTB models we use are slightly adapted from the official TensorFlow ResNet and LSTM examples. The Char Rnn LSTM and the parsing LSTM are adapted from the Char Rnn repo and the Parsing LSTM repo, respectively. Thanks to the researchers for developing these models.

YellowFin is tested under Tensorflow 1.1 and Python 2.7.

Download data

Please use the data/download.sh script to download the CIFAR10/100 and Penn Treebank datasets. It may take a few minutes depending on your network speed. The other datasets are included in the repo.

cd data
bash download.sh

Run CIFAR10/100 ResNets experiments

The experiments on the 110-layer ResNet with CIFAR10 and the 164-layer ResNet with CIFAR100 can be launched using

cd cifar/scripts
python CIFAR10-release.py --log_dir=path_to_log --opt_method=YF (for CIFAR10)
python CIFAR100-release.py --log_dir=path_to_log --opt_method=YF (for CIFAR100)

Run Penn Treebank LSTM experiments

The experiments on the multi-layer LSTM on Penn Treebank can be launched using

cd ptb/scripts
python PTB-release.py --opt_method=YF --log_dir=path_to_log

Run Char Rnn LSTM experiments

The experiments on the Char Rnn LSTM with the TinyShakespeare dataset can be launched using

cd char-rnn-tensorflow
python train_YF.py --log_dir=path_to_log --data_dir=./data/tinyshakespeare/ --opt_method=YF

Run constituency parsing LSTM experiments

The experiments on constituency parsing with the Wall Street Journal (WSJ) dataset can be launched using

cd parsing
mkdir -p models/wsj && python train.py --data_path=wsj --model_path=models/wsj/model --log_dir=path_to_log --opt_method="YF"

Note that the WSJ dataset is not publicly available. Please contact us or the authors of the Parsing LSTM repo for access to the data. The data can be preprocessed following the instructions in the Parsing LSTM repo. You should then be able to run our scripts on the preprocessed data.

Detailed guidelines

  • Basic use: YFOptimizer() with default arguments is the uniform setting (i.e., without per-experiment tuning) used for all the PyTorch and TensorFlow experiments in our paper.

  • Interface for finer manual control: If you want to control the learning rate more finely, use lr_factor in the YFOptimizer class. E.g., to use a manually set constant learning rate, assign desired_lr / self._lr_var to self.lr_factor before applying the gradients at each iteration (a sketch is given after this list). If you want to use the typical learning-rate-dropping technique after a certain number of epochs, please refer to the example here. (The learning_rate and momentum arguments are dummies, kept only for backward compatibility.)

  • Gradient clipping: The default setting uses adaptive gradient clipping to prevent gradient explosion, thresholding the gradient norm at the square root of our estimated maximal curvature. We recommend first fully turning off gradient clipping, and only turning it on when necessary.

    • If you want to set the clipping threshold manually, first pass use_adapt_grad_clip=False when initializing the YFOptimizer to turn off adaptive clipping. You may then use the clip_thresh=thresh_norm_of_gradient argument when initializing the YFOptimizer to threshold the gradient norm, or you can do the gradient clipping outside of YFOptimizer (see the sketch after this list).
    • If you want to fully turn off gradient clipping inside YFOptimizer, set use_adapt_grad_clip=False when initializing YFOptimizer and do not pass clip_thresh.
  • Normalization: When using log-probability-style losses, please make sure the loss is properly normalized. In some RNN/LSTM cases, the cross entropy needs to be averaged over the number of samples in a minibatch. With some TensorFlow loss functions, it may also need to be averaged over the number of classes and the sequence length of each sample. E.g., the cross_entropy loss here needs to be normalized by the sequence length and the minibatch size (see the sketch after this list).
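
The following sketch illustrates the lr_factor interface from the finer-control item above. It assumes lr_factor is a tf.Variable and _lr_var holds the auto-tuned learning rate, as named in the guideline; the exact attribute types may differ across versions of tuner_utils/yellowfin.py, and the toy objective and desired_lr value are only illustrative.

import tensorflow as tf
from yellowfin import YFOptimizer

# Toy objective so the sketch is self-contained.
w = tf.Variable(5.0)
loss = tf.square(w)

opt = YFOptimizer()
train_op = opt.minimize(loss)

# Pin the effective learning rate to a constant by assigning
# desired_lr / _lr_var to lr_factor before each gradient application.
desired_lr = 0.01
lr_factor_ph = tf.placeholder(tf.float32, shape=[])
set_lr_factor = tf.assign(opt.lr_factor, lr_factor_ph)  # assumes lr_factor is a tf.Variable

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        current_lr = sess.run(opt._lr_var)
        sess.run(set_lr_factor, feed_dict={lr_factor_ph: desired_lr / current_lr})
        sess.run(train_op)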
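
The gradient clipping options from the bullets above can be selected at construction time. The argument names use_adapt_grad_clip and clip_thresh are taken from the guideline, and the threshold value here is only an example.

from yellowfin import YFOptimizer

# Default: adaptive clipping, thresholding the gradient norm at the square
# root of the estimated maximal curvature.
opt_default = YFOptimizer()

# Manual threshold: turn adaptive clipping off and clip the gradient norm at
# a fixed value instead.
opt_manual = YFOptimizer(use_adapt_grad_clip=False, clip_thresh=10.0)

# Clipping fully off inside YFOptimizer (do any clipping outside if needed).
opt_no_clip = YFOptimizer(use_adapt_grad_clip=False)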
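
For the normalization point, here is a sketch of a properly normalized RNN/LSTM cross-entropy loss; the shapes and zero tensors are illustrative stand-ins for your model's outputs and targets.

import tensorflow as tf

batch_size, seq_len, num_classes = 32, 20, 10000
logits = tf.zeros([batch_size, seq_len, num_classes])      # stand-in for model output
labels = tf.zeros([batch_size, seq_len], dtype=tf.int32)   # stand-in for targets

# Per-position cross entropy, shape [batch_size, seq_len].
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)

# Average over both the minibatch and the sequence length so the loss is
# properly normalized before handing it to YFOptimizer.
loss = tf.reduce_mean(cross_entropy)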

Citation

If you use YellowFin in your work, please cite the paper:

@article{zhang2017yellowfin,
  title={YellowFin and the Art of Momentum Tuning},
  author={Zhang, Jian and Mitliagkas, Ioannis and R{\'e}, Christopher},
  journal={arXiv preprint arXiv:1706.03471},
  year={2017}
}

Acknowledgement

We thank Jack Hessel and Mladen Fernežir for contributing to the codebase.

Implementation for other platforms

For PyTorch users, we provide a PyTorch implementation in the YellowFin PyTorch repo.

We thank the contributors of YellowFin implementations in different deep learning frameworks.

yellowfin's People

Contributors

jiangoforit, jmhessel, mfernezir


yellowfin's Issues

error in running yellowfin_test

Hello,
I want to use YF in my own code, so first I am trying to run yellowfin_test.py, but it gives me an AssertionError at line 88 of the code. Any help is appreciated!

Issue comparing to default optimizer setting in cifar10 in tensorflow tutorials

I have tried replacing the optimizer with YellowFin in the cifar10 model from the TensorFlow tutorials, but it did not perform well, much worse than the original SGD with learning rate decay.

The original code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

My code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = YFOptimizer(lr=1.0, mu=0.0)
    # opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    grads = opt.compute_gradients(total_loss)

  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

I simply copied yellowfin.py from Zehaos's yellowfin.py, which adds a compute_gradients function.

Did I miss something?

LinAlgError (Array must not contain infs or NaNs) thrown in get_mu_tensor

Below is a simple piece of code to try YellowFin on my dataset.

x = tf.placeholder( tf.float32, [ None, train_x.shape[ 1 ] ] )
y = tf.placeholder( tf.float32, [ None, train_y.shape[ 1 ] ] )
m = tf.layers.dense( x, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, train_y.shape[ 1 ] )
prediction = tf.nn.softmax( m )
loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits( labels=y, logits=m ) )
optimizer = yellowfin.YFOptimizer().minimize( loss )

s = tf.Session()
s.run( tf.global_variables_initializer() )
for epoch in range( epochs ):
    _, h = s.run( [ optimizer, loss ], feed_dict={ x: train_x, y: train_y } )

Usually, it crashes and throws the following exception.

Caused by op 'update_hyper/cond/PyFuncStateless', defined at:
  File "test2.py", line 47, in <module>
    optimizer = yf.YFOptimizer( learning_rate=1., momentum=0. ).minimize( loss )
  File "/data/python-mp-test/libs/yellowfin.py", line 268, in minimize
    return self.apply_gradients(grads_and_vars)
  File "/data/python-mp-test/libs/yellowfin.py", line 223, in apply_gradients
    update_hyper_op = self.update_hyper_param()
  File "/data/python-mp-test/libs/yellowfin.py", line 191, in update_hyper_param
    lambda: self._mu_var) )
  File "/usr/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch
    original_result = fn()
  File "/data/python-mp-test/libs/yellowfin.py", line 190, in <lambda>
    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),
  File "/data/python-mp-test/libs/yellowfin.py", line 173, in get_mu_tensor
    roots = tf.py_func(np.roots, [coef], Tout=tf.complex64, stateful=False)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 201, in py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 56, in _py_func_stateless
    Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

UnknownError (see above for traceback): LinAlgError: Array must not contain infs or NaNs
	 [[Node: update_hyper/cond/PyFuncStateless = PyFuncStateless[Tin=[DT_FLOAT], Tout=[DT_COMPLEX64], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/cpu:0"](update_hyper/cond/ScatterUpdate)]]

Keras compatibility -- easy addition?

Hi! Thanks for posting this code. I thought that I would give YF a try as a drop-in optimizer. Currently, I am using Keras, and I was able to modify your code to run on Keras models by doing the following:

  • Adding a compute_gradients standalone method
  • Adding a few checks for gradients being None in apply_gradients and after_apply
  • Wrapping the YFOptimizer object in a Keras TFOptimizer

However, while it runs and my loss goes down, I am not 100% sure I did everything properly. Do you think you might consider adding this support?

`lr` vs `learning_rate`

Just gave yellowfin a try yesterday and it works nicely! Just a minor comment/suggestion:

What do you think about renaming lr to learning_rate for consistency with other TensorFlow optimizers? I can open a small PR.

bug? lr command line argument is ignored for YF and instead 1.0 is used

In https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/model.py#L92
the lr is set to 1 and not to the command-line argument value.
Later, in https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/train_YF.py#L138
the learning rate is set to the command-line argument value, but for YF this has no effect because the connection between the variable model.lr and YF is never made (for Adam and SGD this works because model.lr is passed as the learning rate).

Add YellowFin to tensor2tensor

I am trying to adapt YellowFin to be usable as an optimizer in tensor2tensor (it uses tensorflow>=1.2.0rc1), but unfortunately I cannot debug this error:

Steps to reproduce

  1. Clone this repo.
  2. Launch the starter.sh script (inside a Docker container is better).
  3. (Optional Docker container command) nvidia-docker run -it -v $(pwd):/t2t -p 6006:6006 -w /t2t tensorflow/tensorflow:latest-devel-gpu.

Error

Using YellowFin
INFO:tensorflow:Computing gradients for global model_fn.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'training/update_hyper/cond/assert_equal/Assert/Assert' type=Assert>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model\n    model_fn_ops = self._get_train_ops(features, labels)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops\n    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn\n    model_fn_results = self._model_fn(features, labels, **kwargs)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 520, in model_fn\n    colocate_gradients_with_ops=True)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 293, in optimize_loss\n    name="train")', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 1154, in apply_gradients\n    gradients, global_step=global_step, name=name)', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 222, in apply_gradients\n    update_hyper_op = self.update_hyper_param()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 190, in update_hyper_param\n    lambda: self._mu_var) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond\n    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch\n    original_result = fn()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 189, in <lambda>\n    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 180, in get_mu_tensor\n    tf.assert_equal(tf.size(root), tf.constant(1) )', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 318, in assert_equal\n    return control_flow_ops.Assert(condition, data, summarize=summarize)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-07-06 14:31:31.807218: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807260: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807285: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.855132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-06 14:31:31.855471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 670MX
major: 3 minor: 0 memoryClockRate (GHz) 0.601
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.60GiB
2017-07-06 14:31:31.855541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-07-06 14:31:31.855567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-07-06 14:31:31.855606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670MX, pci bus id: 0000:01:00.0)
2017-07-06 14:31:32.895272: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895276: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895446: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895327: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895466: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895573: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895625: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895675: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895693: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895545: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.897115: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.901863: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902270: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902804: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903010: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903597: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904450: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904735: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.907982: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:33.041912: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

Caused by op u'global_step/read', defined at:
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 952, in _train_model
    global_step = contrib_framework.create_global_step(g)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 133, in create_global_step
    return training_util.create_global_step(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 119, in create_global_step
    collections=[ops.GraphKeys.GLOBAL_VARIABLES, ops.GraphKeys.GLOBAL_STEP])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 319, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1303, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n    config=self._session_config', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n    stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n    stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n    self._sess = _RecoverableSession(self._coordinated_creator)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n    _WrappedSession.__init__(self, self._create_session())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n    return self._sess_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n    self.tf_sess = self._session_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n    self._scaffold.finalize()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n    default_ready_for_local_init_op)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n    op = default_constructor()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n    variables.global_variables())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, 
**kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================

If you do not want to help or contribute, please close the issue and forgive me.
Otherwise, I will appreciate any help :)

I've also tried to write YellowFin as a tf.train.Optimizer, but going down to the C++ level seems to be beyond my skills at the moment...

Global Step is not updating?

As seen in Zehaos/MobileNet#27, the global step does not update after each training step has been taken. Is there a fix for this coming soon? I have tried both the older version of yellowfin.py in that issue and the latest one available. In both instances, the global step doesn't update.

I believe the issue comes from the global step variable existing only within the optimizer but not globally. As a quick fix, I moved the definition of the global step (at https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L60) out of the optimizer and directly into the graph, before feeding this variable back to the optimizer.

Is there a cleaner solution to this?

no such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md'

(tensorflow) ➜  models git:(master) ✗ pip install YellowFin
Collecting YellowFin
  Using cached Yellowfin-1.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-jykvuD/YellowFin/setup.py", line 7, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "/home/canoe/Project/tensorflow/lib/python2.7/codecs.py", line 896, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jykvuD/YellowFin/

Cannot use optimizer on GPU device.

Running on GPU device I get the following error:

Cannot assign a device for operation 'apply_updates/exDeepFm/embedding/embedding_layer/YellowFin': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:                                                                                                                
Colocation group had the following types and devices:                                                                                 
SparseApplyMomentum: CPU                                                                                                   
Shape: GPU CPU                                                                                                                        
Square: GPU CPU                                                                                                                       
Unique: GPU CPU                                                                                                          
Cast: GPU CPU                                                                                                              
UnsortedSegmentSum: GPU CPU                                                                                                           
Identity: GPU CPU                                                                                                                     
Assign: GPU CPU                                                                                                                       
StridedSlice: GPU CPU                                                                                                                 
Const: GPU CPU                                                                                                                        
VariableV2: GPU CPU
TruncatedNormal: GPU CPU
Gather: GPU CPU
Fill: GPU CPU
Mul: GPU CPU
Add: GPU CPU

PyPI release?

It would be interesting to try out YFOptimizer but it's too tricky to install the package at the moment. Is a PyPI release in the works so we can do pip install yellowfin?

Open Source License

Thanks for sharing the yellowfin code on GitHub; I just tried it out in one of my projects and got good results. I am just wondering if you are planning to add an open source license in the future so that people can use it (of course with proper acknowledgement) in their projects and don't have to remove the yellowfin code sections when sharing their projects, e.g., on GitHub.

Change license approval to integrate YF in T2T

@JianGoForIt, as I said in other issues, I was trying to adapt YF to be usable in tensor2tensor, and after my PR to definitively integrate YF into T2T, a license problem was raised. Once the PR is accepted it will override your MIT License, so the T2T authors need your OK (approval) to keep the PR; otherwise we cannot use your code. This is the PR.

Bad performance in multiple GPUs

I used YellowFin to train ResNet50 on ImageNet using 4 K80 GPUs and got bad performance. After 50k steps, the training loss was about 6, while SGD without momentum or learning rate decay reached about 4.7. Any idea what causes this phenomenon?

Swap in replacement of AdamOptimizer causes crash

        self.opt_q =  YFOptimizer().minimize(self.vae_discriminator_loss, var_list=q_vars)
  File "xxx\src\yellowfin.py", line 215, in apply_gradients
    after_apply_op = self.after_apply()
  File "xxx\src\yellowfin.py", line 139, in after_apply
    self._grad_squared.append(tf.square(g) )
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 412, in square
    return gen_math_ops.square(x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2585, in square
    result = _op_def_lib.apply_op("Square", x=x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 509, in apply_op
    (input_name, err))
ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.
PS xxx>

If I switch back to Adam, it works fine. Not sure what is up.
