
googlecloudplatform / cloudml-samples


Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples

Home Page: https://cloud.google.com/ai-platform/docs/

License: Apache License 2.0

Python 50.71% Shell 2.18% Jupyter Notebook 46.69% OpenEdge ABL 0.08% Dockerfile 0.34%
cloudml cloudml-samples keras-tensorflow keras gcp samples

cloudml-samples's People

Contributors

alecglassford, amygdala, andrewferlitsch, cfezequiel, chiefkarlin, dan-anghel, davidcavazos, dependabot[bot], dizcology, dmkinney, elibixby, elmer-garduno, fprost, gogasca, happyhuman, joshgc, ksalama, kweinmeister, lerrytang, lfloretta, luotigerlsx, nicain, nnegrey, puneith, puneithk, remiturpaud, renovate-bot, sirtorry, thedriftofwords, wenzhel101


cloudml-samples's Issues

Flowers sample pipeline fails, update to Apache Beam

I've been using pipeline.py for several weeks without problems. I was gone for a week, and it seems something has changed. I am now receiving

(c31f81dff566e4d3): Workflow failed. Causes: (c31f81dff566e6d5): Unable to bring up enough workers: minimum 1, actual 0. Please check your quota and retry later, or please try in a different zone.
when calling

python pipeline.py \
    --project ${PROJECT} \
    --cloud \
    --train_input_path gs://api-project-773889352370-ml/Hummingbirds/trainingdata.csv \
    --eval_input_path gs://api-project-773889352370-ml/Hummingbirds/testingdata.csv \
    --input_dict gs://api-project-773889352370-ml/Hummingbirds/dict.txt \
    --deploy_model_name "DeepMeerkat" \
    --gcs_bucket ${BUCKET} \
    --output_dir "${GCS_PATH}/" \
    --sample_image_uri  gs://api-project-773889352370-ml/Hummingbirds/Positives/10000.jpg  

My repo is here; it's essentially a clone with changed path names:

https://github.com/bw4sz/DeepMeerkat/blob/master/training/pipeline.py

I know there have been some recent changes to the Python Apache Beam SDK. Could this be related?

I am running on a cloud compute engine instance, authenticated without issue.

Missing regularization in flowers model

This is not exactly an issue, more a missing feature: in the flowers model, the loss function only takes the data loss into account, and no regularization (such as L2 regularization) is available other than dropout. I am currently retraining the model with my own dataset and it suffers from overfitting, so adding regularization might help.
Is this choice due to ML considerations, or was regularization left out just to keep the example simple because the flowers dataset didn't need it?
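
For reference, a minimal sketch (TF 1.x; logits, labels, and the weight-decay value are placeholders, not the sample's actual code) of the kind of L2 term I have in mind:

import tensorflow as tf

def loss_with_l2(logits, labels, weight_decay=1e-4):
  # Data loss: softmax cross-entropy averaged over the batch.
  data_loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
  # L2 penalty over all trainable weight matrices (biases are skipped).
  l2_loss = tf.add_n([
      tf.nn.l2_loss(v) for v in tf.trainable_variables()
      if 'bias' not in v.name.lower()])
  return data_loss + weight_decay * l2_loss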

Unable to create model version - Model artifacts size exceeds 1GB

Hi,
I'm trying to deploy a version of my model with cloud ml. My training artifacts are 3GB.
I get the following error message. I was wondering if there was a way around this.

ERROR: (gcloud.beta.ml.models.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri Error: The total size of files in gs://model-export-bucket/ is 3118066753 bytes, which exceeds the allowed maximum of 1073741824 bytes.

Thanks,
Surya

Updates to Flowers tutorial gcloud beta ml syntax.

It looks like there are a few changes to the gcloud beta ml syntax since this tutorial was made.

In sample.sh,

line 6 reads

PROJECT=$(gcloud config list project --format "value(core.project)")

should read

PROJECT=$(gcloud beta config list project --format "value(core.project)")

In lines 70 and 76,

gcloud beta ml versions

is now

gcloud beta ml models versions

Also note that USER is never defined in the tutorial, but it is required; otherwise GCS_PATH ends up with an awkward __ (double underscore) in it, and Dataflow doesn't seem to like the empty folder name.

These fixes made sample.sh work for me, running from the Docker image specified here:

docker pull gcr.io/cloud-datalab/datalab:local

Unable to run preprocess.py for reddit_tft -- additional documentation needed

This tutorial looks awesome and very useful for what I am trying to learn. However, when running the reddit_tft data preprocessing step, I am encountering some issues that I think could be made clearer in the README documentation.

System information

  • Running via the active Cloud Shell within the Cloud Console
  • TensorFlow 1.2.1 (pip freeze | grep tensorflow shows tensorflow==1.2.1)
  • Python 2.7.9 (default, Jun 29 2016, 13:08:31) [GCC 4.9.2] on linux2
  • tensorflow-transform 0.1.10

Describe the problem

pip install tensorflow-transform does not work. The documentation links here, but that documentation simply says to run 'pip install tensorflow-transform'. However, to run on the cloud you actually need 'sudo pip install tensorflow-transform' in order to have the correct permissions (see stack trace below). This should be explicitly called out.

The quick start tutorial only makes sure you have "A GCP account with the Cloud ML Engine and Cloud Storage APIs activated", but for this you also need the Dataflow API activated.

I think the preprocessing section of the tutorial should explicitly call out the commands that need to be run, as it wasn't immediately obvious to me.

Source code / logs

The traceback is:

Installing collected packages: avro, protobuf, httplib2, oauth2client, pyyaml, dill, google-apitools, proto-google-cloud-datastore-v1, googledatastore, google-cloud-core, google-cloud-bigquery, apache-beam, tensorflow-transform
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/avro'

Why is the output layer a value of 25?

As this is a classification problem with only two classes, I would expect the output layer to have only 2 neurons.

Judging by the hidden-layer sizes [100, 70, 50, 25], it looks like the output layer has 25 units.

model name error in flowers/sample.sh

I get the following error:

gcloud ml-engine versions create "$VERSION_NAME" \
  --model "$MODEL_NAME" \
  --origin "${GCS_PATH}/training/model" \
  --runtime-version=1.0
ERROR: (gcloud.ml-engine.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri Error: With v1 endpoint, the model directory gs://yt-8m-158920-ml/slcott/flowers_slcott_20170331_163758/training/model is expected to contain exactly one of the following: 'saved_model.pb' file or 'saved_model.pbtxt' file. Please use SavedModel.

at: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/sample.sh#L75

MODEL_NAME is flowers and VERSION_NAME is v1, but it seems to expect something else?

census: only train loss and only eval accuracy showing up in tensorboard

For problems running the sample code please provide the following information.

System information

  • Mac
  • TensorFlow 1.2
  • Python 2.7

I am following the TensorFlow census sample, running the ml-engine package locally, and inspecting the TensorBoard output. For loss I get the training loss but not the eval loss, and for accuracy I get the eval accuracy but not the training accuracy.

Should I be seeing both train and eval for both loss/accuracy scalars? If this is expected, how can I output both train and eval for both metrics?

ImportError: No module named apache_beam

Hi,

While running the flowers sample, I am getting this error:

python trainer/preprocess.py \
  --input_dict "$DICT_FILE" \
  --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
  --output_path "${GCS_PATH}/preproc/eval" \
  --cloud
Traceback (most recent call last):
  File "trainer/preprocess.py", line 71, in <module>
    import apache_beam as beam
ImportError: No module named apache_beam

I have followed the requirements guidelines. I don't know why I am getting this error. Any help, please?

Preprocessing for online predictions

Hello

This is more of a question than a bug report, or could be construed as a feature request. We are in the process of setting up Google Cloud ML for a case where we need online predictions for a text classification problem. From the Cloud ML documentation, it seems like making a REST API call to a deployed model is the way to go.

However, the documentation is unclear on what kind of preprocessing is supported in the deployed model in the online setting. We plan to pass a dictionary of {string: double} features (think TF-IDF) and let the model take care of preprocessing such as converting words to their corresponding indices. One way is to have specific ops in my TensorFlow graph that convert strings to the corresponding features and to specify the 'inputs' collection appropriately (sketched below). However, for more complex features, doing feature wrangling inside a TensorFlow graph quickly becomes tedious. Is there any way to specify a custom preprocessing step for online prediction before the TensorFlow graph is invoked?
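
To make the in-graph option concrete, here is a minimal sketch (TF 1.x Estimator export APIs; the vocabulary path and feature names are placeholders, not our real setup) of a serving input function that does the word-to-index lookup inside the graph:

import tensorflow as tf

def serving_input_fn():
  # Raw words arrive in the online prediction request; the lookup happens in-graph.
  words = tf.placeholder(tf.string, shape=[None, None], name='words')
  table = tf.contrib.lookup.index_table_from_file(
      vocabulary_file='gs://my-bucket/vocab.txt',  # placeholder path
      default_value=0)
  features = {'word_ids': table.lookup(words)}
  return tf.estimator.export.ServingInputReceiver(features, {'words': words})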

While the iris code written by @joshgc was super helpful, it also does preprocessing before making a batch prediction call to the deployed model, which we are trying to avoid.

Thanks!
Rajhans

p.s. Hey Josh... nice work! :)

Missing self in flowers adaptive_wait method

Looks like there is a small error in the flowers example pipeline module. I know you currently don't accept contributions, so I'll just point it out here.

The prediction attempt called from the adaptive_wait method as part of deploy_model is missing a self. As a result, after successful deployment of the model, prediction attempts will quietly fail until max wait time is exceeded. Changing that line to self.predict(self.args.sample_image_uri) should fix it.

Running the flowers example hangs / reports it's going to take days to process

I'm trying to execute the flowers preprocessing example in Google Cloud Dataflow.

It reports back that it is going to take over 15 days to complete the pre-processing of the images.

I'm following this example line for line -

https://codelabs.developers.google.com/codelabs/cpb102-txf-learning/index.html?index=..%2F..%2Findex#4

I've also tried to follow the instructions from here and kick off the pipeline from my machine using the gcloud CLI -

https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow

Looks like other users are reporting a similar issue (just 5 days ago) -
https://plus.google.com/102370439237675402204/posts/5rbTgExPeUy

reddit_tft tokenization

For the reddit_tft transformation, tokenization is limited to tf.string_split(), which splits on whitespace. Does tensorflow-transform have any ops to help with other basic tokenization, such as first converting all text to lowercase and filtering out special characters? I can't find any samples that handle this preprocessing with tensorflow-transform. The seq2seq tutorial has some basic tokenization built in, but it's not clear to me whether tft can replicate the preprocessing functionality it uses. A possible workaround is sketched below.
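
In the meantime, a minimal sketch of the workaround I mean (plain Python inside the Beam preprocessing step, before tf.Transform sees the text; the field name is a placeholder, not the sample's actual schema):

import re
import apache_beam as beam

def normalize_comment(row):
  # Lowercase and strip special characters before tokenization.
  text = row['comment_body'].lower()  # placeholder field name
  row['comment_body'] = re.sub(r'[^a-z0-9\s]', ' ', text)
  return row

# Illustrative usage inside a pipeline:
# rows | 'NormalizeText' >> beam.Map(normalize_comment)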

mnist hyperparameter tuning example fails

Hi. I was running through the examples to ensure my environment was working, and this example failed:


(cloudml) paddlescoot@paddlescoot-Satellite-P870:~/google-cloud-ml/samples/mnist/hptuning$ gcloud beta ml jobs describe --project ${PROJECT_ID} ${JOB_NAME}
createTime: '2016-10-03T07:32:09Z'
endTime: '2016-10-03T07:44:00Z'
errorMessage: 'Too many hyperparameter tuning metrics were written by Hyperparameter
  Tuning Trial #1.'
jobId: mnist_hptuning_hullpod_central
startTime: '2016-10-03T07:32:12Z'
state: FAILED
trainingInput:
  args:
  - --train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz
  - --eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz
  - --output_path=gs://molten-method-145221-ml/mnist_hptuning_hullpod_central/output
  hyperparameters:
    goal: MAXIMIZE
    maxParallelTrials: 2
    maxTrials: 10
    params:
    - maxValue: 400.0
      minValue: 40.0
      parameterName: hidden1
      scaleType: UNIT_LINEAR_SCALE
      type: INTEGER
    - maxValue: 250.0
      minValue: 5.0
      parameterName: hidden2
      scaleType: UNIT_LINEAR_SCALE
      type: INTEGER
    - maxValue: 0.5
      minValue: 0.0001
      parameterName: learning_rate
      scaleType: UNIT_LOG_SCALE
      type: DOUBLE
  packageUris:
  - gs://molten-method-145221-ml/cloudmldist/1475479926/trainer-0.0.0.tar.gz
  pythonModule: trainer.task
  region: us-central1
  scaleTier: STANDARD_1
trainingOutput: {}

Flowers example config for free compute tier?

I'm starting out with Cloud ML and have the free compute tier. I've pulled the Docker image, installed everything correctly, and run the flowers tutorial here:

https://cloud.google.com/blog/big-data/2016/12/how-to-train-and-classify-images-using-google-cloud-machine-learning-and-cloud-dataflow

When running

## Preprocess the eval set.
    python trainer/preprocess.py \
      --input_dict "$DICT_FILE" \
      --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
      --output_path "${GCS_PATH}/preproc/eval" \
      --cloud


Can the preprocess.py script be configured to limit the number of instances to comply with free-tier rules? I am looking into the Dataflow args; a rough sketch of what I have in mind is below.
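
A minimal sketch (assuming a Beam SDK where options live under apache_beam.options; preprocess.py builds its options differently, so this is only illustrative) of capping the Dataflow worker count:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder project id
    '--temp_location=gs://my-bucket/tmp',  # placeholder bucket
    '--num_workers=1',                     # start with a single worker
    '--max_num_workers=1',                 # and cap autoscaling at one
])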

No module named util error in flowers sample

When I try to run the flowers sample I get the following error:

Scotts-MBP:flowers slcott$ ./sample.sh 

Using job id:  flowers_slcott_20170331_101454

# Takes about 30 mins to preprocess everything.  We serialize the two
# preprocess.py synchronous calls just for shell scripting ease; you could use
# --runner DataflowRunner to run them asynchronously.  Typically,
# the total worker time is higher when running on Cloud instead of your local
# machine due to increased network traffic and the use of more cost efficient
# CPU's.  Check progress here: https://console.cloud.google.com/dataflow
python trainer/preprocess.py \
  --input_dict "$DICT_FILE" \
  --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
  --output_path "${GCS_PATH}/preproc/eval" \
  --cloud
No handlers could be found for logger "oauth2client.contrib.multistore_file"
/usr/local/lib/python2.7/site-packages/apache_beam/coders/typecoders.py:132: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==0.5.5
  Using cached google-cloud-dataflow-0.5.5.tar.gz
  Saved /var/folders/z2/pb7n1y692j96gj4q1wgwll400000gn/T/tmpqz_elG/google-cloud-dataflow-0.5.5.tar.gz
Successfully downloaded google-cloud-dataflow
Traceback (most recent call last):
  File "trainer/preprocess.py", line 452, in <module>
    main(sys.argv[1:])
  File "trainer/preprocess.py", line 448, in main
    run(arg_dict)
  File "trainer/preprocess.py", line 379, in run
    configure_pipeline(p, in_args)
  File "/usr/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 170, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 711, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(8db641b87e775c9): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 666, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 411, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 230, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 346, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 406, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named util

Scotts-MBP:flowers slcott$ ls
README.md		pipeline.py		setup.py
__init__.py		requirements.txt	trainer
images_to_json.py	sample.sh
Scotts-MBP:flowers slcott$ cd trainer
Scotts-MBP:trainer slcott$ ls
__init__.py	preprocess.py	util.py
model.py	task.py		util.pyc
Scotts-MBP:trainer slcott$ cd ..

Seems to be an issue with Apache Beam...
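
For reference, a commonly suggested fix for "No module named util" on Dataflow workers is to make sure the trainer package is shipped with the job. A minimal sketch (option names from the Beam Python SDK; the import path is for recent releases, and I have not checked whether the flowers sample already sets these):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(['--runner=DataflowRunner'])
# Package the local trainer/ directory (via the sample's setup.py) for the workers.
options.view_as(SetupOptions).setup_file = './setup.py'
# Pickle main-session globals so module-level references resolve on workers.
options.view_as(SetupOptions).save_main_session = True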

ImportError: No module named tensorflow

System information

  • OS Platform and Distribution: Windows 10 Home 10.0.15063
  • TensorFlow version: 1.2.0
  • Python version: 3.5.3
  • Exact command to reproduce: gcloud ml-engine local train --module-name trainer.task --package-path trainer/ -- --train-files %TRAIN_DATA% --eval-files %EVAL_DATA% --train-steps 1000 --job-dir %MODEL_DIR% --eval-steps 100

Problem description

I am trying to get the census example from Getting Started to work on the command line in Windows. I am using Anaconda to set up the Python environment. TensorFlow is installed; it shows up in pip, and I can import it without problems from the Python REPL. However, the module cannot be found when running through gcloud ml-engine local train.
I have not modified the example code.

Source code / logs

Traceback (most recent call last):
  File "C:\Users\<USER>\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\bundledpython\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Users\<USER>\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\bundledpython\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Users\<SOME_PATH>\cloudml-samples-master\census\estimator\trainer\task.py", line 4, in <module>
    import model
  File "trainer\model.py", line 20, in <module>
    import tensorflow as tf
ImportError: No module named tensorflow

"Skipping evaluation due to same checkpoint"

System information

  • Mac OS, running via ML engine local train or cloud train
  • TF 1.2
  • Python 2.7

I posted this to Stack Overflow and got no responses. The census sample is only running evaluations once per epoch even though I have set the min-eval-frequency flag. The error message suggests that since the checkpoint is only updated once per epoch, the evaluation is also only run once per epoch, and the Stackdriver logs output "Skipping evaluation due to same checkpoint..." every min-eval-frequency steps (every 100 steps with the default). Is there an additional flag that must be set in the export strategy, or something similar, to have checkpoints exported more frequently? Alternatively, is there some way using min-eval-frequency to override this "skipping evaluation" behavior so that evaluation runs continuously? A sketch of the first option is below.
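
To be concrete about the first option, a minimal sketch (TF 1.x contrib.learn RunConfig; the value is illustrative, not the sample's default) of making checkpoints, and therefore evaluations, happen more often than once per epoch:

import tensorflow as tf

run_config = tf.contrib.learn.RunConfig(
    save_checkpoints_steps=100)  # write a checkpoint every 100 training steps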

Thanks!

Cloud ML SDK for Python 3

Is there any timeline for when you plan to add Python 3 support to the Cloud ML SDK? Many other cloud libraries already support Python 3 (like Storage), and it would be great to have it for the ML SDK as well.

Thanks!

ImportError: No module named nets

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 8.8
  • TensorFlow version (use command below): 1.2.1
  • Python version: 2.7
  • Exact command to reproduce: bash sample.sh
  • Tensorflow Transform environment (if applicable, see below):
    tensorflow==1.2.1
    apache-beam==2.0.0

To obtain the Tensorflow and Tensorflow Transform environment do

pip freeze |grep tensorflow
pip freeze |grep apache-beam

My problem

When I run sample.sh, I get the error below.
I set $PYTHONPATH to include /home/claire_chen/models/slim.

Does someone know how to fix this problem? Thanks!!

Error message

Traceback (most recent call last):
  File "trainer/preprocess.py", line 492, in <module>
    main(sys.argv[1:])
  File "trainer/preprocess.py", line 488, in main
    run(arg_dict)
  File "trainer/preprocess.py", line 396, in run
    configure_pipeline(p, in_args)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 183, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 778, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(c0a22c82b32c7b96): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 705, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 445, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named nets
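
One workaround I am considering (not from the sample; the option name is from the Beam Python SDK and the paths are placeholders): PYTHONPATH only affects the local launcher process, not the Dataflow workers, so the slim package would have to be shipped to the workers, e.g.:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(['--runner=DataflowRunner'])
# A source tarball built from tensorflow/models/slim (for example with a local
# `python setup.py sdist`), shipped to every Dataflow worker.
options.view_as(SetupOptions).extra_packages = ['/home/claire_chen/slim-0.1.tar.gz']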

Flower Tutorial - Python Versions - Windows Impossible?

What versions of Python were used to test the flowers tutorial? pipeline.py imports both apache_beam and tensorflow; on Windows, that combination does not seem possible.

I just updated to Python 3.5 64-bit to accommodate TensorFlow 1.0,


but now I find that apache_beam returns

C:\Users\Ben\Documents\cloudml-samples\flowers>python pipeline.py
Traceback (most recent call last):
  File "pipeline.py", line 29, in <module>
    import apache_beam as beam
  File "C:\Program Files\Python35\lib\site-packages\apache_beam\__init__.py", line 72, in <module>
    'It is not supported on Python [%s].' % sys.version_info)
TypeError: not all arguments converted during string formatting

Looking in the __init__ file:

  'Dataflow SDK for Python is supported only on Python 2.7. '

I can boot up an Ubuntu Docker container if needed; it's just less convenient for using my code editor.

EDIT: On second thought, why do we need TensorFlow locally at all? My whole hope is to use GCP so I don't have to train locally. Any workaround?

Line 343 uses TensorFlow's file_io module:

"""Produces a JSON request suitable to send to CloudML Prediction API.

Args:
  uri: The input image URI.
  output_json: File handle of the output json where request will be written.
"""
def _open_file_read_binary(uri):
  try:
    return file_io.FileIO(uri, mode='rb')
  except errors.InvalidArgumentError:
    return file_io.FileIO(uri, mode='r')

Is there any way I can go to the source and copy this function, so I don't need the whole module?

Pin dependencies in installation instructions for reddit_tft

Reading directly from the BigQuery tables

$ python preprocess.py --training_data fh-bigquery.reddit_comments.2015_12 \
>                      --eval_data fh-bigquery.reddit_comments.2016_01 \
>                      --predict_data fh-bigquery.reddit_comments.2016_02 \
>                      --output_dir $GCS_PATH/preproc \
>                      --project_id $PROJECT \
>                      --cloud

We get the following error.

(343abf01bbde6d65): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 544, in do_work
    work_executor.execute()
  File "dataflow_worker/executor.py", line 971, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:30533)
    with op.scoped_metrics_container:
  File "dataflow_worker/executor.py", line 972, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:30481)
    op.start()
  File "dataflow_worker/executor.py", line 499, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:17992)
    def start(self):
  File "dataflow_worker/executor.py", line 500, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:17886)
    with self.scoped_start_state:
  File "dataflow_worker/executor.py", line 505, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:17087)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 212, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1126, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'flatten_value_to_list'

F tensorflow/core/platform/cloud/http_request.cc:334] Check failed: that->post_body_read_ <= that->post_body_buffer_.size()

Hi,

While running the reddit sample I am getting the following error:

python preprocess.py --training_data fh-bigquery.reddit_comments.2015_12 \
>                      --eval_data fh-bigquery.reddit_comments.2016_01 \
>                      --predict_data fh-bigquery.reddit_comments.2016_02 \
>                      --output_dir $GCS_PATH/preproc \
>                      --project_id $PROJECT \
>                      --cloud
No handlers could be found for logger "oauth2client.contrib.multistore_file"
2017-06-23 12:41:21.869701: F tensorflow/core/platform/cloud/http_request.cc:334] Check failed: that->post_body_read_ <= that->post_body_buffer_.size() 
Abort trap: 6

Any idea why I am getting this error?

Thanks

Wrongly identifies unsupported TF version

When checking the development environment with check_environment.py after installing TF from source, it spits out the error message "ERROR: Unsupported TensorFlow version: 0.12.head (minimum 0.11.0rc0)." The reason is that parse_version in the code cannot correctly identify a version format like "0.12.head" due to PEP 440; an illustration is below.
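
A small illustration (using pkg_resources from an older setuptools, where non-PEP 440 strings become LegacyVersion objects that sort before every regular version):

from pkg_resources import parse_version

# A source build reports "0.12.head", which is not a valid PEP 440 version,
# so it compares as lower than the minimum "0.11.0rc0" and the check fails.
print(parse_version('0.12.head') >= parse_version('0.11.0rc0'))  # False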

Where to change Label Count for flowers pipeline

I am trying to train a model using the flowers pipeline, but I need to change the label count from the default 5 (+1 for no label) to 2. My network will be binary.

If you don't change anything, the model will be built successfully, but it will have lots of empty classes in prediction. I am unsure about the effect this will have on the model, but it can't be very good.

KEY                                                                PREDICTION  SCORES
gs://api-project-773889352370-ml/Hummingbirds/Positives/10000.jpg  5           [3.398074573460197e-12, 3.3369501606372864e-11, 8.217751479300262e-11, 6.728390761212566e-11, 2.157501824764929e-12, 1.0]

In model.py, you can change Line 81 from

  parser.add_argument('--label_count', type=int, default=5)

to

  parser.add_argument('--label_count', type=int, default=2)

but that seems to control only part of the model; Stackdriver gives the following log errors.

The key part is:

Assign requires shapes of both tensors to match. lhs shape= [512,3] rhs shape= [512,6]

(The [512,3] vs. [512,6] mismatch suggests one part of the setup is still built for 6 outputs, i.e. 5 labels + 1, while another part uses 3, i.e. 2 labels + 1.)

I might suggest adding a top-level flag --label_count, as this will be a common thing for users to try.

Thanks!

Different workers always process the same batch

In the census example's model.input_fn() function, for both tf.train.shuffle_batch() and tf.train.batch(), unless shared_name is specified both of these queues are replicated for all workers and the master.

This causes all workers to process all batches in the same order as every other worker (i.e. the same batch at the same time if they are going at the same speed). I don't know if this is intended behaviour; I expected each worker to process a new batch each time it took one from the queue.

The code I am referring to is under /census/tensorflowcore/trainer/model.py in the model.input_fn() function; the specific bit is:

  if shuffle:
    features = tf.train.shuffle_batch(
        features,
        batch_size,
        min_after_dequeue=2 * batch_size + 1,
        capacity=batch_size * 10,
        num_threads=multiprocessing.cpu_count(),
        enqueue_many=True,
        allow_smaller_final_batch=True
    )
  else:
    features = tf.train.batch(
        features,
        batch_size,
        capacity=batch_size * 10,
        num_threads=multiprocessing.cpu_count(),
        enqueue_many=True,
        allow_smaller_final_batch=True
    )
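
For clarity, a sketch of the kind of change I have in mind, in the same context as the snippet above (I have not verified that adding shared_name alone is sufficient, since sharing also depends on where the queue ops are placed in distributed training):

  if shuffle:
    features = tf.train.shuffle_batch(
        features,
        batch_size,
        min_after_dequeue=2 * batch_size + 1,
        capacity=batch_size * 10,
        num_threads=multiprocessing.cpu_count(),
        enqueue_many=True,
        allow_smaller_final_batch=True,
        shared_name='census_input_queue'  # hypothetical name for a queue shared by all workers
    )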

Flowers sample preprocessing fails to run on cloud

Running sample.sh fails at preprocessing.

The issue seems to be with pickling. The traceback ends with: pickle.PicklingError: Can't pickle <class 'gtk.__main__.GInitiallyUnowned'>: it's not found as gtk.__main__.GInitiallyUnowned. I have apache-beam[gcp]==0.6.0 installed and my dill version is 0.2.6.

Here's the full Traceback:

  File "trainer/preprocess.py", line 482, in <module>
    main(sys.argv[1:])
  File "trainer/preprocess.py", line 478, in main
    run(arg_dict)
  File "trainer/preprocess.py", line 386, in run
    configure_pipeline(p, in_args)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 170, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 160, in run
    pickler.dump_session(os.path.join(tmpdir, 'main_session.pickle'))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 224, in dump_session
    dill.dump_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 350, in dump_session
    pickler.dump(main)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 124, in save_module
    return old_save_module(pickler, obj)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 1197, in save_module
    state=_main_dict)
  File "/usr/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 160, in new_save_module_dict
    return old_save_module_dict(pickler, obj)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 841, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 1258, in save_type
    StockPickler.save_global(pickler, obj)
  File "/usr/lib/python2.7/pickle.py", line 754, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <class 'gtk.__main__.GInitiallyUnowned'>: it's not found as gtk.__main__.GInitiallyUnowned

check_environment.py error: code 400, FAILED_PRECONDITION

I ran check_environment.py and got the following error:

ERROR: Unable to list Cloud ML models: {
  "error": {
    "code": 400,
    "message": "Field: parent Error: Please make sure that Google Cloud Machine Learning API is enabled for the project.",
    "status": "FAILED_PRECONDITION"
  }
}

I don't understand what this means.
Could someone please provide a solution to this problem?

Failed training job for reddit

I'm a beginner with Python and Tensorflow.

I ran the pre-processing script without a problem.

I trained the linear model.

However, when I train the deep model, it fails.

Note that I changed:

workerCount: 5
parameterServerCount: 2

Error:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 29, in <module>
    from tensorflow_transform.saved import input_fn_maker
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/input_fn_maker.py", line 23, in <module>
    from tensorflow_transform.saved import saved_transform_io
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/saved_transform_io.py", line 26, in <module>
    from tensorflow_transform.saved import saved_model_loader
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/saved_model_loader.py", line 21, in <module>
    from tensorflow.python.saved_model import loader_impl
ImportError: cannot import name loader_impl

The replica worker 0 and replica worker 1 also exited with a non-zero status of 1 and the identical traceback.

To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=283482469300&resource=ml_job%2Fjob_id%2Freddit_comments_deep_daniel_20170417_152317&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22reddit_comments_deep_daniel_20170417_152317%22

Training input:

{
"scaleTier": "CUSTOM",
"masterType": "complex_model_m",
"workerType": "complex_model_m",
"parameterServerType": "complex_model_m",
"workerCount": "5",
"parameterServerCount": "2",
"packageUris": [
"gs://naturalscore-159511-mlengine/reddit_comments_deep_daniel_20170417_152317/f9641cc367d4b498367f681a16628d6f0f9287e62ad01e9e9ca586fdf690c94d/trainer-1.0.tar.gz"
],
"pythonModule": "trainer.task",
"args": [
"--model_type",
"deep",
"--hidden_units",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"1062",
"--batch_size",
"512",
"--eval_steps",
"250",
"--output_path",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/output/reddit_comments_deep_daniel_20170417_152317",
"--raw_metadata_path",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/preproc/raw_metadata",
"--transformed_metadata_path",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/preproc/transformed_metadata",
"--transform_savedmodel",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/preproc/transform_fn",
"--eval_data_paths",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/preproc/features_eval*",
"--train_data_paths",
"gs://naturalscore-159511-mlengine/daniel/reddit_comments/preproc/features_train*"
],
"region": "us-central1"
}

Iris pipeline fails when run locally

Using the current master branch (7a002e7) and tensorflow version:

python -c 'import tensorflow as tf; print(tf.__version__)'  # for Python 2
0.11.0rc1

Ubuntu version

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial

Python version and command:

python --version
Python 2.7.12

cd cloudml-samples2/iris
python preprocess.py #no errors
python pipeline.py

I get the following warnings and error.

INFO:tensorflow:Loss for final step: 0.00384905.
WARNING:tensorflow:Given features: {'measurements': <tf.Tensor 'ParseExample/ParseExample:1' shape=(30, 4) dtype=float32>, 'key': <tf.Tensor 'ParseExample/ParseExample:0' shape=(30, 1) dtype=string>}, required signatures: {'measurements': TensorSignature(dtype=tf.float32, shape=TensorShape([Dimension(30), Dimension(4)]), is_sparse=False), 'key': TensorSignature(dtype=tf.string, shape=TensorShape([Dimension(30), Dimension(1)]), is_sparse=False)}.
WARNING:tensorflow:Given targets: Tensor("ParseExample/ParseExample:2", shape=(30, 1), dtype=int64), required signatures: TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(30), Dimension(1)]), is_sparse=False).
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='measurements', dimension=4, default_value=None, dtype=tf.float32, normalizer=None)
WARNING:tensorflow:Please specify metrics using MetricSpec. Using bare functions or (key, fn) tuples is deprecated and support for it will be removed on Oct 1, 2016.
WARNING:tensorflow:Please specify metrics using MetricSpec. Using bare functions or (key, fn) tuples is deprecated and support for it will be removed on Oct 1, 2016.
INFO:tensorflow:Restored model from /tmp/tmpxzTBGqtrain_161023_125858_f826/model/train
INFO:tensorflow:Eval steps [0,100) for training step 5000.
INFO:tensorflow:Results after 10 steps (0.002 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 20 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 30 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 40 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 50 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 60 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 70 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 80 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 90 steps (0.002 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
INFO:tensorflow:Results after 100 steps (0.001 sec/batch): loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9.
W tensorflow/core/kernels/queue_base.cc:294] _6_input_producer: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
W tensorflow/core/kernels/queue_base.cc:294] _8_batch/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors.CancelledError'>, Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

Caused by op u'input_producer/input_producer_EnqueueMany', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 276, in <module>
    main()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 271, in main
    output_dir=output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 105, in run
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 300, in train_and_evaluate
    name=eval_dir_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 461, in evaluate
    steps=steps, metrics=metrics, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 399, in evaluate
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 758, in _evaluate_model
    features, targets = input_fn()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 85, in input_fn
    _, examples = util.read_examples(data_paths, batch_size, shuffle)
  File "trainer/util.py", line 100, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 196, in string_input_producer
    summary_name="fraction_of_%d_full" % capacity)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 140, in input_producer
    enq = q.enqueue_many([input_tensor])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 371, in enqueue_many
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1018, in _queue_enqueue_many
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 756, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

WARNING:tensorflow:Coordinator didn't stop cleanly: Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

Caused by op u'input_producer/input_producer_EnqueueMany', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 276, in <module>
    main()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 271, in main
    output_dir=output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 105, in run
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 300, in train_and_evaluate
    name=eval_dir_suffix)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 461, in evaluate
    steps=steps, metrics=metrics, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 399, in evaluate
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 758, in _evaluate_model
    features, targets = input_fn()
  File "/home/tensorflowworkspace/cloudml-samples2/iris/trainer/task.py", line 85, in input_fn
    _, examples = util.read_examples(data_paths, batch_size, shuffle)
  File "trainer/util.py", line 100, in read_examples
    filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 196, in string_input_producer
    summary_name="fraction_of_%d_full" % capacity)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 140, in input_producer
    enq = q.enqueue_many([input_tensor])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 371, in enqueue_many
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1018, in _queue_enqueue_many
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 756, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: input_producer/input_producer_EnqueueMany = QueueEnqueueMany[Tcomponents=[DT_STRING], _class=["loc:@input_producer"], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer, input_producer/Identity)]]

INFO:tensorflow:Saving evaluation summary for 5000 step: loss = 0.186699, training/hptuning/metric = 0.9, accuracy = 0.9

UnboundLocalError: local variable 'summaries' referenced before assignment

I'm on Windows 10 64-bit, with Python 3.5 and TensorFlow 1.10. I've set the arguments to point to local training and evaluation files according to the guidelines.
I was trying to run the census sample from the tensorflowcore folder as a local Python program; however, I received an UnboundLocalError in the _run_eval method:

            with coord.stop_on_exception():
                eval_step = 0
                while self._eval_steps is None or eval_step < self._eval_steps:
                    summaries, final_values, _ = session.run([self._summary_op, self._final_ops_dict, self._eval_ops])
                    tf.logging.info("TESTING FORMAT: {}".format(summaries))
                    if eval_step % 100 == 0:
                        tf.logging.info("On Evaluation Step: {}".format(eval_step))
                    eval_step += 1
            # Write the summaries

            self._file_writer.add_summary(summaries, global_step=train_step)
            self._file_writer.flush()
            tf.logging.info(final_values)

For some reason, an exception is raised after the first execution of session.run([self._summary_op, self._final_ops_dict, self._eval_ops]); the loop terminates, the next line executed is self._file_writer.add_summary(summaries, global_step=train_step), and the summaries variable is therefore never assigned.
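
The Coordinator message in the logs below points at a Python 3 pitfall: session.run is handed a dict_values view, which TensorFlow cannot convert to a fetch. A standalone illustration (not the sample's code):

import tensorflow as tf

metrics = {'accuracy': tf.constant(0.9), 'auc': tf.constant(0.8)}

with tf.Session() as sess:
  # sess.run(metrics.values())   # fails on Python 3: dict_values is not a valid fetch type
  print(sess.run(list(metrics.values())))  # wrapping in list() restores Python 2 behaviour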
The following are the logs of the execution:

WARNING:tensorflow:Unknown arguments: []
INFO:tensorflow:Created DNN hidden units [256, 64]
WARNING:tensorflow:From C:\Users\antho\Desktop\cloudml-samples-master\census\tensorflowcore\trainer\model.py:146: string_to_index_table_from_tensor (from tensorflow.contrib.lookup.lookup_ops) is deprecated and will be removed after 2017-04-10.
Instructions for updating:
Use `index_table_from_tensor`.
WARNING:tensorflow:From C:\Users\antho\Desktop\cloudml-samples-master\census\tensorflowcore\trainer\model.py:146: string_to_index_table_from_tensor (from tensorflow.contrib.lookup.lookup_ops) is deprecated and will be removed after 2017-04-10.
Instructions for updating:
Use `index_table_from_tensor`.
INFO:tensorflow:Create CheckpointSaverHook.
2017-06-09 06:36:47.299762: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.300229: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.300658: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.301100: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.301545: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.302024: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.302391: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-09 06:36:47.302766: W c:\tf_jenkins\home\workspace\release-win\device\cpu\os\windows\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 0 into output_folder\model.ckpt.
INFO:tensorflow:global_step/sec: 23.84
INFO:tensorflow:global_step/sec: 201.377
INFO:tensorflow:global_step/sec: 204.354
INFO:tensorflow:global_step/sec: 196.517
INFO:tensorflow:global_step/sec: 196.324
INFO:tensorflow:global_step/sec: 192.727
INFO:tensorflow:global_step/sec: 200.057
INFO:tensorflow:global_step/sec: 207.537
INFO:tensorflow:global_step/sec: 194.037
INFO:tensorflow:global_step/sec: 195.747
INFO:tensorflow:Saving checkpoints for 1000 into output_folder\model.ckpt.
INFO:tensorflow:Restoring parameters from output_folder\model.ckpt-1000
INFO:tensorflow:Starting Evaluation For Step: 1000
INFO:tensorflow:Error reported to Coordinator: <class 'TypeError'>, Fetch argument dict_values([<tf.Tensor 'accuracy/update_op:0' shape=() dtype=float32>, <tf.Tensor 'auc/update_op:0' shape=() dtype=float32>]) has invalid type <class 'dict_values'>, must be a string or Tensor. (Can not convert a dict_values into a Tensor or Operation.)
Traceback (most recent call last):
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 485, in <module>
    dispatch(**parse_args.__dict__)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 387, in dispatch
    return run('', True, *args, **kwargs)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 298, in run
    step, _ = session.run([global_step_tensor, train_op])
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\training\monitored_session.py", line 500, in __exit__
    self._close_internal(exception_type)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\training\monitored_session.py", line 532, in _close_internal
    h.end(self._coordinated_creator.tf_sess)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 122, in end
    self._run_eval()
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 151, in _run_eval
    self._file_writer.add_summary(summaries, global_step=train_step)
UnboundLocalError: local variable 'summaries' referenced before assignment

After removing the with coord.stop_on_exception() line, here is the error raised by the sess.run call:

Traceback (most recent call last):
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 267, in __init__
    fetch, allow_tensor=True, allow_operation=True))
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\framework\ops.py", line 2414, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\framework\ops.py", line 2503, in _as_graph_element_locked
    % (type(obj).__name__, types_str))
TypeError: Can not convert a dict_values into a Tensor or Operation.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm 2017.1\helpers\pydev\pydevd.py", line 1585, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files (x86)\JetBrains\PyCharm 2017.1\helpers\pydev\pydevd.py", line 1015, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files (x86)\JetBrains\PyCharm 2017.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 482, in <module>
    dispatch(**parse_args.__dict__)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 385, in dispatch
    return run('', True, *args, **kwargs)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 296, in run
    step, _ = session.run([global_step_tensor, train_op])
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\training\monitored_session.py", line 500, in __exit__
    self._close_internal(exception_type)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\training\monitored_session.py", line 532, in _close_internal
    h.end(self._coordinated_creator.tf_sess)
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 122, in end
    self._run_eval()
  File "C:/Users/antho/Desktop/cloudml-samples-master/census/tensorflowcore/trainer/task.py", line 144, in _run_eval
    summaries, final_values, _ = session.run([self._summary_op, self._final_ops_dict, self._eval_ops])
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 778, in run
    run_metadata_ptr)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 969, in _run
    fetch_handler = _FetchHandler(self._graph, fetches, feed_dict_string)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 408, in __init__
    self._fetch_mapper = _FetchMapper.for_fetch(fetches)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 230, in for_fetch
    return _ListFetchMapper(fetch)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 337, in __init__
    self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 337, in <listcomp>
    self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 238, in for_fetch
    return _ElementFetchMapper(fetches, contraction_fn)
  File "C:\Users\antho\Anaconda3\envs\fyp\lib\site-packages\tensorflow\python\client\session.py", line 271, in __init__
    % (fetch, type(fetch), str(e)))
TypeError: Fetch argument dict_values([<tf.Tensor 'auc/update_op:0' shape=() dtype=float32>, <tf.Tensor 'accuracy/update_op:0' shape=() dtype=float32>]) has invalid type <class 'dict_values'>, must be a string or Tensor. (Can not convert a dict_values into a Tensor or Operation.)
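
For what it's worth, this TypeError is a Python 3 issue: dict.values() returns a dict_values view, which session.run will not accept. A minimal sketch of the distinction, assuming the sample builds its eval ops as a dict (the tensors below are stand-ins, not the sample's actual ops):

import tensorflow as tf

# session.run() accepts tensors, lists, and dicts -- but not a dict_values view.
metrics = {
    'accuracy': tf.constant(0.0),  # stand-ins for the accuracy/auc update ops
    'auc': tf.constant(0.0),
}

with tf.Session() as sess:
    # sess.run(metrics.values())        # fails on Python 3 with this TypeError
    sess.run(list(metrics.values()))    # either materialize the view as a list...
    results = sess.run(metrics)         # ...or pass the dict itself and get a dict back

On Python 2 dict.values() already returns a list, which is presumably why the sample runs there unchanged.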

What version is required to run these examples?

What versions of TensorFlow are these examples written for? If they are not compatible with newer versions of TF, could the README be updated to say so?

I am trying to run:

gcloud ml-engine local train     --module-name trainer.task     --package-path trainer/     --     --train-files $TRAIN_DATA     --eval-files $EVAL_DATA     --train-steps 1000     --job-dir $MODEL_DIR

in census/estimator.

I receive the following error:

Traceback (most recent call last):
  File "/Users/arbolista/anaconda/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/arbolista/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/arbolista/Code/cloudml-samples-master/census/estimator/trainer/task.py", line 4, in <module>
    import model
  File "trainer/model.py", line 40, in <module>
    tf.feature_column.categorical_column_with_vocabulary_list(
AttributeError: 'module' object has no attribute 'feature_column'

System info:

>>> import sys
>>> sys.version
'2.7.10 (default, Feb  7 2017, 00:08:15) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]'
>>> import tensorflow
>>> tensorflow.__version__
'1.1.0'

(I installed tensorflow-gpu)

Mac OSx 10.12.5

Am I missing something, or has feature_column moved to contrib/layers?
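
For reference, tf.feature_column only exists from TensorFlow 1.2 onwards; on 1.1 the closest equivalents live in tf.contrib.layers. A hedged sketch of the two spellings (the vocabulary values here are illustrative):

import tensorflow as tf

# TensorFlow 1.2+ -- what census/estimator/trainer/model.py uses:
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', ['Private', 'Self-emp-not-inc', 'Local-gov'])

# Rough TensorFlow 1.1 equivalent in contrib (not a drop-in replacement for
# everything the sample does):
workclass_contrib = tf.contrib.layers.sparse_column_with_keys(
    'workclass', keys=['Private', 'Self-emp-not-inc', 'Local-gov'])

So the sample as written needs TensorFlow 1.2 or newer rather than a contrib import.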

column_to_dtype

Why is it necessary to return only tf.string when the column is an instance of a class derived from _SparseColumn?

see def column_to_dtype.

It looks like when I introduce layers.sparse_column_with_integerized_feature I get a dtype error, because my integer inputs are being cast to strings.
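
To make the concern concrete, here is an illustrative (not the sample's actual) column_to_dtype that prefers the dtype declared by the column itself, which is what an integerized sparse column would need; the .dtype attribute is an assumption about the contrib sparse column classes:

import tensorflow as tf
from tensorflow.contrib import layers

def column_to_dtype(column):
    """Illustrative only: map a feature column to the dtype of its raw input."""
    # Returning tf.string for every _SparseColumn-derived class casts the raw
    # integer inputs of sparse_column_with_integerized_feature to strings;
    # preferring the column's own declared dtype avoids that.
    return getattr(column, 'dtype', tf.string)

age = layers.sparse_column_with_integerized_feature('age', bucket_size=100)
print(column_to_dtype(age))  # expected tf.int64, assuming the column exposes .dtype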

MNIST hptuning version can't run distributed locally.

I'm having some problems getting hptuning to run successfully. Running locally with --distributed causes an error I haven't quite tracked down. Some elementary googling suggests state might be leaking between checkpoint files, but I've definitely deleted the output directory and can't find anywhere else the distributed workers might be writing checkpoints.

Here's the log:

(test-env) Todds-MacBook-Pro:hptuning todd$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working tree clean
(test-env) Todds-MacBook-Pro:hptuning todd$ git pull
Already up-to-date.
(test-env) Todds-MacBook-Pro:hptuning todd$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working tree clean
(test-env) Todds-MacBook-Pro:hptuning todd$ # Clear the output from any previous local run.
(test-env) Todds-MacBook-Pro:hptuning todd$ rm -rf output/
(test-env) Todds-MacBook-Pro:hptuning todd$ # Train locally.
(test-env) Todds-MacBook-Pro:hptuning todd$ gcloud beta ml local train \
>   --package-path=trainer \
>   --module-name=trainer.task \
>   --distributed \
>   -- \
>   --train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz \
>   --eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz \
>   --output_path=output
INFO:root:Original job data: {u'args': [u'--train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz', u'--eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz', u'--output_path=output'], u'job_name': u'trainer.task'}
INFO:root:Original job data: {u'args': [u'--train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz', u'--eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz', u'--output_path=output'], u'job_name': u'trainer.task'}
INFO:root:setting eval batch size to 100
INFO:root:setting eval batch size to 100
INFO:root:Starting parameter server 0
INFO:root:Starting worker/0
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job master -> {0 -> localhost:27182}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:27183, 1 -> localhost:27184}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:27185, 1 -> localhost:27186}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job master -> {0 -> localhost:27182}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:27183, 1 -> localhost:27184}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:27185, 1 -> localhost:27186}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:27183
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:27185
INFO:root:Original job data: {u'args': [u'--train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz', u'--eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz', u'--output_path=output'], u'job_name': u'trainer.task'}
INFO:root:setting eval batch size to 100
INFO:root:Starting worker/1
INFO:root:Original job data: {u'args': [u'--train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz', u'--eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz', u'--output_path=output'], u'job_name': u'trainer.task'}
INFO:root:setting eval batch size to 100
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job master -> {0 -> localhost:27182}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:27183, 1 -> localhost:27184}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:27185, 1 -> localhost:27186}
INFO:root:Starting master/0
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:27186
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job master -> {0 -> localhost:27182}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:27183, 1 -> localhost:27184}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:27185, 1 -> localhost:27186}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:27182
INFO:root:Original job data: {u'args': [u'--train_data_paths=gs://cloud-ml-data/mnist/train.tfr.gz', u'--eval_data_paths=gs://cloud-ml-data/mnist/eval.tfr.gz', u'--output_path=output'], u'job_name': u'trainer.task'}
INFO:root:setting eval batch size to 100
INFO:root:Starting parameter server 1
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job master -> {0 -> localhost:27182}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:27183, 1 -> localhost:27184}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:27185, 1 -> localhost:27186}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:27184
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py:210 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge_all.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py:264 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.merge.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
WARNING:tensorflow:From /Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session f9163a4c7e96e286 with config: 
device_filters: "/job:ps"
device_filters: "/job:worker/task:0"

I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 8ca80d82bc56a2d7 with config: 
device_filters: "/job:ps"
device_filters: "/job:worker/task:1"

INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: fully_connected/biases, fully_connected_1/biases, fully_connected_2/biases, Variable, Variable_2
INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: fully_connected/biases, fully_connected_1/biases, fully_connected_2/biases, Variable, Variable_2
INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: fully_connected/biases, fully_connected_1/biases, fully_connected_2/biases, Variable, Variable_2
INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: fully_connected/biases, fully_connected_1/biases, fully_connected_2/biases, Variable, Variable_2
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session c74f300d20e21df5 with config: 
device_filters: "/job:ps"
device_filters: "/job:master/task:0"

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

Caused by op u'fully_connected_2/weights/Assign', defined at:
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 559, in <module>
    tf.app.run()
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 322, in main
    run(model, argv)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 453, in run
    dispatch(args, model, cluster, task)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 494, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 193, in run_training
    self.args.batch_size)
  File "trainer/model.py", line 133, in build_train_graph
    return self.build_graph(data_paths, batch_size, is_training=True)
  File "trainer/model.py", line 99, in build_graph
    logits = inference(parsed['images'], self.hidden1, self.hidden2)
  File "trainer/model.py", line 228, in inference
    return layers.fully_connected(hidden2, NUM_CLASSES, activation_fn=None)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1346, in fully_connected
    trainable=trainable)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
    caching_device=caching_device, device=device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
    caching_device=caching_device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
    validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
    expected_shape=shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
    expected_shape=expected_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 360, in _init_from_args
    validate_shape=validate_shape).op
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

Caused by op u'fully_connected_2/weights/Assign', defined at:
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 559, in <module>
    tf.app.run()
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 322, in main
    run(model, argv)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 453, in run
    dispatch(args, model, cluster, task)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 494, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 193, in run_training
    self.args.batch_size)
  File "trainer/model.py", line 133, in build_train_graph
    return self.build_graph(data_paths, batch_size, is_training=True)
  File "trainer/model.py", line 99, in build_graph
    logits = inference(parsed['images'], self.hidden1, self.hidden2)
  File "trainer/model.py", line 228, in inference
    return layers.fully_connected(hidden2, NUM_CLASSES, activation_fn=None)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1346, in fully_connected
    trainable=trainable)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
    caching_device=caching_device, device=device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
    caching_device=caching_device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
    validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
    expected_shape=shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
    expected_shape=expected_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 360, in _init_from_args
    validate_shape=validate_shape).op
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

Traceback (most recent call last):
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 559, in <module>
    tf.app.run()
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 322, in main
    run(model, argv)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 453, in run
    dispatch(args, model, cluster, task)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 494, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 232, in run_training
    with self.sv.managed_session(target, config=config) as session:
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 802, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
    start_standard_services=start_standard_services)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 720, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 233, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

Caused by op u'fully_connected_2/weights/Assign', defined at:
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 559, in <module>
    tf.app.run()
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 322, in main
    run(model, argv)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 453, in run
    dispatch(args, model, cluster, task)
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 494, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/Users/todd/git/cloudml-samples/mnist/hptuning/trainer/task.py", line 193, in run_training
    self.args.batch_size)
  File "trainer/model.py", line 133, in build_train_graph
    return self.build_graph(data_paths, batch_size, is_training=True)
  File "trainer/model.py", line 99, in build_graph
    logits = inference(parsed['images'], self.hidden1, self.hidden2)
  File "trainer/model.py", line 228, in inference
    return layers.fully_connected(hidden2, NUM_CLASSES, activation_fn=None)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1346, in fully_connected
    trainable=trainable)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
    caching_device=caching_device, device=device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
    caching_device=caching_device)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
    custom_getter=custom_getter)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
    validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
    expected_shape=shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
    expected_shape=expected_shape)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 360, in _init_from_args
    validate_shape=validate_shape).op
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/todd/miniconda2/envs/test-env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [8,34] rhs shape= [32,10]
	 [[Node: fully_connected_2/weights/Assign = Assign[T=DT_FLOAT, _class=["loc:@fully_connected_2/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](fully_connected_2/weights, fully_connected_2/weights/Initializer/random_uniform)]]
	 [[Node: init/NoOp_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-7490249123101218299, tensor_name="edge_57_init/NoOp", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]

check_environment.py WARNING: Couldn't find python-snappy.

I had to perform the following commands to get python-snappy installed.

sudo apt-get install libsnappy-dev
sudo pip install python-snappy

Cloud Shell appears to be missing both the system library and the Python package. It might be helpful to include a note in the instructions.
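
A quick way to confirm the package is importable after those commands (assuming python-snappy is what check_environment.py probes for):

try:
    import snappy  # provided by python-snappy, which needs libsnappy-dev to build
    print('python-snappy OK:', snappy.compress(b'hello') is not None)
except ImportError:
    print('python-snappy missing: install libsnappy-dev, then pip install python-snappy')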

tf.Transform and Google DataFlow Templates Integration

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to tell tensorflow_transform.beam.impl's AnalyzeAndTransformDataset which PipelineRunner to use, as follows:

import apache_beam as beam
from tensorflow_transform.beam import impl as tft

pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)

TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview

We could leverage this to dynamically generate a tf.Transform CsvCoder.

The steps of using DataFlow Templates are as follows:

  • A) in PipelineOptions derived types, change option types to ValueProvider
  • B) change runner to TemplatingDataflowPipelineRunner
  • C) mvn archetype:generate to store template in GCS (python way: a yaml file like TF Hypertune ???)
  • D) gcloud beta dataflow jobs run --gcs-location --parameters

For (A), we could define UserOptions subclassed from PipelineOptions and use the add_value_provider_argument API to add specific arguments to be parameterized:

from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        parser.add_argument('--non_value_provider_arg', default='some_other_value')

The question is: can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner (B & C)?

Looking at the Java TemplatingDataflowPipelineRunner class, it encapsulates DataflowPipelineRunner. How can we create a custom Python runner that wraps the Apache Beam Python API class DataflowRunner and provides the functionality of the Java TemplatingDataflowPipelineRunner?
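
As far as I can tell, the Python SDK has no TemplatingDataflowPipelineRunner class; templates appear to be staged by running the ordinary DataflowRunner with a --template_location pipeline option. A hedged sketch along those lines (project, bucket, and paths are placeholders, and how much of tf.Transform can be parameterized this way is an open question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):  # same idea as UserOptions above
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input_path', default='gs://my-bucket/input.csv')

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                       # placeholder project
    '--staging_location=gs://my-bucket/staging',  # placeholder bucket
    '--temp_location=gs://my-bucket/tmp',
    # Instead of launching a job, stage a reusable template at this path:
    '--template_location=gs://my-bucket/templates/preprocess',
])

p = beam.Pipeline(options=options)
opts = options.view_as(TemplateOptions)
# ValueProvider-aware sources such as ReadFromText defer reading the value until
# the template is executed with gcloud beta dataflow jobs run.
_ = p | 'ReadInput' >> beam.io.ReadFromText(opts.input_path)
# tft.AnalyzeAndTransformDataset would be applied here; note that its analyzers run
# at template construction time, which limits what can be turned into parameters.
p.run()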

Reddit_TFT Training error

I am trying to train the Reddit dataset with the given trainer, but it fails with an error partway through. I also had trouble with preprocessing; it turns out the code needs tensorflow-transform version 0.1.9 on both the local system and the cloud. I am guessing the training issue is caused by the wrong TensorFlow version (I am using TensorFlow 1.2). Which version should I be using to make the example run? It would also be nice if you uploaded a requirements.txt file.

""Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 280, in main() File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 276, in main output_dir=output_dir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run return task() File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train monitors=self._train_monitors + extra_hooks) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func return func(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit loss = self._train_model(input_fn=input_fn, hooks=hooks) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 984, in _train_model _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss]) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 462, in run run_metadata=run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 786, in run run_metadata=run_metadata) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 744, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 899, in run run_metadata=run_metadata)) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 465, in after_run raise NanLossDuringTrainingError NanLossDuringTrainingError: NaN loss during training.

fail to run reddit_tft example preprocess part on google cloud

python preprocess.py --training_data fh-bigquery.reddit_comments.2015_12 \
                     --eval_data fh-bigquery.reddit_comments.2016_01 \
                     --predict_data fh-bigquery.reddit_comments.2016_02 \
                     --output_dir $GCS_PATH/preproc \
                     --project_id $PROJECT \
                     --cloud
No handlers could be found for logger "oauth2client.contrib.multistore_file"
Traceback (most recent call last):
  File "preprocess.py", line 258, in <module>
    main()
  File "preprocess.py", line 254, in main
    frequency_threshold=args.frequency_threshold)
  File "preprocess.py", line 149, in preprocess
    pipeline=pipeline))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/ptransform.py", line 709, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/ptransform.py", line 388, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 229, in apply
    return self.apply(transform, pvalueish)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 265, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.py", line 150, in apply
    return m(transform, input)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.py", line 156, in apply_PTransform
    return transform.expand(input)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/tft_beam_io/beam_metadata_io.py", line 57, in expand
    metadata_io.write_metadata(metadata, self._path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/tf_metadata/metadata_io.py", line 56, in write_metadata
    version.write(metadata, vdir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/tf_metadata/version_api.py", line 88, in write
    vdir.create()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/tf_metadata/metadata_directory.py", line 57, in create
    tf.gfile.MakeDirs(self._basepath)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'object' must be a non-empty string.
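
The failing call is tf.gfile.MakeDirs on the metadata output path, and "'object' must be a non-empty string" means that path came out empty, which can happen when $GCS_PATH (or $PROJECT) is not actually set when the command is expanded. A small hedged pre-flight check, with variable names of my own choosing:

import os
import posixpath

gcs_path = os.environ.get('GCS_PATH', '')
project = os.environ.get('PROJECT', '')

if not gcs_path.startswith('gs://'):
    raise SystemExit('GCS_PATH must be a gs:// path, got: %r' % gcs_path)
if not project:
    raise SystemExit('PROJECT is not set')

print('Preprocessing output will go to', posixpath.join(gcs_path, 'preproc'))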

Deploying just the pre-trained checkpoint files to CloudML

Hey!

I'm trying to deploy just the model checkpoint files to Cloud ML. I have the .ckpt file, the .meta file, and the labelmap.txt file for an Inception network trained on the Open Images dataset, and I'm looking for a way to deploy just those. I tried to adapt the Prediction section of the Flowers post, but I'm running into errors.

Any pointers/posts that talk about deploying existing checkpoint files?

Thanks!
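
Cloud ML Engine serves a SavedModel with a serving signature rather than raw .ckpt/.meta files, so the usual route is to import the graph, restore the checkpoint, and re-export before deploying. A hedged TF 1.x sketch (the tensor names and paths are placeholders you would replace with the ones in your Inception graph):

import tensorflow as tf

CHECKPOINT = 'model.ckpt'        # existing checkpoint prefix (placeholder)
EXPORT_DIR = 'exported_model/1'  # directory to copy to GCS and deploy (placeholder)

with tf.Session(graph=tf.Graph()) as sess:
    # Rebuild the graph from the .meta file and restore the trained weights.
    saver = tf.train.import_meta_graph(CHECKPOINT + '.meta')
    saver.restore(sess, CHECKPOINT)

    # Look these tensor names up in your own graph; they are placeholders here.
    inputs = sess.graph.get_tensor_by_name('input_image:0')
    scores = sess.graph.get_tensor_by_name('softmax:0')

    builder = tf.saved_model.builder.SavedModelBuilder(EXPORT_DIR)
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'image': inputs}, outputs={'scores': scores})
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature,
        })
    builder.save()

The exported directory is what gets uploaded to GCS and pointed at when creating the model version.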

Achieving 100% accuracy with Hypertune

I started using Hypertune on a dataset of mine, with almost the same code as the samples from the documentation.
The distributed version works very well for me, but with Hypertune I get an objective value (set to accuracy, like the MNIST sample) of 1.0 = 100%, which is quite surprising. Note that I use the same config file as in the example, and I get such a high accuracy for high learning rates, close to 0.5.
I thought the error was on my side, but it turns out the same problem occurs with the MNIST example. In the docs, here, 100% accuracy is also achieved with a very simple network, which is very unlikely.

...
state: SUCCEEDED
...
trainingOutput:
  completedTrialCount: '10'
  trials:
  - finalMetric:
      objectiveValue: 1.0
      trainingStep: '5006'
    hyperparameters:
      hidden1: '339'
      hidden2: '30'
      learning_rate: '0.49576010451421226'
    trialId: '4'
  - finalMetric:
      objectiveValue: 1.0
      trainingStep: '5009'
    hyperparameters:
      hidden1: '392'
      hidden2: '248'
      learning_rate: '0.49432185726663225'
    trialId: '5'

Also, it might not be related at all, but I noticed a big discrepancy between metrics on the training and eval sets. You can find the corresponding Stack Overflow question here.

Thanks

EDIT: The stack overflow question is actually not related, I think.
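
For what it's worth, the value the tuning service optimizes is just the scalar summary whose tag matches hyperparameterMetricTag in the config, so if that summary is computed on a single small eval batch it can saturate at 1.0 even when accuracy over the full eval set is lower. I can't confirm that is what happens in the MNIST sample, but it seems worth ruling out; an illustrative way to report the objective from the whole eval set:

import tensorflow as tf

# Illustrative only: average accuracy over every eval batch, then write the mean
# as the summary the tuning service reads (the tag must match hyperparameterMetricTag).
accuracy_ph = tf.placeholder(tf.float32, shape=[])
summary_op = tf.summary.scalar('accuracy', accuracy_ph)

def report_objective(sess, writer, batch_accuracies, global_step):
    mean_accuracy = sum(batch_accuracies) / float(len(batch_accuracies))
    writer.add_summary(sess.run(summary_op, {accuracy_ph: mean_accuracy}), global_step)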

(Update canned estimator to 1.3) Monitors are deprecated. Please use tf.train.SessionRunHook.

For problems running the sample code please provide the following information.

System information

  • Windows 10
  • Tensorflow Version 1.2.0 (GPU)
  • Python Version 3.6

Describe the problem

I am trying to run the code locally in a Jupyter notebook and try some different models (i.e., only changing the build estimator to a DNNClassifier, for example), and I am getting the warning "Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\learn\python\learn\monitors.py:268: BaseMonitor.__init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook."

which suggests that something in the code still relies on Monitors instead of tf.train.SessionRunHook. I don't see any explicit use of Monitors, and I can't find where monitors are used by default, so I have not been able to work out where I need to use tf.train.SessionRunHook. Can anybody help identify what modifications I would need to make to get past this warning, or can it safely be ignored?

Thanks for the help!
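
The warning appears to come from tf.contrib.learn wrapping its evaluation logic in monitors internally, so it can generally be ignored; but if you do want to attach your own logic, the replacement API is tf.train.SessionRunHook. A minimal illustrative hook (not something the sample itself requires):

import tensorflow as tf

class GlobalStepLogger(tf.train.SessionRunHook):
    """Logs the global step around each training run call."""

    def begin(self):
        self._global_step = tf.train.get_global_step()

    def before_run(self, run_context):
        # Ask the session to also fetch the global step on this run.
        return tf.train.SessionRunArgs(self._global_step)

    def after_run(self, run_context, run_values):
        tf.logging.info('global step: %s', run_values.results)

# contrib.learn estimators accept hooks through fit(..., monitors=[GlobalStepLogger()]);
# the newer tf.estimator API takes them via train(..., hooks=[GlobalStepLogger()]).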

Can't train locally

I visited the website and tried to follow the instructions to start training locally.

But I got the error "Invalid choice: 'local'".

Am I doing anything wrong?

$ gcloud beta ml local train --package-path=trainer --module-name=trainer.task
Usage: gcloud beta ml [optional flags] <group>
  group may be           jobs | models

*(BETA)* Cloud ML command groups.

global flags:
  Run `gcloud -h` for a description of flags available to all commands.

command groups:
  jobs                   *(BETA)* Cloud ML Jobs commands.
  models                 *(BETA)* Cloud ML Models commands.


For more detailed information on this command and its flags, run:
  gcloud beta ml --help

ERROR: (gcloud.beta.ml) Invalid choice: 'local'.

Valid choices are [jobs, models].
