intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

Home Page: https://analytics-zoo.readthedocs.io/

License: Apache License 2.0

Jupyter Notebook 74.47% Shell 1.26% Scala 10.86% Java 0.64% Python 12.45% Dockerfile 0.22% Makefile 0.04% RobotFramework 0.05% PureBasic 0.01% Groovy 0.01%

analytics-zoo's Issues

Add more metrics

We only support "accuracy" for now, and the error message should also be enriched. A possible extension is sketched below the current code:

def to_bigdl_metrics(metrics):
    metrics = to_list(metrics)
    bmetrics = []
    for metric in metrics:
        if metric.lower() == "accuracy":
            bmetrics.append(Top1Accuracy())
        else:
            raise TypeError("Unsupported metrics: %s" % metric)
    return bmetrics
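
A hedged sketch of one possible extension follows. Top1Accuracy, Top5Accuracy and Loss exist in bigdl.optim.optimizer; the extra metric names accepted and the exact error wording are assumptions about what we might want.

from bigdl.optim.optimizer import Top1Accuracy, Top5Accuracy, Loss
from bigdl.util.common import to_list


def to_bigdl_metrics(metrics):
    # Sketch only: map a few more metric names and report the supported set on error.
    metrics = to_list(metrics)
    supported = {
        "accuracy": Top1Accuracy,
        "top1accuracy": Top1Accuracy,
        "top5accuracy": Top5Accuracy,
        "loss": Loss,
    }
    bmetrics = []
    for metric in metrics:
        name = metric.lower()
        if name not in supported:
            raise TypeError("Unsupported metric: %s. Supported metrics are: %s"
                            % (metric, ", ".join(sorted(supported))))
        bmetrics.append(supported[name]())
    return bmetrics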

anomaly detection notebook problems

Some issues with the anomaly detection notebook:

  1. The README says: export ZOO_HOME=the root directory of the Analytics Zoo project. But jupyter-with-zoo.sh needs ZOO_HOME to be the dist (build) directory. We need to clarify this environment variable. I suggest setting another variable, like ZOO_SOURCE, to the root directory of the Analytics Zoo project, and keeping ZOO_HOME pointing to the build. The notebook code also uses ZOO_HOME as the root directory of the Analytics Zoo project.

  2. jupyter-with-zoo.sh looks for the zoo jar and python zip using the pattern "zooxxx"; this needs to change to "analytics-zooxxx". Also, the --allow-root option does not work with my Jupyter.

  3. When running the notebook, I hit an issue in cell 8 with the command "df['hours'] = df['datetime'].dt.hour" (a possible fix is sketched after the traceback below):


AttributeError Traceback (most recent call last)
in ()
1 # the hours and if it's night or day (7:00-22:00)
----> 2 df['hours'] = df['datetime'].dt.hour
3 df['daylight'] = ((df['hours'] >= 7) & (df['hours'] <= 22)).astype(int)

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in __getattr__(self, name)
1813 return self[name]
1814 raise AttributeError("'%s' object has no attribute '%s'" %
-> 1815 (type(self).__name__, name))
1816
1817 def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'dt'

/usr/lib/python2.7/dist-packages/simplejson/encoder.py:262: DeprecationWarning: Interpreting naive datetime as local 2018-05-09 13:16:01.759843. Please add timezone info to timestamps.
chunks = self.iterencode(o, _one_shot=True)
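
The AttributeError above usually means either that the installed pandas predates the Series.dt accessor (added in pandas 0.15) or that the datetime column was read as plain strings. A minimal sketch of a version-tolerant fix, with made-up sample values (only the column names follow the notebook):

import pandas as pd

# Made-up sample rows just to illustrate the conversion.
df = pd.DataFrame({"datetime": ["2014-07-01 00:00:00", "2014-07-01 23:30:00"],
                   "value": [10844, 8127]})

# Make sure the column really holds datetimes, then derive the hour; the apply()
# form works even on a pandas version without the .dt accessor.
df["datetime"] = pd.to_datetime(df["datetime"])
df["hours"] = df["datetime"].apply(lambda ts: ts.hour)
df["daylight"] = ((df["hours"] >= 7) & (df["hours"] <= 22)).astype(int)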

Module size will be larger than 1 when loading the TextClassifier model

When loading a TextClassifier model locally and doing text prediction, usually we just call loadModel and predict like this:
val textClassificationModel = TextClassifier.loadModel[Float]("file:///home/yidiyang/workspace/model/text.bigdl")
val results = textClassificationModel.predict(sampleRDD).collect()(0)

However, it throws an error: module size should be 1 instead of 2

Only when adding this code before loading does it work (and the module size is 1):
val model = TextClassifier(classNum, tokenLength, sequenceLength, param.encoder, param.encoderOutputDim)

ResNet pre-processing consistency

zoo.keras.LSTM extend from zoo.keras.Recurrent?

It looks like com.intel.analytics.zoo.pipeline.api.keras.layers.Recurrent is not used anywhere, and zoo.keras.layers.LSTM extends bigdl.nn.keras.Recurrent instead. If I update LSTM to extend zoo.keras.Recurrent, there is an exception:

Caused by: com.intel.analytics.bigdl.nn.abstractnn.InvalidLayer: Do not mix Sequential6ed52d6b with Layer

                       (isKerasStyle=false):
     InternalRecurrent[f454c033]ArrayBuffer(TimeDistributed[5379272d]Linear[53d69303](12 -> 128), LSTM(12, 32, 0.0))
at com.intel.analytics.bigdl.nn.abstractnn.InferShape$class.excludeInvalidLayers(InferShape.scala:98)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.excludeInvalidLayers(AbstractModule.scala:58)
at com.intel.analytics.bigdl.nn.abstractnn.InferShape$class.validateInput(InferShape.scala:108)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.validateInput(AbstractModule.scala:58)
at com.intel.analytics.bigdl.nn.keras.Sequential.add(Topology.scala:299)
at com.intel.analytics.zoo.pipeline.api.keras.layers.Recurrent.doBuild(Recurrent.scala:41)
at com.intel.analytics.bigdl.nn.keras.KerasLayer.build(KerasLayer.scala:225)

And I agree we do need a zoo.keras.Recurrent that extends bigdl.keras.Recurrent so we can add more functions.

Error when calling fit on a virtual machine

When running optimizer.optimize() in a virtual machine, it throws the following error:

Py4JJavaError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic(u'time', u'', u'# Boot training process\nlenet_model.fit(x=train_data,\n batch_size=2048,\n nb_epoch=20,\n validation_data=test_data)')

/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119

in time(self, line, cell, local_ns)

/usr/local/lib/python2.7/dist-packages/IPython/core/magic.pyc in (f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/usr/local/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1187 if mode=='eval':
1188 st = clock2()
-> 1189 out = eval(code, glob, local_ns)
1190 end = clock2()
1191 else:

in ()

/tmp/zoo/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/keras/engine/topology.py in fit(self, x, y, batch_size, nb_epoch, validation_data, distributed)
161 batch_size,
162 nb_epoch,
--> 163 validation_data)
164 else:
165 if validation_data:

/tmp/zoo/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in callBigDlFunc(bigdl_type, name, *args)
586 error = e
587 if "does not exist" not in str(e):
--> 588 raise e
589 else:
590 return result

Py4JJavaError: An error occurred while calling o25.zooFit.
: java.lang.ExceptionInInitializerError
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:893)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:204)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:220)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1237)
at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:151)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$.<init>(AllReduceParameter.scala:47)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$.<clinit>(AllReduceParameter.scala)
... 15 more

java.lang.IllegalArgumentException in image classification

We met errors when using examples/imageclassification/Predict.scala to predict Inception v1 with the ImageNet validation set. It reported java.lang.IllegalArgumentException for 10k images and java.lang.ArrayIndexOutOfBoundsException for 5k images; predicting 1000 images passes.

Execution script:

#!/bin/sh
master="local[28]"
modelPath=/mnt/disk1/analytics-zoo-dataset/imageclassification/analytics-zoo_inception-v1_imagenet_0.1.0
imagePath=/mnt/disk1/analytics-zoo-dataset/imageclassification/imagenet/
ZOO_HOME=/root/analytics-zoo
ZOO_JAR_PATH=${ZOO_HOME}/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-jar-with-dependencies.jar
spark-submit \
--verbose \
--master $master \
--conf spark.executor.cores=28 \
--conf spark.driver.maxResultSize=6g \
--total-executor-cores 28 \
--driver-memory 200g \
--executor-memory 40g \
--class com.intel.analytics.zoo.examples.imageclassification.Predict \
${ZOO_JAR_PATH} -f $imagePath --model $modelPath --partition 28 --topN 5

error when predicting 10000 images:

2018-05-24 15:07:37 INFO  ThreadPool$:79 - Set mkl threads to 1 on thread 1
2018-05-24 15:07:39 INFO  Engine$:103 - Auto detect executor number and executor cores number
2018-05-24 15:07:39 INFO  Engine$:105 - Executor number is 1 and executor cores number is 28
2018-05-24 15:07:39 INFO  Engine$:373 - Find existing spark context. Checking the spark conf...
[Stage 0:===============>                                          (3 + 8) / 11]2018-05-24 15:10:57 ERROR Executor:91 - Exception in task 3.0 in stage 0.0 (TID 3)
Layer info: ImageClassifier[analytics-zoo_inception-v1_imagenet_0.1.0]/SpatialConvolution[conv1/7x7_s2](3 -> 64, 7 x 7, 2, 2, 3, 3)
java.lang.IllegalArgumentException: requirement failed: input channel size 2 is not the same as nInputPlane 3
        at scala.Predef$.require(Predef.scala:224)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:262)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:54)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
        at com.intel.analytics.bigdl.nn.StaticGraph.updateOutput(StaticGraph.scala:59)
        at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
        at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
        at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:67)
        at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:66)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:800)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at com.intel.analytics.bigdl.optim.Predictor$.predictImageBatch(Predictor.scala:48)

error when predicting 5000 images:

[Stage 0:>                                                          (0 + 4) / 5]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy$mcF$sp(TensorNumeric.scala:721)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:715)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:503)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copy(MiniBatch.scala:460)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copyWithPadding(MiniBatch.scala:380)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:209)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:111)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:348)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)

Zoo keras doesn't support train_on_batch

When implementing CycleGAN in Keras, I need to train the generator and discriminator alternately, which requires training on a per-batch/per-iteration basis. However, we only support model.fit() now.
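
For reference, a minimal sketch of the alternating per-batch updates this request needs, written against plain Keras 1.2.2; the tiny Dense models are only stand-ins, not the actual CycleGAN networks:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Stand-in "discriminator" and "generator"; shapes are arbitrary.
discriminator = Sequential()
discriminator.add(Dense(1, input_dim=4, activation="sigmoid"))
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

generator = Sequential()
generator.add(Dense(4, input_dim=8))

gan = Sequential()
gan.add(generator)
discriminator.trainable = False   # freeze D while training G through the stacked model
gan.add(discriminator)
gan.compile(optimizer="adam", loss="binary_crossentropy")

for step in range(10):
    noise = np.random.rand(32, 8).astype("float32")
    real = np.random.rand(32, 4).astype("float32")
    fake = generator.predict(noise)
    # one discriminator update on this batch ...
    d_loss = discriminator.train_on_batch(
        np.vstack([real, fake]),
        np.vstack([np.ones((32, 1)), np.zeros((32, 1))]))
    # ... then one generator update in the same iteration
    g_loss = gan.train_on_batch(noise, np.ones((32, 1)))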

error in AbstractInferenceModel.java

I get an error in AbstractInferenceModel here

modelQueue = new LinkedBlockingQueue<>(supportedConcurrentNum);
error: Diamond types are not supported at language level '5'

But my language level has been set to 8 in the project structure settings.

Can not load Keras model issue.

I used the model=Net.load_keras("file:///"+BASE_PATH+"/cpu_300000_arda_model.json") API to load my model trained with Keras 1.2.2 on CPU, and it throws this error:

Traceback (most recent call last):
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 225, in
main()
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 217, in main
playGame(args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 211, in playGame
trainNetwork(model,args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 102, in trainNetwork
model=Net.load_keras("file:///"+BASE_PATH+"/gpu_keras1_2_model.json","file:///"+BASE_PATH+"/model_gpu_1540000.h5")
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/net.py", line 178, in load_keras
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 791, in load_keras
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 59, in load_weights_from_json_hdf5
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 368, in from_json_path
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 372, in from_json_str
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 213, in model_from_json
return layer_from_config(config, custom_objects=custom_objects)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/layer_utils.py", line 27, in layer_from_config
class_name = config['class_name']
TypeError: string indices must be integers

The same thing happens when I load the model trained with Keras 1.2.2 on GPU.
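
One thing worth checking, given that model_from_json fails with "string indices must be integers": the first argument to Net.load_keras should be the architecture JSON produced by model.to_json(), with the weights saved separately to HDF5. A minimal sketch of that export/load flow (Keras 1.2.2 assumed; the model and paths are placeholders):

from keras.models import Sequential
from keras.layers import Dense
from zoo.pipeline.api.net import Net

# Stand-in Keras 1.2.2 model.
model = Sequential()
model.add(Dense(2, input_dim=4))
model.compile(optimizer="adam", loss="mse")

json_path = "/tmp/arda_model.json"   # placeholder paths for illustration
weights_path = "/tmp/arda_model.h5"

with open(json_path, "w") as f:
    f.write(model.to_json())         # architecture only
model.save_weights(weights_path)     # weights only

# Load both files together, as in the traceback above.
zoo_model = Net.load_keras("file://" + json_path, "file://" + weights_path)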

createZooModel failure on python 3

Code like this works well on Python 2.7 but fails on Python 3 (a possible workaround is sketched after the traceback below):
wide_n_deep = WideAndDeep(5, column_info, "wide_n_deep")

creating: createZooWideAndDeep
Traceback (most recent call last):
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/apps/recommendation/wide_n_deep.py", line 142, in
wide_n_deep = WideAndDeep(5, column_info, "wide_n_deep")
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/models/recommendation/wide_and_deep.py", line 118, in init
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 667, in init
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 130, in init
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 588, in callBigDlFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 584, in callBigDlFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 629, in callJavaFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 629, in
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 656, in _py2java
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 656, in
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 671, in _py2java
File "/opt/work/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/opt/work/spark-2.1.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/work/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.bigdl.api.python.BigDLSerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.bigdl.api.python.BigDLSerDeBase.loads(BigDLSerde.scala:57)
at org.apache.spark.bigdl.api.python.BigDLSerDe.loads(BigDLSerde.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
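
A hedged workaround sketch: the pickle error above ("expected zero arguments for construction of ClassDict (for numpy.dtype)") typically shows up when numpy scalar values end up in the arguments serialized to the JVM, which seems easier to hit on Python 3. Converting such values to built-in ints before building column_info may avoid it; the helper below is hypothetical, not part of the zoo API.

import numpy as np

def to_plain_ints(values):
    """Convert any numpy integer scalars in a list to built-in Python ints."""
    return [int(v) if isinstance(v, np.integer) else v for v in values]

# e.g. dims computed with numpy would be converted before they reach py4j
print(to_plain_ints([np.int64(100), 200, np.int32(7)]))   # [100, 200, 7]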

textclassification example test failed

  1. If using --master=local[*] as required in the README:
    Error: Caused by: java.lang.IllegalArgumentException: requirement failed: total batch size: 128 should be divided by total core number: 28

  2. If using a working command, e.g. --master=local[16], nothing is output while running.

Script:
export ANALYTICS_ZOO_JAR=${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-0.1.0-SNAPSHOT-jar-with-dependencies.jar
export BASE_DIR=/home/sangtian/zoo_test/textclassification/

spark-submit \
  --master=local[16] \
  --driver-memory 20g \
  --executor-memory 20g \
  --class com.intel.analytics.zoo.examples.textclassification.TextClassification \
  ${ANALYTICS_ZOO_JAR} \
  --baseDir ${BASE_DIR}

Exception while using the pip package generated by release.sh

>>> import zoo
2018-05-14 14:00:00 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-05-14 14:00:00 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-05-14 14:00:00 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/zoo/__init__.py", line 25, in <module>
    check_version()
  File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 46, in check_version
    _check_spark_version(sc, report_warn)
  File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 58, in _check_spark_version
    version_info = _get_bigdl_verion_conf()
  File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 105, in _get_bigdl_verion_conf
    " is located in zoo/target/extra-resources")
RuntimeError: Error while locating file zoo-version-info.properties, please make sure the mvn generate-resources phase is executed and a zoo-version-info.properties file is located in zoo/target/extra-resources

0-1 Normalize Image Pixels in ChainedPreprocessing

Hi,

Is there any way to normalize image pixel values to between 0-1 in ChainedPreprocessing? I'm seeing an ImageChannelNormalize() function, but that seems to be used primarily for zero-centering each channel, rather than for 0-1 normalizing. E.g., if all the pixel values are between 0-255, is it possible to simply plug in a lambda into the ChainedPreprocessing constructor to divide each pixel value by 255? Or would this require defining a custom Preprocessing step?
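
A possible answer, sketched under the assumption that ImageChannelNormalize takes per-channel means followed by per-channel stds (the ChannelNormalize.apply signature quoted further down this page suggests exactly that): passing mean 0 and std 255 turns (pixel - mean) / std into pixel / 255. The import paths and surrounding steps are assumptions, not a confirmed recipe.

from zoo.feature.common import ChainedPreprocessing
from zoo.feature.image import ImageResize, ImageChannelNormalize, ImageMatToTensor

# mean 0, std 255 per channel scales every pixel into [0, 1]
transformer = ChainedPreprocessing([
    ImageResize(256, 256),
    ImageChannelNormalize(0.0, 0.0, 0.0, 255.0, 255.0, 255.0),
    ImageMatToTensor()
])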

Failed to call "init_engine()" when migrating the object detection example

From zoo created by changlinzhang : intel-analytics/zoo#187

When I migrate the object detection example from the BigDL API to the Zoo API, the call to "init_engine()" in the code fails.

from bigdl.util.common import *
...
JavaCreator.set_creator_class("com.intel.analytics.zoo.models.pythonapi.PythonModels")
init_engine()

using
${SPARK_HOME}/bin/pyspark --properties-file ${BIGDL_CONF} --py-files ${ZOO_PY_ZIP} --jars ${ZOO_JAR} ...
The error message is as follows:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-eda297cc30af> in <module>()
     26 
     27 JavaCreator.set_creator_class("com.intel.analytics.zoo.models.pythonapi.PythonModels")
---> 28 init_engine()

/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in init_engine(bigdl_type)
    415 
    416 def init_engine(bigdl_type="float"):
--> 417     callBigDlFunc(bigdl_type, "initEngine")
    418     # Spark context is supposed to have been created when init_engine is called
    419     get_spark_context()._jvm.org.apache.spark.bigdl.api.python.BigDLSerDe.initialize()

/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in callBigDlFunc(bigdl_type, name, *args)
    577     gateway = _get_gateway()
    578     error = Exception("Cannot find function: %s" % name)
--> 579     for jinvoker in JavaCreator.instance(bigdl_type, gateway).value:
    580         # hasattr(jinvoker, name) always return true here,
    581         # so you need to invoke the method to check if it exist or not

/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in instance(cls, bigdl_type, *args)
     54             with cls._lock:
     55                 if not cls._instance:
---> 56                     cls._instance = cls(bigdl_type, *args)
     57         return cls._instance
     58 

/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in __init__(self, bigdl_type, gateway)
     91             jclass = getattr(gateway.jvm, creator_class)
     92             if bigdl_type == "float":
---> 93                 self.value.append(getattr(jclass, "ofFloat")())
     94             elif bigdl_type == "double":
     95                 self.value.append(getattr(jclass, "ofDouble")())

TypeError: 'JavaPackage' object is not callable

Environment info:
The Spark version is 1.6 and Zoo is compiled against Spark 1.6.

SSD predict NPE

Following the instructions here to try SSD with BigDL, I found many NPEs in the executor log; the command used is as below:
spark-submit --master spark://bb-node1:7077 --executor-cores 5 --num-executors 10 --total-executor-cores 50 --driver-memory 30G --executor-memory 200G --driver-class-path /mnt/disk1/SSD_Predict/models/ssd/jars/object-detection-0.1-SNAPSHOT-jar-with-dependencies-and-spark.jar --class com.intel.analytics.zoo.pipeline.ssd.example.Predict /mnt/disk1/SSD_Predict/models/ssd/jars/object-detection-0.1-SNAPSHOT-jar-with-dependencies-and-spark.jar -f hdfs://bb-node1:8020/dlbenchmark/data/PASCAL/seq/test/ --folderType seq --caffeDefPath /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/SSD_300x300/test.prototxt --caffeModelPath /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.caffemodel --classname /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/classname.txt -b 200 -r 300 -p 50 -q false

executor log:
18/03/02 13:30:41 WARN FeatureTransformer$: failed /mnt/disk1/SSD_Predict/data/PASCAL/VOCdevkit/VOC2007/JPEGImages/009934.jpg in transformer class com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize
java.lang.NullPointerException
at org.opencv.imgproc.Imgproc.resize(Imgproc.java:2761)
at com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize$.transform(Resize.scala:69)
at com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize.transformMat(Resize.scala:53)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer.transform(FeatureTransformer.scala:58)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer$$anonfun$apply$1.apply(FeatureTransformer.scala:80)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer$$anonfun$apply$1.apply(FeatureTransformer.scala:80)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1128)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Job aborted due to "/ by zero" ERROR

I have migrated the Keras version of flappybird to the zoo.keras version. I use model.fit(state_t, targets) to train my model in distributed mode, with the batch size left at the default of 32, and I submit the code with submit-spark-with-zoo.sh, which is:
#!/bin/bash
export SPARK_HOME=/opt/work/spark-2.1.1-bin-hadoop2.7
export MASTER=spark://Almaren-Node-075:7077
export FTP_URI=$FTP_URI
export ANALYTICS_ZOO_HOME=/root/workspace/analytics-zoo
export ANALYTICS_ZOO_HOME_DIST=$ANALYTICS_ZOO_HOME/dist
export ANALYTICS_ZOO_JAR=`find ${ANALYTICS_ZOO_HOME_DIST}/lib -type f -name "analytics-zoo*jar-with-dependencies.jar"`
export ANALYTICS_ZOO_PYZIP=`find ${ANALYTICS_ZOO_HOME_DIST}/lib -type f -name "analytics-zoo*python-api.zip"`
export ANALYTICS_ZOO_CONF=${ANALYTICS_ZOO_HOME_DIST}/conf/spark-analytics-zoo.conf
export PYTHONPATH=${ANALYTICS_ZOO_PYZIP}:$PYTHONPATH

if [ -z "${ANALYTICS_ZOO_HOME}" ]; then
echo "Please set ANALYTICS_ZOO_HOME environment variable"
exit 1
fi

if [ -z "${SPARK_HOME}" ]; then
echo "Please set SPARK_HOME environment variable"
exit 1
fi

if [ ! -f ${ANALYTICS_ZOO_CONF} ]; then
echo "Cannot find ${ANALYTICS_ZOO_CONF}"
exit 1
fi

if [ ! -f ${ANALYTICS_ZOO_PYZIP} ]; then
echo "Cannot find ${ANALYTICS_ZOO_PYZIP}"
exit 1
fi

if [ ! -f ${ANALYTICS_ZOO_JAR} ]; then
echo "Cannot find ${ANALYTICS_ZOO_JAR}"
exit 1
fi

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--driver-cores 32 \
--driver-memory 180g \
--total-executor-cores 128 \
--executor-cores 32 \
--executor-memory 180g \
--properties-file ${ANALYTICS_ZOO_CONF} \
--py-files ${ANALYTICS_ZOO_PYZIP},${ANALYTICS_ZOO_HOME}/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py \
--jars ${ANALYTICS_ZOO_JAR} \
--conf spark.driver.extraClassPath=${ANALYTICS_ZOO_JAR} \
--conf spark.executor.extraClassPath=${ANALYTICS_ZOO_JAR} \
${ANALYTICS_ZOO_HOME}/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py \
-m "Train" \
$*

But once the training process begins, it throws this error:

TIMESTEP 400 / STATE observe / EPSILON 0.1 / ACTION 0 / REWARD 0.1 / Loss 0
TIMESTEP 401 / STATE explore / EPSILON 0.1 / ACTION 0 / REWARD 0.1 / Loss 0
2018-05-22 15:50:13 INFO DistriOptimizer$:871 - caching training rdd ...
2018-05-22 15:50:14 INFO DistriOptimizer$:664 - Cache thread models...
2018-05-22 15:50:14 INFO DistriOptimizer$:666 - Cache thread models... done
2018-05-22 15:50:14 INFO DistriOptimizer$:136 - Count dataset
2018-05-22 15:50:15 INFO DistriOptimizer$:140 - Count dataset complete. Time elapsed: 0.110093395s
2018-05-22 15:50:15 INFO DistriOptimizer$:148 - config {
maxDropPercentage: 0.0
computeThresholdbatchSize: 100
warmupIterationNum: 200
isLayerwiseScaled: false
dropPercentage: 0.0
}
2018-05-22 15:50:15 INFO DistriOptimizer$:152 - Shuffle data
2018-05-22 15:50:15 INFO DistriOptimizer$:155 - Shuffle data complete. Takes 0.031867857s
2018-05-22 15:50:15 ERROR TaskSetManager:70 - Task 1 in stage 10.0 failed 4 times; aborting job
2018-05-22 15:50:15 ERROR TaskSetManager:70 - Task 1 in stage 10.0 failed 4 times; aborting job
2018-05-22 15:50:15 ERROR DistriOptimizer$:939 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 284, 172.16.0.178, executor 3): java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:312)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:914)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:227)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:249)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

Traceback (most recent call last):
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 226, in
main()
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 218, in main
playGame(args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 212, in playGame
trainNetwork(model,args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 167, in trainNetwork
model.fit(state_t,targets)
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/keras/engine/topology.py", line 162,
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 588, in callBigDlFunc
py4j.protocol.Py4JJavaError: An error occurred while calling o35.zooFit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 284, 172.16.0.178, executor 3): java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:312)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:914)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:227)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:249)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

I also set the batch size to 128 with the same submit parameters and encountered the same problem as before.

ChannelNormalize bug?

I was just checking the code in ImageChannelNormalize, and in BigDL
https://github.com/intel-analytics/BigDL/blob/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/transform/vision/image/augmentation/ChannelNormalize.scala#L52

I find the following code:

  def apply(meanR: Float, meanG: Float, meanB: Float,
    stdR: Float = 1, stdG: Float = 1, stdB: Float = 1): ChannelNormalize = {
    new ChannelNormalize(Array(meanB, meanG, meanR), Array(stdR, stdG, stdB))
  }

Notice that in new ChannelNormalize(Array(meanB, meanG, meanR), Array(stdR, stdG, stdB)), the mean and std arrays have inconsistent channel order. This looks like a bug.

A problem about TFOptimizer.

When I use TFOptimizer to train a TensorFlow model built with slim, I get an error.

# Imports added for completeness (TF 1.x with the contrib slim resnet_v1 nets is an
# assumption; the original may use the TF-slim models repo instead). sc, images,
# labels, SIZE_W, SIZE_H and label_to_num come from the surrounding script.
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import resnet_v1
from zoo.pipeline.api.net import TFDataset

x_rdd = sc.parallelize(images)
y_rdd = sc.parallelize(labels)
train_rdd = x_rdd.zip(y_rdd).map(lambda rec_tuple: [rec_tuple[0], np.array(rec_tuple[1])])


dataset = TFDataset.from_rdd(train_rdd,
                             names=["features", "label"],
                             shapes=[[SIZE_W, SIZE_H, 3], [1]],
                             types=[tf.float32, tf.int32])

data_images, data_labels = dataset.tensors
squeezed_labels = tf.squeeze(data_labels)
with slim.arg_scope(resnet_v1.resnet_arg_scope()):
     logits, end_points = resnet_v1.resnet_v1_200(data_images, num_classes=len(label_to_num), is_training=True)

loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=squeezed_labels))


from zoo.pipeline.api.net import TFOptimizer
from bigdl.optim.optimizer import MaxIteration, Adam, MaxEpoch, TrainSummary

optimizer = TFOptimizer(loss, Adam(1e-3))
optimizer.set_train_summary(TrainSummary("/tmp/resnet_v2", "train"))
optimizer.optimize(end_trigger=MaxEpoch(5))

I ran https://github.com/intel-analytics/analytics-zoo/blob/5212eb75956965fbedc64a0f0bb563bfc0b855b6/pyzoo/zoo/examples/tensorflow/distributed_training/train_lenet.py and got the same error.

Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 44, localhost, executor driver): java.util.concurrent.ExecutionException: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
	 [[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
	at org.tensorflow.Session.run(Native Method)
	at org.tensorflow.Session.access$100(Session.java:48)
	at org.tensorflow.Session$Runner.runHelper(Session.java:298)
	at org.tensorflow.Session$Runner.run(Session.java:248)
	at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
	at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:264)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:202)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
	 [[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
	at org.tensorflow.Session.run(Native Method)
	at org.tensorflow.Session.access$100(Session.java:48)
	at org.tensorflow.Session$Runner.runHelper(Session.java:298)
	at org.tensorflow.Session$Runner.run(Session.java:248)
	at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
	at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:263)
	at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
	at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more

Driver stacktrace:

Zoo keras doesn't support multiple loss

I need to implement a style transfer example with CycleGAN, which combines a cyclic loss and an adversarial loss. However, the compile method of a model in zoo keras only supports a single loss.
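
For reference, a minimal sketch of what plain Keras 1.2.2 allows for a two-output model, i.e. one loss per output plus loss weights, which is the kind of combined cyclic + adversarial objective this request describes; the tiny layers are only stand-ins:

from keras.models import Model
from keras.layers import Input, Dense

inp = Input(shape=(8,))
recon = Dense(8, name="cyclic")(inp)                       # stand-in for the cycle-consistency branch
valid = Dense(1, activation="sigmoid", name="adv")(inp)    # stand-in for the adversarial branch

model = Model(input=inp, output=[recon, valid])
model.compile(optimizer="adam",
              loss={"cyclic": "mae", "adv": "binary_crossentropy"},
              loss_weights={"cyclic": 10.0, "adv": 1.0})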

Readme/notebooks of examples need updates about "install from pip" / "download prebuilt package"

  1. For the Python Notebooks/Examples README, we need to change "build the project from source" to "install from pip" / "download the prebuilt package" and update the running commands (either with python or with scripts) accordingly.
    For the Python Examples, you can refer to the following:

For the Python Apps, you can do something similar.

  2. For the Scala Examples README, we need to change "build the project from source" to "download the prebuilt package" and update the running commands with scripts.
  3. Also, to be safe, after modifying the running commands, it is better to verify them manually or update the Jenkins scripts to make sure they still work.

  4. The original issue: https://github.com/intel-analytics/zoo/issues/366

  5. Feel free to raise suggestions if you have a better way to elaborate the README. (edited by Kai)

WideAndDeepExample does not accept Float continuousCols

The code in WideAndDeepExample.scala does not seem to support multi-value indicators; see this:

  //com.intel.analytics.zoo.models.recommendation.Utils.scala
  // setup deep tensor
  def getDeepTensor(r: Row, columnInfo: ColumnFeatureInfo): Tensor[Float] = {
    val deepColumns1 = columnInfo.indicatorCols
    val deepColumns2 = columnInfo.embedCols ++ columnInfo.continuousCols
    val deepLength = columnInfo.indicatorDims.sum + deepColumns2.length
    val deepTensor = Tensor[Float](deepLength).fill(0)

    // setup indicators
    var acc = 0
    (0 to deepColumns1.length - 1).map {
      i =>
        val index = r.getAs[Int](columnInfo.indicatorCols(i))
        val accIndex = if (i == 0) index
        else {
          acc = acc + columnInfo.indicatorDims(i - 1)
          acc + index
        }
        deepTensor.setValue(accIndex + 1, 1)
    }

    // setup embedding and continuous
    (0 to deepColumns2.length - 1).map {
      i =>
        deepTensor.setValue(i + 1 + columnInfo.indicatorDims.sum,
          r.getAs[Int](deepColumns2(i)).toFloat)
    }
    deepTensor
  }

The example calls the Utils.row2Sample() method, and that method handles the indicator and continuous columns as in the code above.

We can see that it assumes the continuousCols are integers rather than floats (r.getAs[Int]), and it only takes the first value of each indicator. I understand that it's only a demo, but wouldn't it be better if we fixed this?

Dataset creates a huge Array when running the WideAndDeep model on 100GB of data and 1 million features

This may be a BigDL issue, but I came across the problem when I tried to run the W&D model, so I decided to submit it here.

Here is the problem: when I start training the model on Spark, it always fails on huge data with many features. The Spark executors keep running out of memory even when I set the heap memory / direct memory very large, and I notice that the storage memory used is quite low. I'm experienced with Spark, so please believe me that I know how to deal with OOM on Spark.

For instance, I assigned 2048GB of memory in total to the application, and the cached data (the RDDs named "training rdd" and "thread models") only occupies 100GB.

Then an executor fails, telling me that it cannot find an RDD file, with an exception like
spark.FetchFailedException: Failure while fetching StreamChunkId: no such file exception,
and it has to recalculate the RDD, and then it fails again. The executor log shows that the MemoryStore tries to store a single RDD partition of about 50GB, while the executor memory cannot cache such a big block.

I checked the source code and found two places:

  1. The first relates to cache():
//DataSet.scala 
override def cache(): Unit = {
    buffer.count()
    indexes.count()
    isCached = true
  }

When the dataset tries to cache data, it simply calls an action, which triggers the code below:

  def rdd[T: ClassTag](data: RDD[T]): DistributedDataSet[T] = {
    val nodeNumber = Engine.nodeNumber()
    new CachedDistriDataSet[T](
      data.coalesce(nodeNumber, true)
        .mapPartitions(iter => {
          Iterator.single(iter.toArray)
        }).setName("cached dataset")
        .cache()
    )
  }

It calls Spark's cache() method, which will just cache the data in memory, even if the executor memory cannot hold the partition. I believe that's the first reason why we lose the RDD and hit OOM exceptions. To avoid this, we should just replace all cache() calls in the dataset with persist(MEMORY_AND_DISK) or persist(MEMORY_AND_DISK_SER). This may cause the application to run more slowly, but better slow than failing, isn't it? (A generic illustration of the difference is sketched below.)
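
For what it's worth, a generic PySpark illustration of the difference this suggestion is about; the actual change would be in BigDL's DataSet.scala, so this only shows the Spark-level storage levels:

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000000)).map(lambda i: (i, i * i))

# cache() is MEMORY_ONLY: a partition that does not fit in storage memory is simply
# not kept and must be recomputed. persist(MEMORY_AND_DISK) spills it to disk instead.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())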

  2. The second place relates to a deeper problem: why does my application create an object of 50GB? That's really too big.
  def rdd[T: ClassTag](data: RDD[T]): DistributedDataSet[T] = {
    val nodeNumber = Engine.nodeNumber()
    new CachedDistriDataSet[T](
      data.coalesce(nodeNumber, true)
        .mapPartitions(iter => {
          Iterator.single(iter.toArray)
        }).setName("cached dataset")
        .cache()
    )
  }

Same code here. We can see that when I run my W&D application, I have about 100 million training Samples, and the code in Iterator.single(iter.toArray) turns the whole RDD into one big Array. The array is of course huge, for it contains 100 million objects.

I just can't understand why we convert the distributed RDD into one big array. Just to zip it with an immutable index? We could assign indexes with distributed RDDs, and handle other operations like shuffle, too. Is it just a bug?

Sorry that I cannot provide a demo for this issue.
