
zoo-tutorials's People

Contributors

charlenehu94, dependabot[bot], forest216, gzhoffie, hkvision, hzjane, jason-dai, le-zheng, qiuxin2012, xiejinglei, yangw1234

Forkers

hzjane

zoo-tutorials's Issues

The PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON env vars should be removed in keras tutorials

%env PYSPARK_PYTHON=/usr/bin/python3.5
%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5

These environment variables should be removed: users may not be running Python 3.5, and even if they are, the path may not be /usr/bin/python3.5 (e.g. in a conda env).

If there are specific cases where these variables must be set, I think it is better to state that in the README.
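
If an interpreter path does need to be set for a particular environment, it could be derived from the interpreter running the notebook instead of being hardcoded. A minimal sketch (not taken from the tutorials):

import os
import sys

# Point PySpark at whichever Python interpreter is running this notebook,
# so the setting also works in conda environments or other install locations.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable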

Cannot split training into loops

In the Orca PyTorch notebook 6.2 #42, we need to collect the training accuracy/loss and validation accuracy/loss for each epoch. However, Orca does not save the training/validation accuracy automatically, so I train the model in a loop (one epoch per iteration) to obtain and save the training accuracy. I did the same in Orca Keras; however, it does not seem to work here in PyTorch.

In the notebook https://github.com/intel-analytics/zoo-tutorials/blob/8b705c9134337c78e93c72a82c5fd0a92ef3c879/orca/pytorch/6.2-problem.ipynb, I do the following:

for i in range(1, num_epochs + 1):
    # write a separate TensorBoard log for this epoch
    est.set_tensorboard("./log/", "epoch_" + str(i))
    print("\nfit ", i, "start\n")
    # train for a single epoch in each iteration of the loop
    est.fit(data=train_loader, epochs=1, validation_data=val_loader, batch_size=batch_size,
            checkpoint_trigger=EveryEpoch())
    print("\nfit ", i, "end\n")

    print("Get training accuracy: ")
    # evaluate on the training set and record accuracy and the mean training loss
    train_acc_tmp = est.evaluate(data=train_loader, batch_size=batch_size)
    train_acc.append(train_acc_tmp["Top1Accuracy"])
    train_loss_tmp = [_[1] for _ in est.get_train_summary("Loss")]
    train_loss.append(sum(train_loss_tmp) / len(train_loss_tmp))

The output shows that after the first loop the training does not continue; the training process does not even start in loop 2:

fit  2 start

creating: createEveryEpoch
creating: createMaxEpoch
2021-03-08 16:44:48 INFO  DistriOptimizer$:818 - caching training rdd ...
2021-03-08 16:45:13 INFO  DistriOptimizer$:161 - Count dataset
Warn: jep.JepException: <class 'StopIteration'>
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:108)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2021-03-08 16:45:13 INFO  DistriOptimizer$:165 - Count dataset complete. Time elapsed: 0.186302138s
Warn: jep.JepException: <class 'StopIteration'>
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:108)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2021-03-08 16:45:13 WARN  DistriOptimizer$:167 - If the dataset is built directly from RDD[Minibatch], the data in each minibatch is fixed, and a single minibatch is randomly selected in each partition. If the dataset is transformed from RDD[Sample], each minibatch will be constructed on the fly from random samples, which is better for convergence.
2021-03-08 16:45:13 INFO  DistriOptimizer$:173 - config  {
	computeThresholdbatchSize: 100
	maxDropPercentage: 0.0
	warmupIterationNum: 200
	isLayerwiseScaled: false
	dropPercentage: 0.0
 }
2021-03-08 16:45:13 INFO  DistriOptimizer$:177 - Shuffle data
2021-03-08 16:45:13 INFO  DistriOptimizer$:180 - Shuffle data complete. Takes 1.56775E-4s

fit  2 end

I also tried creating both the DataLoader instances and the estimator inside the loop, which does make training run in each iteration, but the result looks as if the whole training process restarts each time (the training accuracy is the same in every loop?). A rough sketch of this variant is shown below.
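
For clarity, that variant looks roughly like the following; the dataset, model and optimizer objects, the import paths, and the keyword names are assumptions based on the Analytics Zoo Orca API rather than code taken from the notebook.

# Hypothetical sketch of the "recreate everything inside the loop" variant.
# train_dataset, val_dataset, model, criterion and optimizer are assumed to be
# defined as in the notebook; import paths and keyword names are assumptions.
from torch.utils.data import DataLoader
from zoo.orca.learn.pytorch import Estimator
from zoo.orca.learn.trigger import EveryEpoch

for i in range(1, num_epochs + 1):
    # rebuild the data loaders and the estimator from scratch each iteration
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    est = Estimator.from_torch(model=model, optimizer=optimizer, loss=criterion)
    est.fit(data=train_loader, epochs=1, validation_data=val_loader,
            batch_size=batch_size, checkpoint_trigger=EveryEpoch())
    # a new estimator each iteration may not carry over optimizer state or the
    # weights trained on the cluster, which could explain why training appears
    # to restart every loop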

For reference, here is a notebook that does not train in a loop and works fine: https://github.com/intel-analytics/zoo-tutorials/blob/8b705c9134337c78e93c72a82c5fd0a92ef3c879/orca/pytorch/6.2-understanding-recurrent-neural-networks.ipynb.
