
zoo-tutorials's People

Contributors

charlenehu94, dependabot[bot], forest216, gzhoffie, hkvision, hzjane, jason-dai, le-zheng, qiuxin2012, xiejinglei, yangw1234

Forkers

hzjane

zoo-tutorials's Issues

The PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON env vars should be removed in keras tutorials

%env PYSPARK_PYTHON=/usr/bin/python3.5
%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5

These environment variables should be removed: users may not be running Python 3.5, and even if they are, the path may not be /usr/bin/python3.5 (e.g. in a conda env).

If there are specific cases where these variables must be set, I think it is better to state that in the README.
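
If an interpreter path does need to be set for a particular environment, it could be derived from the interpreter running the notebook instead of being hardcoded. A minimal sketch (not taken from the tutorials):

import os
import sys

# Point PySpark at whichever Python interpreter is running this notebook,
# so the setting also works in conda environments or other install locations.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable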

Cannot split training into loops

In the Orca PyTorch notebook 6.2 #42, we need to collect the training accuracy/loss and validation accuracy/loss for each epoch. However, Orca does not save the training/validation accuracy automatically, so I train the model in a loop (one epoch per iteration) to obtain and save the training accuracy. I did the same in Orca Keras; however, it does not seem to work here in PyTorch.

In the notebook https://github.com/intel-analytics/zoo-tutorials/blob/8b705c9134337c78e93c72a82c5fd0a92ef3c879/orca/pytorch/6.2-problem.ipynb, I do the following:

for i in range(1, num_epochs + 1):
    # write a separate TensorBoard log for this epoch
    est.set_tensorboard("./log/", "epoch_" + str(i))
    print("\nfit ", i, "start\n")
    # train for a single epoch in each iteration of the loop
    est.fit(data=train_loader, epochs=1, validation_data=val_loader, batch_size=batch_size,
            checkpoint_trigger=EveryEpoch())
    print("\nfit ", i, "end\n")

    print("Get training accuracy: ")
    # evaluate on the training set and record accuracy and the mean training loss
    train_acc_tmp = est.evaluate(data=train_loader, batch_size=batch_size)
    train_acc.append(train_acc_tmp["Top1Accuracy"])
    train_loss_tmp = [_[1] for _ in est.get_train_summary("Loss")]
    train_loss.append(sum(train_loss_tmp) / len(train_loss_tmp))

The output shows that after the first loop the training does not continue; the training process does not even start in loop 2:

fit  2 start

creating: createEveryEpoch
creating: createMaxEpoch
2021-03-08 16:44:48 INFO  DistriOptimizer$:818 - caching training rdd ...
2021-03-08 16:45:13 INFO  DistriOptimizer$:161 - Count dataset
Warn: jep.JepException: <class 'StopIteration'>
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:108)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2021-03-08 16:45:13 INFO  DistriOptimizer$:165 - Count dataset complete. Time elapsed: 0.186302138s
Warn: jep.JepException: <class 'StopIteration'>
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:108)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at com.intel.analytics.zoo.common.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:107)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2021-03-08 16:45:13 WARN  DistriOptimizer$:167 - If the dataset is built directly from RDD[Minibatch], the data in each minibatch is fixed, and a single minibatch is randomly selected in each partition. If the dataset is transformed from RDD[Sample], each minibatch will be constructed on the fly from random samples, which is better for convergence.
2021-03-08 16:45:13 INFO  DistriOptimizer$:173 - config  {
	computeThresholdbatchSize: 100
	maxDropPercentage: 0.0
	warmupIterationNum: 200
	isLayerwiseScaled: false
	dropPercentage: 0.0
 }
2021-03-08 16:45:13 INFO  DistriOptimizer$:177 - Shuffle data
2021-03-08 16:45:13 INFO  DistriOptimizer$:180 - Shuffle data complete. Takes 1.56775E-4s

fit  2 end

I also tried creating both the DataLoader instances and the estimator inside the loop, which does make training run in each iteration, but the result looks as if the whole training process restarts each time (the training accuracy is the same in every loop?). A rough sketch of this variant is shown below.
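
For clarity, that variant looks roughly like the following; the dataset, model and optimizer objects, the import paths, and the keyword names are assumptions based on the Analytics Zoo Orca API rather than code taken from the notebook.

# Hypothetical sketch of the "recreate everything inside the loop" variant.
# train_dataset, val_dataset, model, criterion and optimizer are assumed to be
# defined as in the notebook; import paths and keyword names are assumptions.
from torch.utils.data import DataLoader
from zoo.orca.learn.pytorch import Estimator
from zoo.orca.learn.trigger import EveryEpoch

for i in range(1, num_epochs + 1):
    # rebuild the data loaders and the estimator from scratch each iteration
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    est = Estimator.from_torch(model=model, optimizer=optimizer, loss=criterion)
    est.fit(data=train_loader, epochs=1, validation_data=val_loader,
            batch_size=batch_size, checkpoint_trigger=EveryEpoch())
    # a new estimator each iteration may not carry over optimizer state or the
    # weights trained on the cluster, which could explain why training appears
    # to restart every loop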

For reference, here is a notebook that does not train in a loop and works fine: https://github.com/intel-analytics/zoo-tutorials/blob/8b705c9134337c78e93c72a82c5fd0a92ef3c879/orca/pytorch/6.2-understanding-recurrent-neural-networks.ipynb.
