
tutorial's Introduction

Ray Tutorial

NOTE: This set of tutorials has been deprecated. A portion of its modules has been incorporated into the new Anyscale Academy tutorials at https://github.com/anyscale/academy.

Try Ray on Google Colab

Try the Ray tutorials online using Google Colab:

Try Tune on Google Colab

Tuning hyperparameters is often the most expensive part of the machine learning workflow. Ray Tune is built to address this, providing an efficient and scalable solution for this pain point. (A minimal sketch of a Tune run follows the exercise list below.)

Exercise 1 covers the basics of using Tune: creating your first training function and running it with Tune. This tutorial uses Keras.

Tune Tutorial

Exercise 2 covers Search algorithms and Trial Schedulers. This tutorial uses PyTorch.

Tune Tutorial

Exercise 3 covers using Population-Based Training (PBT) and uses the advanced Trainable API with save and restore functions and checkpointing.

Tune Tutorial
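
As a rough preview of what these exercises build toward, here is a minimal sketch of a Tune run; the training function, metric name, and search space are illustrative assumptions, and it presumes a Ray version where tune.run and tune.report are available.

    from ray import tune

    # Illustrative training function; the metric name and search space are assumptions.
    def train_example(config):
        for step in range(10):
            # A fake "score" that depends on the hyperparameter, just to have something to report.
            score = 1.0 / (1.0 + abs(config["lr"] - 0.01)) - 0.01 / (step + 1)
            tune.report(mean_score=score)  # report a metric to Tune after each step

    analysis = tune.run(
        train_example,
        config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    )
    print(analysis.get_best_config(metric="mean_score", mode="max"))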

Try Ray on Binder

Try the Ray tutorials online on Binder. Note that Binder will use very small machines, so the degree of parallelism will be limited.

Local Setup

  1. Make sure you have Python installed (we recommend using the Anaconda Python distribution). Ray works with both Python 2 and Python 3. If you are unsure which to use, then use Python 3.

    If not using conda, continue to step 2.

    If using conda, you can then run the following commands and skip the next 4 steps:

    git clone https://github.com/ray-project/tutorial
    cd tutorial
    conda env create -f environment.yml
    conda activate ray-tutorial
  2. Install Jupyter with pip install jupyter. Verify that you can start Jupyter lab with the command jupyter-lab or jupyter-notebook.
  3. Install Ray by running pip install -U ray. Verify that you can run

    import ray
    ray.init()

    in a Python interpreter.

  4. Clone the tutorial repository with

    git clone https://github.com/ray-project/tutorial.git
  5. Install the additional dependencies.

    Either install them from the provided requirements.txt with pip install -r requirements.txt,

    Or install them manually

    pip install modin
    pip install tensorflow
    pip install gym
    pip install scipy
    pip install opencv-python
    pip install bokeh
    pip install ipywidgets==6.0.0
    pip install keras

    Verify that you can run import tensorflow and import gym in a Python interpreter.

    Note: If you have trouble installing these Python modules, note that almost all of the exercises can be done without them.

  6. If you want to run the pong exercise (in rl_exercises/rl_exercise05.ipynb), you will need to do pip install utilities/pong_py.

Exercises

Each file exercises/exercise*.ipynb is a separate exercise. They can be opened in Jupyter lab by running the following commands.

cd tutorial/exercises
jupyter-lab

If you don't have jupyter-lab, try jupyter-notebook. If it asks for a password, just hit enter.

Instructions are written in each file. To do each exercise, first run all of the cells in Jupyter lab, then modify the cells that need changing so that no exceptions are raised. Throughout these exercises, you may find the Ray documentation helpful. A minimal sketch of the core Ray primitives these exercises cover follows the list below.

Exercise 1: Define a remote function, and execute multiple remote functions in parallel.

Exercise 2: Execute remote functions in parallel with some dependencies.

Exercise 3: Call remote functions from within remote functions.

Exercise 4: Use actors to share state between tasks. See the documentation on using actors.

Exercise 5: Pass actor handles to tasks so that multiple tasks can invoke methods on the same actor.

Exercise 6: Use ray.wait to ignore stragglers. See the documentation for wait.

Exercise 7: Use ray.wait to process tasks in the order that they finish. See the documentation for wait.

Exercise 8: Use ray.put to avoid serializing and copying the same object into shared memory multiple times.

Exercise 9: Specify that an actor requires some GPUs. For a complete example that does something similar, you may want to see the ResNet example.

Exercise 10: Specify that a remote function requires certain custom resources. See the documentation on custom resources.

Exercise 11: Extract neural network weights from an actor on one process, and set them in another actor. You may want to read the documentation on using Ray with TensorFlow.

Exercise 12: Pass object IDs into tasks to construct dependencies between tasks and perform a tree reduce.
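
For orientation, here is a minimal sketch of the core primitives these exercises cover (remote functions, ray.get, ray.wait, ray.put, and actors); the function and class names are illustrative assumptions, not code from the notebooks.

    import ray

    ray.init()

    @ray.remote
    def square(x):
        return x * x

    # Launch tasks in parallel and block on all of the results (Exercises 1-3).
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]

    # Process whichever results are ready first (Exercises 6-7).
    ready, not_ready = ray.wait(futures, num_returns=2)

    # Put a large object into the object store once and reuse its ID (Exercise 8).
    big_object_id = ray.put(list(range(100000)))

    @ray.remote
    class Counter:
        # A simple actor holding mutable state (Exercises 4-5).
        def __init__(self):
            self.count = 0

        def increment(self):
            self.count += 1
            return self.count

    counter = Counter.remote()
    print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]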

More In-Depth Examples

Sharded Parameter Server: This exercise involves implementing a parameter server as a Ray actor, implementing a simple asynchronous distributed training algorithm, and sharding the parameter server to improve throughput.
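
As background, the parameter server in this style of example is typically just a Ray actor that stores weights and applies incoming gradients; the following is a minimal sketch under that assumption (class, function, and parameter names are illustrative, not the exercise's solution).

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    class ParameterServer:
        def __init__(self, dim):
            self.weights = np.zeros(dim)

        def get_weights(self):
            return self.weights

        def apply_gradient(self, grad):
            # Updates are applied in whatever order they arrive (asynchronous training).
            self.weights -= 0.01 * grad
            return self.weights

    @ray.remote
    def worker_step(ps):
        weights = ray.get(ps.get_weights.remote())
        grad = np.random.normal(size=weights.shape)  # stand-in for a real gradient
        ray.get(ps.apply_gradient.remote(grad))

    ps = ParameterServer.remote(10)
    ray.get([worker_step.remote(ps) for _ in range(4)])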

Speed Up Pandas: This exercise involves using Modin to speed up your pandas workloads.
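
Modin's core idea is to be a drop-in replacement for the pandas import, so existing pandas code needs little or no change; a minimal sketch (the CSV path is a hypothetical placeholder):

    import ray
    ray.init()

    import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

    df = pd.read_csv("example.csv")  # hypothetical file; the read is parallelized by Modin on Ray
    print(df.describe())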

MapReduce: This exercise shows how to implement a toy version of the MapReduce system on top of Ray.
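
The gist is that the map and reduce phases are just Ray tasks; below is a minimal, word-count-style sketch under that assumption (function names and the toy input are illustrative, not the tutorial's code).

    from collections import Counter

    import ray

    ray.init()

    @ray.remote
    def map_phase(chunk):
        # Count the words in one chunk of the input.
        return Counter(chunk.split())

    @ray.remote
    def reduce_phase(*partial_counts):
        # Merge the per-chunk counts into one result.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    chunks = ["ray makes parallelism easy", "parallelism with ray is easy"]
    partials = [map_phase.remote(c) for c in chunks]
    print(ray.get(reduce_phase.remote(*partials)))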

RL Exercises

The exercises in rl_exercises/rl_exercise*.ipynb should be done in order. They can be opened in Jupyter lab by running the following commands.

cd tutorial/rl_exercises
jupyter-lab

Exercise 1: Introduction to Markov Decision Processes.

Exercise 2: Derivative free optimization.

Exercise 3: Introduction to proximal policy optimization (PPO).

Exercise 4: Introduction to asynchronous advantage actor-critic (A3C).

Exercise 5: Train a policy to play pong using RLlib. Deploy it using actors, and play against the trained policy.
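
For orientation, training a policy with RLlib (as in the pong exercise) roughly follows the pattern below; this is a minimal sketch against the older ray.rllib.agents API these notebooks targeted, using CartPole instead of pong to keep it small, and the config values are illustrative assumptions.

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()

    # Illustrative config; values are assumptions, not the exercise's settings.
    trainer = PPOTrainer(config={"num_workers": 1}, env="CartPole-v0")

    for i in range(3):
        result = trainer.train()  # one training iteration
        print(i, result["episode_reward_mean"])

    checkpoint_path = trainer.save()  # checkpoint the trained policy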

tutorial's People

Contributors

alvkao58, anthonyhsyu, arisliang, barakmich, charlesreid1, deanwampler, dependabot[bot], devin-petersohn, dmatrix, edoakes, ericl, feynmanliang, holdenk, jiahaoyao, jleni, jzf2101, krfricke, manner, pcmoritz, richardliaw, robertnishihara, sunstick, suquark, vinamrabenara, vinceshieh, williamma12, worldveil


tutorial's Issues

[Request] More low-level examples on RLlib

I am trying to learn how to use RLlib's default models/optimisers/environments, but also how to create my own and add them to RLlib. I would like to see an example (it could be PPO with CartPole) built from scratch step by step - when I look at the code and see it all at once, I get a bit overwhelmed. Has this already been done somewhere else?

Love what you have been building! Cheers.

Notes / suggestions / bugs in ray tutorials

Not sure if I'll get around to making a PR to fix some of these, so I'm at least sharing my notes:

Exercise 1: State up front that ray.get() works on arrays of object ids, not just individual object ids.

Exercises 4 and 7: make clear that you may need to add new lines, reorganize, etc.

Exercise 8:

  • boring. Make me do something, e.g. use put() vs auto-creating multiple copies, show that it's faster

Exercise 9:

  • num_gpus isn't actually set to 4.
  • success misspelled in a couple of places
  • also a bit boring. Not sure if there's something better to do without requiring actual GPUs. Maybe ask the student to print out the GPU ids in the actor and task.

Exercise 10:

  • In the intro, a missing piece of context is how a machine specifies what resources it provides. If I understood correctly, ray.init specifies what resources the cluster overall has, and each task says which resources it requires, but the rest isn't clear.

Exercise 11:

  • nit: telling me to use set/get_flat too late -- easy enough to fix, but better to suggest that up front.

Exercise 12:

  • treefold! Yay :)
  • I didn't notice the "you will need to refactor" comment at first and was confused. Maybe also mention in the instructions cell, not just in the code comment.

Exercise 7 typo

Exercise 7, penultimate cell:

    print('Processing result which finished after {} seconds.'
          .format(result - start_time))

the last part should be (time.time() - start_time)

RL exercise improvements

  • The RL exercises don't give feedback about whether the exercise was done correctly or not.
  • The RL exercises mostly involve running code and don't require too much modification of code.

Simplify exercise 2.

This one causes a lot of confusion. In particular, people call ray.get inside the for loop, which means no parallelism happens.
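
To make the pitfall concrete, here is a hedged sketch of the anti-pattern versus the intended pattern (the function name and sleep duration are illustrative):

import time
import ray

ray.init()

@ray.remote
def slow_task(i):
    time.sleep(1)
    return i

# Anti-pattern: calling ray.get inside the loop serializes the work (~5 seconds).
results = [ray.get(slow_task.remote(i)) for i in range(5)]

# Intended pattern: launch everything first, then call ray.get once (~1 second).
results = ray.get([slow_task.remote(i) for i in range(5)])

The timing difference between the two versions is exactly what the exercise is meant to demonstrate.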

Measuring performance impact of pickling in exercise09.py

exercise09.py says I should do this but it's not clear how to do it:

  # Compare the performance difference with and without pickling.
  result = ray.get(ray.put(Qux(1000, 10000)))
  assert result.bar.a == 1000
  assert result.bar.b == 10000
  assert np.allclose(result.bar.c, np.ones((1000, 10000)))

Tune Tutorial: HyperOpt example crashes

I use the latest stable version of Ray and the HyperOpt example raises the following exception:
AttributeError: 'float' object has no attribute 'items'

Found out that it doesn't occur if there is no momentum in the config of experiment2, but it would probably be better to fix the underlying issue (deep_update can't replace a dict with a non-dict).

Exercise 09 duration test failed on EC2

On EC2 the value of psutil.cpu_count() corresponds to vCPUs(?), which means that the actual parallelism that can be achieved on the machines is less than the experiment expects. As a result, even a correctly parallelized implementation fails the test.

Use a colab link instead?

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)

Exercise 4 question about loop

What's the difference between these two implementations using an actor?

# Implementation 1
results = []
for _ in range(5):
    results.append(f1.increment.remote())
results = ray.get(results)

# Implementation 2
results = ray.get([f1.increment.remote() for _ in range(5)])

The results are the same, but the first one seems too fast for this problem?

RL exercise 3 can not hit the policy server

In this first part of the tutorial, after implementing run_one_episode, client.start_episode() throws an exception:
ERROR policy_client.py:115 -- Request failed.
...

The system returned: (101) Network is unreachable

The remote host or network may be down. Please try the request again.

...

I checked localhost:8900 in my web browser and it's up. I was wondering whether there is a bug or whether I am missing something.

Sorry if this is not the best place to ask such a question. I just didn't find a better place.

Trouble installing TensorFlow (No matching distribution found for tensorflow==1.0).

Some people encountered the following problem when doing pip install tensorflow (with Anaconda).

Collecting tensorflow==1.0
  Could not find a version that satisfies the requirement tensorflow==1.0 (from versions: )
No matching distribution found for tensorflow==1.0

The solution may have been to install tensorflow with a URL like the following.

pip install --upgrade  https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0-py2-none-any.whl

@richardliaw Is that correct? Do you remember what the solution was?

Run ERROR

Hi,

When I ran the example like this:
python evolution_strategies.py
I got
ImportError: No module named ray_tutorial.evolution_strategies

Where can I find the "ray_tutorial" library?

Tutorial for JVM

Could you please add a simple end-to-end example where a basic algorithm written in Java is tuned in a distributed environment?

Thanks a bunch!

Exercise 03: ray.get does not seem to wait for remote function completion

Modified code from Exercise 3 is below:

@ray.remote
def compute_gradient(data):
    time.sleep(0.03)
    return 1

@ray.remote
def train_model(hyperparameters):
    result = 0
    for i in range(10):
        result += sum(ray.get([compute_gradient.remote(j) for j in range(2)]))
    return result

results = []
for hyperparameters in [{'learning_rate': 1e-1, 'batch_size': 100},
                        {'learning_rate': 1e-2, 'batch_size': 100},
                        {'learning_rate': 1e-3, 'batch_size': 100}]:
    results.append(train_model.remote(hyperparameters))

print(results)
ray.get(results)  # This line is NOT blocking ...
print(results)

end_time = time.time()
duration = end_time - start_time

The output I see is below:

[ObjectID(80e4a9df518deab827ffee6ce9fde852ec326741), ObjectID(5f104cc60f425f5519c059dd72d24c9a100365a4), ObjectID(c95c5d3e4193a2bc764d27b07466ff2f6ae625b0)]

[ObjectID(80e4a9df518deab827ffee6ce9fde852ec326741), ObjectID(5f104cc60f425f5519c059dd72d24c9a100365a4), ObjectID(c95c5d3e4193a2bc764d27b07466ff2f6ae625b0)]

This seems like a bug somewhere. ray.get(results) should have blocked the second print until the results array is resolved with the actual function return values.

ray.get cannot return ObjectIDs. As per the documentation, it should return:

Returns: A Python object or a list of Python objects.

Can someone comment?
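
For anyone hitting the same confusion: ray.get does block, but it returns a new list of values rather than mutating the list of ObjectIDs passed in, so printing the original list still shows ObjectIDs. A minimal sketch of the distinction (illustrative names):

import ray

ray.init()

@ray.remote
def f(x):
    return x + 1

object_ids = [f.remote(i) for i in range(3)]
values = ray.get(object_ids)  # blocks until all three tasks finish

print(object_ids)  # still a list of ObjectIDs: ray.get never mutates its argument
print(values)      # [1, 2, 3] -- the actual return values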

ModuleNotFoundError: No module named 'ray'

When trying to run exercise01.ipynb, I got the following error. Can anyone help me?
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import ray
import time

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
in
      3 from __future__ import print_function
      4
----> 5 import ray
      6 import time

ModuleNotFoundError: No module named 'ray'

Thanks in advance.

Run multiple trials on a single GPU

I am trying to do large-scale hyperparameter tuning. I have a local setup of 4 GPUs. My model is small (~1 GB), so I was thinking of running multiple trials on a single GPU so that I can parallelize tuning even more.

Even setting resources_per_trial={"gpu": 0.3} is not helping.

Is there a way I can do this?

Please help.

Consider adding function to check if initialized

I am building an API in which multiple parts may be used independently, but each part may need to do some parallel processing. While a try/except works to prevent re-initializing Ray if it has already been initialized by another part of the program, it does not prevent Ray from printing its initialization information and giving a new, probably broken, UI address before the exception occurs. A function to check whether Ray is already initialized, or to return the local node, would be convenient. It is possible this exists and I could not find it in the documentation or in the ray module's dir() members after importing. Likewise, an option to change verbosity would be convenient. That being said, Ray is working amazingly well; thank you!

Evolution strategies code is Python 3-only

byte-compiling build/bdist.macosx-10.12-x86_64/egg/ray_tutorial/evolution_strategies/policies.py to policies.pyc
  File "build/bdist.macosx-10.12-x86_64/egg/ray_tutorial/evolution_strategies/policies.py", line 72
    def rollout(self, env, *, render=False, timestep_limit=None, save_obs=False, random_stream=None):
                            ^
SyntaxError: invalid syntax

byte-compiling build/bdist.macosx-10.12-x86_64/egg/ray_tutorial/evolution_strategies/tabular_logger.py to tabular_logger.pyc
  File "build/bdist.macosx-10.12-x86_64/egg/ray_tutorial/evolution_strategies/tabular_logger.py", line 83
    def log(*args, level=INFO):
                       ^
SyntaxError: invalid syntax

Batched wait

In exercise05.py, another (harder) alternative could be to wait for either the first 10 or the last 10 (instead of any 10) --- that would be a fun additional exercise.

Unbound Local Error when running rllib_exercise02

I get this error when trying to create a PPO agent


UnboundLocalError Traceback (most recent call last)
in
6 config['num_cpus_per_worker'] = 0 # This avoids running out of resources in the notebook environment when this cell is re-executed
7
----> 8 agent = PPOTrainer(config, 'CartPole-v0')

~/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in __init__(self, config, env, logger_creator)
309 logger_creator = default_logger_creator
310
--> 311 Trainable.__init__(self, config, logger_creator)
312
313 @classmethod

~/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py in __init__(self, config, logger_creator)
86 self._iterations_since_restore = 0
87 self._restored = False
---> 88 self._setup(copy.deepcopy(self.config))
89 self._local_ip = ray.services.get_node_ip_address()
90

~/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in _setup(self, config)
422
423 with get_scope():
--> 424 self._init(self.config, self.env_creator)
425
426 # Evaluation related

~/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py in _init(self, config, env_creator)
61 policy = get_policy_class(config)
62 self.local_evaluator = self.make_local_evaluator(
---> 63 env_creator, policy)
64 self.remote_evaluators = self.make_remote_evaluators(
65 env_creator, policy, config["num_workers"])

~/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in make_local_evaluator(self, env_creator, policy, extra_config)
620 config["local_evaluator_tf_session_args"]
621 }),
--> 622 extra_config or {}))
623
624 @DeveloperAPI

~/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in _make_evaluator(self, cls, env_creator, policy, worker_index, config)
845 remote_env_batch_wait_ms=config["remote_env_batch_wait_ms"],
846 soft_horizon=config["soft_horizon"],
--> 847 _fake_sampler=config.get("_fake_sampler", False))
848
849 @override(Trainable)

~/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/policy_evaluator.py in __init__(self, env_creator, policy, policy_mapping_fn, policies_to_train, tf_session_creator, batch_steps, batch_mode, episode_horizon, preprocessor_pref, sample_async, compress_observations, num_envs, observation_filter, clip_rewards, clip_actions, env_config, model_config, policy_config, worker_index, monitor_path, log_dir, log_level, callbacks, input_creator, input_evaluation, output_creator, remote_worker_envs, remote_env_batch_wait_ms, soft_horizon, _fake_sampler)
319 with self.tf_sess.as_default():
320 self.policy_map, self.preprocessors =
--> 321 self._build_policy_map(policy_dict, policy_config)
322 else:
323 self.policy_map, self.preprocessors = self._build_policy_map(

~/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/policy_evaluator.py in _build_policy_map(self, policy_dict, policy_config)
725 if tf:
726 with tf.variable_scope(name):
--> 727 policy_map[name] = cls(obs_space, act_space, merged_conf)
728 else:
729 policy_map[name] = cls(obs_space, act_space, merged_conf)

~/anaconda3/lib/python3.7/site-packages/ray/rllib/policy/tf_policy_template.py in __init__(self, obs_space, action_space, config, existing_inputs)
107 grad_stats_fn=grad_stats_fn,
108 before_loss_init=before_loss_init_wrapper,
--> 109 existing_inputs=existing_inputs)
110
111 if after_init:

~/anaconda3/lib/python3.7/site-packages/ray/rllib/policy/dynamic_tf_policy.py in __init__(self, obs_space, action_space, config, loss_fn, stats_fn, grad_stats_fn, before_loss_init, make_action_sampler, existing_inputs, get_batch_divisibility_req)
90 "prev_actions": prev_actions,
91 "prev_rewards": prev_rewards,
---> 92 "is_training": self._get_is_training_placeholder(),
93 }
94

~/anaconda3/lib/python3.7/site-packages/ray/rllib/policy/tf_policy.py in _get_is_training_placeholder(self)
310 """
311 if not hasattr(self, "_is_training"):
--> 312 self._is_training = tf.placeholder_with_default(False, ())
313 return self._is_training
314

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py in placeholder_with_default(input, shape, name)
2091 A Tensor. Has the same type as input.
2092 """
-> 2093 return gen_array_ops.placeholder_with_default(input, shape, name)
2094
2095

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py in placeholder_with_default(input, shape, name)
5923 shape = _execute.make_shape(shape, "shape")
5924 _, _, _op = _op_def_lib._apply_op_helper(
-> 5925 "PlaceholderWithDefault", input=input, shape=shape, name=name)
5926 _result = _op.outputs[:]
5927 _inputs_flat = _op.inputs

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
509 dtype=dtype,
510 as_ref=input_arg.is_ref,
--> 511 preferred_dtype=default_dtype)
512 except TypeError as err:
513 if dtype is None:

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors)
1173
1174 if ret is None:
-> 1175 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1176
1177 if ret is NotImplemented:

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
302 as_ref=False):
303 _ = as_ref
--> 304 return constant(v, dtype=dtype, name=name)
305
306

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
243 """
244 return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 245 allow_broadcast=True)
246
247

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
281 tensor_util.make_tensor_proto(
282 value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 283 allow_broadcast=allow_broadcast))
284 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
285 const_tensor = g.create_op(

~/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
571 raise TypeError(
572 "Element type not supported in TensorProto: %s" % numpy_dtype.name)
--> 573 append_fn(tensor_proto, proto_values)
574
575 return tensor_proto

tensorflow/python/framework/fast_tensor_util.pyx in tensorflow.python.framework.fast_tensor_util.AppendBoolArrayToTensorProto()

~/anaconda3/lib/python3.7/site-packages/numpy/lib/type_check.py in asscalar(failed resolving arguments)
545 warnings.warn('np.asscalar(a) is deprecated since NumPy v1.16, use '
546 'a.item() instead', DeprecationWarning, stacklevel=1)
--> 547 return a.item()
548
549 #-----------------------------------------------------------------------------

UnboundLocalError: local variable 'a' referenced before assignment

How to turn class involving super() into a remote actor?

Hello there,
I'm new to using Ray. After completing the tutorial, I was trying to do some more exercises myself. The TensorFlow tutorial seemed like a good choice, so I changed part of neural_network.py into the following lines:

# Create TF Model.
# TODO: change the NeuralNet class into an actor
@ray.remote
class NeuralNet(Model):
    # Set layers.
    def __init__(self):
        super(NeuralNet, self).__init__()
        # First fully-connected hidden layer.
        self.fc1 = layers.Dense(n_hidden_1, activation=tf.nn.relu)
        # First fully-connected hidden layer.
        self.fc2 = layers.Dense(n_hidden_2, activation=tf.nn.relu)
        # Second fully-connecter hidden layer.
        self.out = layers.Dense(num_classes, activation=tf.nn.softmax)
        print("Init NerualNet finished")

    # Set forward pass.
    def call(self, x, is_training=False):
        x = self.fc1(x)
        x = self.out(x)
        if not is_training:
            # tf cross entropy expect logits without softmax, so only
            # apply softmax when not training.
            x = tf.nn.softmax(x)
        return x

# Build neural network model.
# this will be an object with certain ID----actor handler
neural_net = NeuralNet.remote()

=====================================================================
The output shows that executing super(NeuralNet, self).__init__() failed.

(pid=23354) 2020-02-26 16:04:09.469796: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
(pid=23354) 2020-02-26 16:04:09.469906: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
(pid=23354) 2020-02-26 16:04:09.469920: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

2020-02-26 16:04:15,131	ERROR worker.py:998 -- Possible unhandled error from worker: ray::NeuralNet.__init__() (pid=23354, ip=192.168.124.128)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 430, in ray._raylet.execute_task.function_executor
  File "<ipython-input-17-f79ce8a34c46>", line 7, in __init__
TypeError: super() argument 1 must be type, not ActorClass(NeuralNet)

OS: ubuntu18.04
python3: 3.7.6
tensorflow: 2.0

Best regards
neoBlue

Simplify exercise 3.

One issue is that people call ray.get inside train_model but outside the for loop (as opposed to inside the for loop). This causes the timing to be too fast.

Another plea to add check if initialized

I have a use case that I find hard to solve with the Ray API as it currently exists:

If users of my library have not initialized Ray, I want to run in single-threaded mode, without using Ray. Therefore I need to know whether Ray has been initialized.

The current way I solve this is

if "ray" in sys.modules:
    try:
        # Run a dummy function with ray
        # to see if it throws a not-initialized exception.
        ray_initialized = True
    except Exception:
        ray_initialized = False

Then I can do processing downstream based on this. But using try/except is not as robust as it could be, and it is also verbose.
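
For what it's worth, newer Ray releases expose ray.is_initialized(), which avoids the try/except dance; a minimal sketch (the helper functions are hypothetical):

import ray

# Available in newer Ray releases; returns True once ray.init() has been called.
if ray.is_initialized():
    run_with_ray()         # hypothetical helper for the parallel code path
else:
    run_single_threaded()  # hypothetical helper for the fallback code path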

Exercise 8: why do more experiments cause get_logs to stop returning?

The exercise suggests "You can also try running more of the experiment tasks and see what happens". However, after increasing the number of experiment runs, get_logs no longer returns results immediately.

I'm curious why this happens. Does it have something to do with the number of CPUs, i.e. that the number of experiments can't exceed the number of CPUs minus 1?

error in tune exercise 2

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 351, in fetch_result
result = ray.get(trial_future[0])
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2121, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray_WrappedTrackFunc:train() (pid=1561, ip=172.28.0.2)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 176, in train
result = self._train()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 199, in _train
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 242, in _report_thread_runner_error
.format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray_WrappedTrackFunc:train() (pid=1561, ip=172.28.0.2)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 97, in run
self._entrypoint()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 145, in entrypoint
return self._trainable_func(config, self.status_reporter)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 271, in trainable_func
output = train_func(config)
File "", line 9, in train_mnist
File "/usr/local/lib/python3.6/dist-packages/ray/tune/examples/mnist_pytorch.py", line 50, in train
optimizer.step()
File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 106, in step
p.data.add
(-group['lr'], d_p)
TypeError: add
() takes 1 positional argument but 2 were given
