Comments (16)

crr0004 avatar crr0004 commented on August 17, 2024

from deepracer-core.

sctse999 avatar sctse999 commented on August 17, 2024

Interestingly, the SageMaker console log stopped there without further output. Is there anything I can set to get more verbose output?

P.S. Thanks a lot for sharing this project.

sctse999 avatar sctse999 commented on August 17, 2024

I suspect the script is terminated once it gets past 60 minutes. I just can't find the parameter that controls this.

crr0004 avatar crr0004 commented on August 17, 2024

There is a method that controls exiting in the simulation; I don't recall whether it's also checked in the SageMaker part.

    def is_training_done(self):
        if ((self.target_number_of_episodes > 0) and (self.target_number_of_episodes == self.episodes)) or \
           ((self.is_number(self.target_reward_score)) and (self.target_reward_score <= self.reward_in_episode)):
            self.is_simulation_done = True
        return self.is_simulation_done

Is there anything further up in the SageMaker container output? You can try using tee to capture the log and watch it at the same time.
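
To make the exit condition concrete, here is a minimal standalone sketch of the same check. The class name `TrainingState`, its constructor, and the body of `is_number` are assumptions made for illustration; only `is_training_done` mirrors the snippet above. Nothing in this particular method looks at elapsed time, so on its own it would not explain a stop at the 60-minute mark.

```python
import numbers


class TrainingState:
    """Hypothetical stand-in for the simulation object that owns the check."""

    def __init__(self, target_number_of_episodes=0, target_reward_score=None):
        self.target_number_of_episodes = target_number_of_episodes
        self.target_reward_score = target_reward_score
        self.episodes = 0
        self.reward_in_episode = 0.0
        self.is_simulation_done = False

    @staticmethod
    def is_number(value):
        # Assumed helper: treat ints and floats (but not bools or None) as numbers.
        return isinstance(value, numbers.Real) and not isinstance(value, bool)

    def is_training_done(self):
        # Same logic as the snippet above: stop on an episode count or a reward target.
        if ((self.target_number_of_episodes > 0) and (self.target_number_of_episodes == self.episodes)) or \
           ((self.is_number(self.target_reward_score)) and (self.target_reward_score <= self.reward_in_episode)):
            self.is_simulation_done = True
        return self.is_simulation_done


# Walk through a run with a three-episode target and print the flag each time.
state = TrainingState(target_number_of_episodes=3)
for _ in range(3):
    print(state.episodes, state.is_training_done())
    state.episodes += 1
print(state.episodes, state.is_training_done())
```

With both targets left at their defaults (0 episodes, no reward score), the method never returns True, which is one way a job could run until something outside the simulation stops it.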

Nicolas-Kuhl avatar Nicolas-Kuhl commented on August 17, 2024

I'm getting the same error whenever it finishes 20 episodes and sagemaker needs to update the model - can't find any error message beyond the one above.

Were you able to get any more info on the issue? Is there anywhere else I can try to get logs from?

sctse999 avatar sctse999 commented on August 17, 2024

I am not sure why but the issue is gone.
I did two things:

  1. I moved to an Ubuntu desktop
  2. I am using the latest code of this repo (the updated Docker image is probably the reason)

Nicolas-Kuhl avatar Nicolas-Kuhl commented on August 17, 2024

Ah, I've been using it on a Mac - might have to try Ubuntu instead.

Oddly enough I've switched from the Tokyo track to the re:Invent and don't seem to be getting the same problem.

joezen777 avatar joezen777 commented on August 17, 2024

I'm using the longer track. I had the same problems on my Mac Pro at home. On my other computer I made sure to update robomaker.env with the correct local IP address (not the 127 one, but the one I get from the router). I'm thinking this could have been it. I'll verify when I get home. It makes sense that the error happens when it tries to talk back to SageMaker to start the training epochs.

Nicolas-Kuhl avatar Nicolas-Kuhl commented on August 17, 2024

In my case it's not the IP address (I have that part done already). I can get it to work now when training a new model, but whenever I start a job from an existing model checkpoint (i.e. continuing to train an existing one) it does this on the first model update... the lack of an error message is making this hard to track down!

joezen777 avatar joezen777 commented on August 17, 2024

I trudged through every theory from permissions to virus scans when I got home. I ended up deleting all my folders and starting from scratch. Now I'm getting errors about tempfiles in robo/containers not existing.

crr0004 avatar crr0004 commented on August 17, 2024

There could be any number of reasons for the error. That error message just says that Python encountered an error. There is normally a stack trace further up.
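
If the output has been captured to a file, a small script can help locate that stack trace. This is only a sketch: the log file name `sagemaker.log` is a placeholder, the function name is made up for illustration, and it assumes the container prints standard Python tracebacks that begin with `Traceback (most recent call last):`.

```python
from pathlib import Path

TRACEBACK_MARKER = "Traceback (most recent call last):"


def find_tracebacks(log_text, context_after=15):
    """Return (line_number, excerpt) pairs for each Python traceback header found."""
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if TRACEBACK_MARKER in line:
            # Keep the header plus the next few lines as context.
            excerpt = "\n".join(lines[index:index + context_after])
            hits.append((index + 1, excerpt))  # 1-based line number
    return hits


if __name__ == "__main__":
    log_path = Path("sagemaker.log")  # placeholder: wherever the output was captured
    if log_path.exists():
        for line_number, excerpt in find_tracebacks(log_path.read_text(errors="replace")):
            print(f"--- traceback at line {line_number} ---")
            print(excerpt)
    else:
        print(f"{log_path} not found; capture the container output first")
```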

joezen777 avatar joezen777 commented on August 17, 2024

Beats me what's causing my error now. Looks like it's a temp folder creation issue but it's got all the permissions it needs. I still have everything working fine on one computer though.

    S03734-MBPR:rl_coach josephn$ ipython rl_deepracer_coach_robomaker.py
    Looking for config file: /Users/josephn/.sagemaker/config.yaml
    Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
    Uploading to s3://bucket/rl-deepracer-sagemaker
    WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
    s3.ServiceResource()
    Using provided s3_client
    INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
    Starting training job
    ...

    FileNotFoundError: [Errno 2] No such file or directory: '/Users/josephn/Documents/deepracer/robo/container/tmpz16mdufb'

crr0004 avatar crr0004 commented on August 17, 2024

joezen777 avatar joezen777 commented on August 17, 2024

Yeah, the error was not enough sleep. I had created "containers" instead of "container". Fortunately, I am happy to say that everything is working after deleting all the repo folders and re-running through the steps. The only thing I didn't test before removing and re-setting up was whether some of my changes to the hyperparameters had screwed things up. I'll also say that it took a long time (a couple of minutes) before the policy training started, and it ran the epochs much slower than on my other computer. I changed my CPUs for Docker from 8 to 6 after this, and that made VNC run a little better but little else. Either way, it's a success. Thanks again. If you're ever in NY before AWS NY Summit, hit me up and I'll show you the Slalom office in the new World Trade Center. (I'm building a custom track in there as well.)
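
A mismatch like this is easy to catch with a quick pre-flight check before starting a job. The sketch below assumes the layout implied by the traceback above (a `robo/container` directory under the project root); the function name and the near-miss hint are illustrative and not part of the repo.

```python
import difflib
from pathlib import Path


def check_required_dir(project_root, relative_dir="robo/container"):
    """Return (ok, message) describing whether the expected directory exists."""
    expected = Path(project_root) / relative_dir
    if expected.is_dir():
        return True, f"found {expected}"
    # Look for similarly named siblings, e.g. 'containers' instead of 'container'.
    parent = expected.parent
    siblings = [p.name for p in parent.iterdir() if p.is_dir()] if parent.is_dir() else []
    close = difflib.get_close_matches(expected.name, siblings, n=1)
    hint = f"; found similarly named '{close[0]}'" if close else ""
    return False, f"missing {expected}{hint}"


if __name__ == "__main__":
    # Assumes the script is run from the project root.
    ok, message = check_required_dir(Path.cwd())
    print(message)
```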

Nicolas-Kuhl avatar Nicolas-Kuhl commented on August 17, 2024

Finally got round to sorting mine out. It seems that increasing the max memory / CPU in Docker preferences fixed it; I'm not getting that particular error anymore.

crr0004 avatar crr0004 commented on August 17, 2024

@joezen777 @Nicolas-Kuhl are we happy to keep this closed? I am closing this, and I can re-open it if we need to.
