aws-deepracer-community / deepracer-core Goto Github PK

A repository binding together everything needed for DeepRacer local.

Python 98.01% Shell 1.49% Dockerfile 0.50%

deepracer-core's Introduction

This repository is archived as all needed code is in the Simapp/Robomaker repository

DeepRacer Core

The DeepRacer Core repository is a utility that is pulling together the different components required for DeepRacer local training. It is not meant for direct usage. If you are looking for the end-user interface to run training locally please go to Deepracer-for-Cloud

Main Components

The primary components of DeepRacer are four docker containers:

Robomaker Container: Responsible for the robotics environment. Based on ROS + Gazebo as well as the AWS provided "Bundle". Uses components of AWS Robomaker
Sagemaker Container: Responsible for training the neural network. Uses components of AWS Sagemaker
Reinforcement Learning (RL) Coach: Responsible for preparing and starting the Sagemaker environment.
Log-Analysis: Providing a containerized Jupyter Notebook for analyzing the logfiles generated. Uses Deepracer Utils.

Building

The different sub-modules will include build.sh scripts.

Requirements:

Scripts made assuming a Ubuntu 20.04 installation.
New version of Docker (~version 24), including BuildX (docker-buildx-plugin)

Builds

The built Docker Containers can be found on Docker Hub:

deepracer-core's People

Contributors

Stargazers

Watchers

Forkers

sd3z 10antz22 nicklovehub sctse999 ratismal minjimwu nicolas-kuhl vincentpham1991 joezen777 noindyfikator aki-zhangway breadcentric erupare kebin8 michael-equi jarrettj ttnaustin arcc-race yuexialiao08 yongaryz thimico kimsuhyuk t04glovern carl297r emsi oleey yogeshsharma0201 deepchatterjeevns ikeyasu anandprabhakar0507 rpidanny michael-schulze fluffymango go-baek kavymp amir22010 jochem725 onedudedesign hhamoudi sunshane6726 waynehorse zyc9 ricardofideles justinguese mayurmadnani gdssrao ammendonca poconnell7 mattcamp avain freedragon daj baeksangwon lacan82 richardfan1126 wesky93 vinitbipinjain maroussil maimuzo rishistyping djd4352 gantners davidv-training torrinlynn ronald-d-rogers kurtoid vamsirajendra ronanlee dankwartrustow binhtrantt tiavila hamzaboukraa romankoles crouffer yoonian ccortezb currycurrycurry hafidzdaud motoyama-lvgs joskid budavarapu ruparee elijahkalii goodnowcurr30 mjpatin1 alaomichael alexforks anyagaki my-machine-learning-projects-ct rnocom peterpanstechland plpang chikakorooney adam-aph freddie20 bryanbroussard-99 shamsherthind drbothen bohica-labs bhwi

deepracer-core's Issues

Allocator (GPU_0_bfc) ran out of memory

After the interaction 60, before the policy training, I received a OOM warning followed by a crash. Have tried several things but no success. Running on a GPU 1050 with 4GB, that works for most people. Does anyone knows a solution?

Debian 10 amd gpu - Import Error: librccl.so

I have attempted to get this running, but have run into an import error (librccl.so)

When I search for that file I find a few. Has anyone dealt with this issue before?

(sagemaker_venv) raphy@raphy:~/Projects/deepracer/rl_coach$ sudo find / -name librccl.so -exec file {} \; -exec printf "\n" \;
find: ‘/run/user/1000/gvfs’: Permission denied
/opt/rocm/rccl/lib/librccl.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped

/opt/rocm/lib/librccl.so: symbolic link to ../rccl/lib/librccl.so

/home/raphy/Projects/virtual_envs/tf_rocm/lib/python3.7/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, not stripped

/var/lib/docker/overlay2/b64506e1a7e7ffdcb7dd183e73cd5d79fcb5f5e3ce847c2103180f2cdd13bd5c/diff/usr/local/lib/python3.6/dist-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, not stripped

Error:

(sagemaker_venv) raphy@raphy:~/Projects/deepracer/rl_coach$ python rl`_deepracer_coach_robomaker.py 
Looking for config file: /home/raphy/.sagemaker/config.yaml
Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
Uploading to s3://bucket/rl-deepracer-sagemaker
WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
s3.ServiceResource()
Using provided s3_client
INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
Starting training job
Using /home/raphy/Projects/deepracer/containers for container temp files
Using /home/raphy/Projects/deepracer/containers for container temp files
Trying to launch image: crr0004/sagemaker-rl-tensorflow:amd
Creating tmp9whzr7mk_algo-1-g8xmx_1 ... done
Attaching to tmp9whzr7mk_algo-1-g8xmx_1
algo-1-g8xmx_1  | $1 is train
algo-1-g8xmx_1  | In train start.sh
algo-1-g8xmx_1  | Current host is "algo-1-g8xmx"
algo-1-g8xmx_1  | Compiling changehostname.c
algo-1-g8xmx_1  | Done Compiling changehostname.c
algo-1-g8xmx_1  | 21:C 07 Aug 2019 17:22:17.267 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
algo-1-g8xmx_1  | 21:C 07 Aug 2019 17:22:17.267 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=21, just started
algo-1-g8xmx_1  | 21:C 07 Aug 2019 17:22:17.267 # Configuration loaded
algo-1-g8xmx_1  |                 _._                                                  
algo-1-g8xmx_1  |            _.-``__ ''-._                                             
algo-1-g8xmx_1  |       _.-``    `.  `_.  ''-._           Redis 5.0.5 (00000000/0) 64 bit
algo-1-g8xmx_1  |   .-`` .-```.  ```\/    _.,_ ''-._                                   
algo-1-g8xmx_1  |  (    '      ,       .-`  | `,    )     Running in standalone mode
algo-1-g8xmx_1  |  |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
algo-1-g8xmx_1  |  |    `-._   `._    /     _.-'    |     PID: 21
algo-1-g8xmx_1  |   `-._    `-._  `-./  _.-'    _.-'                                   
algo-1-g8xmx_1  |  |`-._`-._    `-.__.-'    _.-'_.-'|                                  
algo-1-g8xmx_1  |  |    `-._`-._        _.-'_.-'    |           http://redis.io        
algo-1-g8xmx_1  |   `-._    `-._`-.__.-'_.-'    _.-'                                   
algo-1-g8xmx_1  |  |`-._`-._    `-.__.-'    _.-'_.-'|                                  
algo-1-g8xmx_1  |  |    `-._`-._        _.-'_.-'    |                                  
algo-1-g8xmx_1  |   `-._    `-._`-.__.-'_.-'    _.-'                                   
algo-1-g8xmx_1  |       `-._    `-.__.-'    _.-'                                       
algo-1-g8xmx_1  |           `-._        _.-'                                           
algo-1-g8xmx_1  |               `-.__.-'                                               
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | 21:M 07 Aug 2019 17:22:17.268 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
algo-1-g8xmx_1  | 21:M 07 Aug 2019 17:22:17.268 # Server initialized
algo-1-g8xmx_1  | 21:M 07 Aug 2019 17:22:17.268 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
algo-1-g8xmx_1  | 21:M 07 Aug 2019 17:22:17.268 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
algo-1-g8xmx_1  | 21:M 07 Aug 2019 17:22:17.268 * Ready to accept connections
algo-1-g8xmx_1  | 07/08/2019 17:22:17 passing arg to libvncserver: -rfbport
algo-1-g8xmx_1  | 07/08/2019 17:22:17 passing arg to libvncserver: 5800
algo-1-g8xmx_1  | 07/08/2019 17:22:17 x11vnc version: 0.9.13 lastmod: 2011-08-10  pid: 22
algo-1-g8xmx_1  | 07/08/2019 17:22:17 
algo-1-g8xmx_1  | 07/08/2019 17:22:17 wait_for_client: WAIT:0
algo-1-g8xmx_1  | 07/08/2019 17:22:17 
algo-1-g8xmx_1  | 07/08/2019 17:22:17 initialize_screen: fb_depth/fb_bpp/fb_Bpl 24/32/2560
algo-1-g8xmx_1  | 07/08/2019 17:22:17 
algo-1-g8xmx_1  | 07/08/2019 17:22:17 Listening for VNC connections on TCP port 5800
algo-1-g8xmx_1  | 07/08/2019 17:22:17 Listening for VNC connections on TCP6 port 5900
algo-1-g8xmx_1  | 07/08/2019 17:22:17 Listening also on IPv6 port 5800 (socket 6)
algo-1-g8xmx_1  | 07/08/2019 17:22:17 
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | The VNC desktop is:      ceca255b3de9:5800
algo-1-g8xmx_1  | 07/08/2019 17:22:17 possible alias:    ceca255b3de9::5800
algo-1-g8xmx_1  | PORT=5800
algo-1-g8xmx_1  | Reporting training FAILURE
algo-1-g8xmx_1  | framework error: 
algo-1-g8xmx_1  | Traceback (most recent call last):
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
algo-1-g8xmx_1  |     from tensorflow.python.pywrap_tensorflow_internal import *
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
algo-1-g8xmx_1  |     _pywrap_tensorflow_internal = swig_import_helper()
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
algo-1-g8xmx_1  |     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
algo-1-g8xmx_1  |   File "/usr/lib/python3.6/imp.py", line 243, in load_module
algo-1-g8xmx_1  |     return load_dynamic(name, filename, file)
algo-1-g8xmx_1  |   File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
algo-1-g8xmx_1  |     return _load(spec)
algo-1-g8xmx_1  | ImportError: librccl.so: cannot open shared object file: No such file or directory
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | During handling of the above exception, another exception occurred:
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | Traceback (most recent call last):
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 60, in train
algo-1-g8xmx_1  |     framework = importlib.import_module(framework_name)
algo-1-g8xmx_1  |   File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-g8xmx_1  |     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
algo-1-g8xmx_1  |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/training.py", line 24, in <module>
algo-1-g8xmx_1  |     import tensorflow as tf
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 28, in <module>
algo-1-g8xmx_1  |     from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
algo-1-g8xmx_1  |     from tensorflow.python import pywrap_tensorflow
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
algo-1-g8xmx_1  |     raise ImportError(msg)
algo-1-g8xmx_1  | ImportError: Traceback (most recent call last):
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
algo-1-g8xmx_1  |     from tensorflow.python.pywrap_tensorflow_internal import *
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
algo-1-g8xmx_1  |     _pywrap_tensorflow_internal = swig_import_helper()
algo-1-g8xmx_1  |   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
algo-1-g8xmx_1  |     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
algo-1-g8xmx_1  |   File "/usr/lib/python3.6/imp.py", line 243, in load_module
algo-1-g8xmx_1  |     return load_dynamic(name, filename, file)
algo-1-g8xmx_1  |   File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
algo-1-g8xmx_1  |     return _load(spec)
algo-1-g8xmx_1  | ImportError: librccl.so: cannot open shared object file: No such file or directory
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | 
algo-1-g8xmx_1  | Failed to load the native TensorFlow runtime.

Install script

Here is an install script I have written for Redhat/Centos :
https://gist.github.com/bhannebipro/70c090d9bebe6dc3a16947d99418c3ac.js

The runscript that goes with it
https://gist.github.com/bhannebipro/cabafcf9e6a13da1ab8b8369968e7c84.js

Installation on the Redhat 7.6 VM

You need a redhat 7.6 with a valid subscript it shoudl work also on centos.
The best is to have a linux user with sudo rights with NO PASSWD
From the linux user :
mkdir ~/dr (important for some config files)
cd ~/dr
copy the install-dr.sh and rundr.sh files
./install-dr.sh
It will
install some rpms docker, python, ....
do the git clone of the crr004 repo
install S3 sim minio and configure it
prep the python env for sagemaker to run
download container images

The directory structure is then
~/dr/deepracer # The content comming from crr0004
~/dr/data_minio # The S3 Data with custom_files (reward, action model)
~/dr/robo # Content of the robomaker containers created by crr0004 scripts
~/dr/utils # Some utility scripts

RUN a simulation

You will need multiple terminal to run simulation

TERMINAL1 -
cd ~/dr
./rundr.sh minio
Customize your reward functions in ~/dr/deepracer/custom_files
Copy them in the S2 minio bucket
cp -r ~/dr/deepracer/custom_files ~/dr/data_minio/custom_files
./rundr.sh sagemaker
The last message should be " saved intermediate frozen graph: rl-deepracer-sagemaker/model/model_0.pb"
TERMINAL2 -
cd ~/dr
./rundr.sh robomaker
TERMINAL3 -
vncviewer localhost:8080 &
You can navigate in the gazebo 3D windows to follow the pogress of your training
On the rviz application. You can add the first person view. By adding a windows Displays/add/By topic/zed/rb/.. select then image

Remarks

Please do it on a dirty/vm hosts as we install stuff and disable some security stuffs to make it work.
Ideally you should not have iptables and selinux as to be disabled.
I tried to setup Nvidia to use GPU but it is still really complex (fowllow the crr004 issue)
deepracer requires old version of python librairies there are some tricks in it
I patch env.sh to change the hostname function to something better for me. if you have trouble conecting to s3 minio storage it can be the cause of it.

Trying to launch image: crr0004/sagemaker-rl-tensorflow:nvidia, in docker logs, when I want it to run on console as I don't have GPU

Does the sagemaker docker image support GPU?

It looks like it doesn't support GPU. Or did I miss something?

rospy module not found

I am training on my mac and when I run the rl_deepracer_coach_robomaker.py file I get this error
import rospy
algo-1-txh1d_1 | ModuleNotFoundError: No module named 'rospy'
algo-1-txh1d_1 | 2019-08-20 11:12:29,077 sagemaker-containers ERROR ExecuteUserScriptError:

please help

Update robomaker is failing

I updated both docker images today to be able to use the new world.
Now I have the following problem:
After 1 training period SageMaker is at:
saved intermediate frozen graph: rl-deepracer-sagemaker/model/model_0.pb

Robomaker is stuck at:
reward: 123456

for several minutes and then shows this error:
Could not connect to the endpoint URL: "https://robomaker.us-east1.amazonaws.com/cancelSimulationJob"

With the old track everything was working

Diagram and explination of how everything sits together

As too much of how this all works is sitting in my head, I will be doing a diagram and explination of how this all works. There are several patches of AWS code and setup to get it all working, and it's not obvious how this all works.

I have a done an article on how DeepRacer works but not how this repo changes parts of it.

Any input on this is appreciated. Things like what is most confusing, what you'd like to know, etc.

ExecuteUserScriptError during training

Training lasted for 5x minutes and throw a ExecuteUserScriptError

algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=73, Total reward=127.31, Steps=5404, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=74, Total reward=124.84, Steps=5571, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=75, Total reward=106.79, Steps=5748, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=76, Total reward=59.27, Steps=5833, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=77, Total reward=74.12, Steps=5928, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=78, Total reward=8.67, Steps=5960, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=79, Total reward=11.85, Steps=6000, Training iteration=3
algo-1-71gs6_1  | Training> Name=main_level/agent, Worker=0, Episode=80, Total reward=10.17, Steps=6030, Training iteration=3
2019-06-06 10:44:23,909 sagemaker-containers ERROR    ExecuteUserScriptError: algo-1-71gs6_1  | Command "/usr/bin/python training_worker.py --RLCOACH_PRESET deepracer --aws_region us-east-1 --loss_type mean squared error --model_metadata_s3_key s3://bucket/custom_files/model_metadata.json --s3_bucket bucket --s3_prefix rl-deepracer-sagemaker"

I encountered this error on my mac and also ubuntu. Anyone encountered the same issue?

Cannot train my pretrained model

https://i.imgur.com/v5SvEm6.png
I can load the pretrained model by ./start.sh. However, the reward seems being initialized. And the actions it does also seems being initialized.

Unknown runtime specified nvidia

I get the following error when trying to run rl_deepracer_coach_robomaker.py

2019-08-06T09:38:17.946603612Z WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
2019-08-06T09:38:17.993073120Z INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
2019-08-06T09:38:18.643588449Z Looking for config file: /root/.sagemaker/config.yaml
2019-08-06T09:38:18.643611071Z Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
2019-08-06T09:38:18.643616677Z Uploading to s3://bucket/rl-deepracer-sagemaker
2019-08-06T09:38:18.643620927Z s3.ServiceResource()
2019-08-06T09:38:18.643624895Z Using provided s3_client
2019-08-06T09:38:18.643628903Z Starting training job
2019-08-06T09:38:18.643632703Z Using /robo/container for container temp files
2019-08-06T09:38:18.643636581Z Using /robo/container for container temp files
2019-08-06T09:38:18.643640282Z Trying to launch image: crr0004/sagemaker-rl-tensorflow:nvidia
Creating tmpoe1yfsfl_algo-1-myjaq_1 ... error
2019-08-06T09:38:18.643648237Z 
2019-08-06T09:38:18.643652975Z ERROR: for tmpoe1yfsfl_algo-1-myjaq_1  Cannot create container for service algo-1-m yjaq: Unknown runtime specified nvidia
2019-08-06T09:38:18.643657185Z 
2019-08-06T09:38:18.643660989Z ERROR: for algo-1-myjaq  Cannot create container for service algo-1-myjaq: Unknown runtime specified nvidia
2019-08-06T09:38:18.643665071Z Encountered errors while bringing up the project.

As far as I understand the problem, error arises when RLEstimator() tries to launch image crr0004/sagemaker-rl-tensorflow:nvidia with flag --runtime nvidia that was needed for nvidia-docker2 and is not needed anymore for nvidia-container-runtime as explained here.

It might be the problem lies with the sagemaker and not this project, but I couldn't find the part of the code that actually tries to launch the image, so I'm asking for some direction where to submit an issue if this project is not the right place.

Seems cannot train pretrained model

these are the values of the reward supposed to be
https://i.imgur.com/MPAmU9M.png

what I get when I tried to retrain a model
https://i.imgur.com/GU1Tv7U.png

I've uncommented those two lines
https://i.imgur.com/jyTutI7.png

Besides the reward, the actions in retraining also look like random. I think I just started a new training instead of retraining a pretrained model? How to solve it? Thank you.

Can't run AWS RoboMaker Sample Application Locally: No module named envs.local_worker

I am trying to run RoboMaker locally with the command roslaunch deepracer_simulation local_training.launch. I have set everything up as in the tutorial and made sure that the proper environment variables were set. However, I received this error when trying to launch the file +python3 -m envs.local_worker /usr/bin/python3: No module named envs.local_worker which causes the whole launch to come down. I have verified that all other nodes are launching correctly.

Any help/guidance is appreciated.

Thanks,
Michael

Sagemaker dependencies

Hi,

Good day.

Could you add to the Sagemaker section?:

pip install urllib3==1.24.3
pip install PyYAML==3.13
pip install ipython

From https://medium.com/@jonathantse/train-deepracer-model-locally-with-gpu-support-29cce0bdb0f9.

Keep up the good work!

Forcing CPU image, meant to be using GPU image

Hi Chris,

Your recent updates has introduced the following line to rl_coach/rl_deepracer_coach_robomaker.py :

image_name="crr0004/sagemaker-rl-tensorflow:console",

This obviously forces the build to use your SM image without GPU support. Took me a while to realise why my training had slowed down, and why the console output was showing GPUS=0. I've now commented that line out, and it's reverted to using my GPU-enabled image, now showing GPUS=1.

Evaluation fails with an error "The security token included in the request is invalid"

botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the Cancel Simulation Job operation: The security token included in the request is invalid.

gnome -x deprecated

When I run the ./start.sh in training folder it shows that -x is deprecated and should be replace with --, however when I change it the result is that a new terminal is open and "sh:1 !!: is not found" message is shown.

May I know how to resolve this? also what is the -x command?

Why rl_coach/src/robomaker is renamed to robomaker_bck?

Is this a mistake?

SageMaker network

Before running the SageMaker container you have to create a network:
docker network create sagemaker-local

Robomaker Container Size Gets Large

After running the training for extended periods of time, the robomaker container grows in size very quickly. After 8 hours it has increased 60GB. It could be easy to run out of space on smaller drives at this rate. Is there any way to optimize this?

Do we install rl_coach? If so where?

Sorry. Just the second sentence in and I'm confused already.

What does this mean?

The rl_coach code comes from https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning/rl_deepracer_robomaker_coach_gazebo

Are we supposed to download/install/clone it?

Issue when progress at 100%

Hello I have the following error when progress reach 100
An error occurred (UnrecognizedClientException) when calling the CancelSimulationJob operation: The security token included in the request is invalid.

Is it an issue or am I doing something wrong?

File "/usr/local/lib/python3.5/dist-packages/rl_coach/environments/gym_environment.py", line 448, in _take_action
self.state, self.reward, self.done, self.info = self.env.step(action)
File "/usr/local/lib/python3.5/dist-packages/gym/wrappers/time_limit.py", line 31, in step
observation, reward, done, info = self.env.step(action)
File "/app/robomaker-deepracer/simulation_ws/install/sagemaker_rl_agent/lib/python3.5/site-packages/markov/environments/deepracer_racetrack_env.py", line 567, in step
return super().step([self.steering_angle, self.speed])
File "/app/robomaker-deepracer/simulation_ws/install/sagemaker_rl_agent/lib/python3.5/site-packages/markov/environments/deepracer_racetrack_env.py", line 271, in step
self.infer_reward_state(self.steering_angle, self.speed)
File "/app/robomaker-deepracer/simulation_ws/install/sagemaker_rl_agent/lib/python3.5/site-packages/markov/environments/deepracer_racetrack_env.py", line 436, in infer_reward_state
self.finish_episode(current_progress)
File "/app/robomaker-deepracer/simulation_ws/install/sagemaker_rl_agent/lib/python3.5/site-packages/markov/environments/deepracer_racetrack_env.py", line 472, in finish_episode
self.cancel_simulation_job()
File "/app/robomaker-deepracer/simulation_ws/install/sagemaker_rl_agent/lib/python3.5/site-packages/markov/environments/deepracer_racetrack_env.py", line 519, in cancel_simulation_job
job=self.simulation_job_arn
File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the CancelSimulationJob operation: The security token included in the request is invalid.
^Croot@684b9a4f45ef:/app/robomaker-deepracer/simulation_ws# [agent-9] killing on exit

sed: can't read /changehostname.c: No such file or directory

Thanks @crr0004 for this great project!

I'm having trouble setting everything up, the first error I get is about the compilation of changehostname.c inside the container. It seems that the script sagemaker-rl-container/lib/start.sh uses an absolute path for /changehostname.c which is then not found:

https://github.com/crr0004/sagemaker-rl-container/blob/c049dc929b0445b5298d98ae758f7d6e85dccff7/lib/start.sh#L18

$ python rl_deepracer_coach_robomaker.py 
Looking for config file: /home/stwirth/.sagemaker/config.yaml
Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
Uploading to s3://bucket/rl-deepracer-sagemaker
WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
s3.ServiceResource()
Using provided s3_client
INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
Starting training job
Using /home/stwirth/deepracer/container for container temp files
Using /home/stwirth/deepracer/container for container temp files
Trying to launch image: crr0004/sagemaker-rl-tensorflow:nvidia
Creating tmplzh0omx5_algo-1-jvq0t_1 ... done
Attaching to tmplzh0omx5_algo-1-jvq0t_1
algo-1-jvq0t_1  | $1 is train
algo-1-jvq0t_1  | In train start.sh
algo-1-jvq0t_1  | Current host is "algo-1-jvq0t"
algo-1-jvq0t_1  | sed: can't read /changehostname.c: No such file or directory
algo-1-jvq0t_1  | Compiling changehostname.c
algo-1-jvq0t_1  | gcc: error: /changehostname.c: No such file or directory
algo-1-jvq0t_1  | gcc: fatal error: no input files
algo-1-jvq0t_1  | compilation terminated.
algo-1-jvq0t_1  | gcc: error: /changehostname.o: No such file or directory
algo-1-jvq0t_1  | Done Compiling changehostname.c
algo-1-jvq0t_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

env.sh conflicts on OSX

Hi,

When running the source rl_coach/env.sh on a mac, the following errors are thrown:

hostname: illegal option -- i
usage: hostname [-fs] [name-of-host]
readlink: illegal option -- f
usage: readlink [-n] [file ...]

Looking around the web, readlink on OSX isn't standard (naturally). But a port of it can be installed by:

brew install coreutils

That then gives a greadlink that works in the same way as on other platforms (i.e. it supports the -f option).

I think the -i on hostname is trying to get the IP address of the local machine? On OSX that can be got with:

ipconfig getifaddr en0

But that's perhaps a little brittle as you need to specify which network interface to check. The following command is a little less brittle, but returns multiple IPs (for each interface):

ifconfig | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}'

Or perhaps just limit the grep to give a single result:

ifconfig | grep "inet " | grep -m 1 -Fv 127.0.0.1 | awk '{print $2}'

I just tested that with my wifi + LAN on/off and it did detect the switch over correctly (i.e. always giving me an active IP). So maybe that's the solution?

Split the README apart into the wiki

After training questions

Question on doing a simulation in official AWS

After a training in the local bucket we have
model_1.pb files
model_2.pb files
model_3.pb files
1_Step.*ckpt.meta
1_Step.*ckpt.index
1_Step.*data-MMMMM
2_Step.*ckpt.meta
2_Step.*ckpt.index
2_Step.*data-MMMMM
3_Step.*ckpt.meta
3_Step.*ckpt.index
3_Step.data-MMMMM
model_metadata.json
Do we have to push model_ files?
Do we have to push the episode history 1,2,3 or

only the last one (episode 3 in the example).
the best episode ?

Physical car transfer
To transfer our model to the physical car, can we do it based on the model_..pb file ? the last one ? the best episode ?
Is there a launch script in robomaker container ?

Local eavluation
Can we evaluate the model locally ?
Is there a launch script in robomaker container ?

Support for pc without GPU

Hi, it is great to see that people are able to train deepracer model offline and save money.
Issue : My pc doesnot have external graphics card. It does have Integrated Graphics Card
I am attaching image below containing specs. I have 8gb Graphics card and i3 processor, 2.0Ghz cycles. Unfortunately i have used up all free tier privileges and would like to do things offline.

Steps to upload a locally trained model for evaluation on DeepRacer console

I tried to copy the model.pb into the model.tar.gz downloaded from DeepRacer console and re-upload it to S3. The evaluation job failed. What are the steps to upload a locally trained model for evaluation on DR console?

init() got an unexpected keyword argument 's3_client'

I do exactly what you say but when it come to run the rl_deepracer console it shows this error.

SageMaker config.yaml file instructions

To start SageMaker, the config.yaml file needs to edited.

config.yaml needs to be changed such that the directory is somewhere where you are fine with having the temp files that Sagemaker creates - relative to where ipython is run from.

Deepracer worlds doesn't seem to exist?

Can't find it anywhere but is listed as one of your repos - is it private?

Can't download custom_files from s3_bucket

Have been running into this problem when trying to start the simulation on OSX. For some reason I cant download the custom_files from the s3 bucket. Any clues?

process[agent-9]: started with pid [1220]
+ export PYTHONUNBUFFERED=1
+ PYTHONUNBUFFERED=1
+ python3 -m markov.rollout_worker
Initializing SageS3Client...
S3 bucket: bucket
S3 prefix: rl-deepracer-sagemaker
{"simapp_exception": {"version": "1.0", "date": "2019-08-14 15:29:42.100910", "function": "download_file", "message": "Exception [Could not connect to the endpoint URL: \"http://172.18.0.1:9000/bucket/custom_files/model_metadata.json\"] occcured on download file-./custom_files/model_metadata.json from s3 bucket-bucket key-custom_files/model_metadata.json", "errorCode": "401", "exceptionType": "s3_datastore.exceptions", "eventType": "user_error"}}
Could not download custom model metadata from custom_files/model_metadata.json, using defaults.
Loaded default action space.
{"simapp_exception": {"version": "1.0", "date": "2019-08-14 15:29:49.831050", "function": "download_file", "message": "Exception [Could not connect to the endpoint URL: \"http://172.18.0.1:9000/bucket/custom_files/reward.py\"] occcured on download file-./custom_files/customer_reward_function.py from s3 bucket-bucket key-custom_files/reward.py", "errorCode": "401", "exceptionType": "s3_datastore.exceptions", "eventType": "user_error"}}
{"simapp_exception": {"version": "1.0", "date": "2019-08-14 15:29:49.840151", "function": "download_customer_reward_function", "message": "Could not download the customer reward function file. Job failed!", "errorCode": "503", "exceptionType": "simulation_worker.exceptions", "eventType": "system_error"}}
NoneType
[WARN] [1565796590.379974, 0.000000]: Controller Spawner couldn't find the expected controller_manager ROS interface.
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 1220, exit code 1, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/3db0388c-bea8-11e9-8048-0242ac120003/agent-9.log].
log file: /root/.ros/log/3db0388c-bea8-11e9-8048-0242ac120003/agent-9*.log
Initiating shutdown!
================================================================================
[agent-9] killing on exit
[better_odom-8] killing on exit
[car_reset_node-7] killing on exit
[robot_state_publisher-6] killing on exit
[racecar/controller_manager-5] killing on exit
[racecar_spawn-4] killing on exit
[gazebo_gui-3] killing on exit
[gazebo-2] killing on exit
Traceback (most recent call last):
  File "/opt/ros/kinetic/lib/gazebo_ros/spawn_model", line 313, in <module>
    sm.callSpawnService()
  File "/opt/ros/kinetic/lib/gazebo_ros/spawn_model", line 271, in callSpawnService
    initial_pose, self.reference_frame, self.gazebo_namespace)
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/gazebo_ros/gazebo_interface.py", line 28, in spawn_urdf_model_client
    rospy.wait_for_service(gazebo_namespace+'/spawn_urdf_model')
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 159, in wait_for_service
    raise ROSInterruptException("rospy shutdown")
rospy.exceptions.ROSInterruptException: rospy shutdown
Unhandled exception in thread started by 
sys.excepthook is missing
lost sys.stderr
Traceback (most recent call last):
  File "/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/car_node.py", line 133, in <module>
    DEEPRACER = DeepRacer()
  File "/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/car_node.py", line 44, in __init__
    rospy.wait_for_service('/gazebo/set_model_state')
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 159, in wait_for_service
    raise ROSInterruptException("rospy shutdown")
rospy.exceptions.ROSInterruptException: rospy shutdown
[rosout-1] killing on exit
[master] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done

Could not connect to the endpoint URL: \"http://127.0.0.1:9000/bucket/custom_files/model_metadata.json\"]

Hi when I run :

docker run --rm --name dr --env-file ./robomaker.env --network sagemaker-local -p 8080:5900 -it crr0004/deepracer_robomaker:console

I got below exception that it could not connect with : http://127.0.0.1:9000/bucket/custom_files/model_metadata.json; however, i already created bucket and upload custom_files there.

Error:

{"simapp_exception": {"version": "1.0", "date": "2019-09-13 01:42:18.766586", "function": "download_file", "message": "Exception [Could not connect to the endpoint URL: \"http://127.0.0.1:9000/bucket/custom_files/model_metadata.json\"] occcured on download file-./custom_files/model_metadata.json from s3 bucket-bucket key-custom_files/model_metadata.json", "errorCode": "401", "exceptionType": "s3_datastore.exceptions", "eventType": "user_error"}}
Could not download custom model metadata from custom_files/model_metadata.json, using defaults.
Loaded default action space.
{"simapp_exception": {"version": "1.0", "date": "2019-09-13 01:42:27.198120", "function": "download_file", "message": "Exception [Could not connect to the endpoint URL: \"http://127.0.0.1:9000/bucket/custom_files/reward.py\"] occcured on download file-./custom_files/customer_reward_function.py from s3 bucket-bucket key-custom_files/reward.py", "errorCode": "401", "exceptionType": "s3_datastore.exceptions", "eventType": "user_error"}}
{"simapp_exception": {"version": "1.0", "date": "2019-09-13 01:42:27.200555", "function": "download_customer_reward_function", "message": "Could not download the customer reward function file. Job failed!", "errorCode": "503", "exceptionType": "simulation_worker.exceptions", "eventType": "system_error"}}
NoneType
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 1268, exit code 1, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/aad25568-d5c7-11e9-86d4-0242ac120003/agent-9.log].
log file: /root/.ros/log/aad25568-d5c7-11e9-86d4-0242ac120003/agent-9*.log
Initiating shutdown!
================================================================================
[better_odom-8] killing on exit
[agent-9] killing on exit
[car_reset_node-7] killing on exit
[robot_state_publisher-6] killing on exit[racecar/controller_manager-5] killing on exit
[INFO] [1568338948.208172, 14.682000]: Shutting down spawner. Stopping and unloading controllers...
[INFO] [1568338948.210402, 14.682000]: Stopping all controllers...

Thanks.

Illegal Instruction - `python3 -m markov.rollout_worker`

In trying to run the deepracer_robomaker:console docker image the whole process crashes inside docker.

Tried to debug by executing bash against the docker image.

cd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation
python3 -m markov.rollout_worker
Illegal instruction (core dumped)

Host is a Ubuntu 18.04.2 LTS on a VMWare ESXi stack.
16 Cores and 64GB RAM

Docker 18.06.1-ce
Python 3.6.8

Problems accessing minio S3 from container

Running

 $ python rl_deepracer_coach_robomaker.py

I get

algo-1-rqpqu_1  | botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:9000/bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz"
algo-1-rqpqu_1  | 
algo-1-rqpqu_1  | Could not connect to the endpoint URL: "http://127.0.0.1:9000/bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz"

However, when I copy and paste that URL into my browser I can see the bucket and its content just fine. So the problem seems to be that the container cannot access the bucket. Are there any network setup steps that I need to follow?

No tensorflow reported when trying to run nvidia image for sagemaker

Steps to reproduce:
I followed instructions in the readme, but instead of docker pull nabcrr/sagemaker-rl-tensorflow:console I did docker pull nabcrr/sagemaker-rl-tensorflow:nvidia and then tagged it as instructed. Before running (cd rl_coach; ipython rl_deepracer_coach_robomaker.py) I went to that file and commented out the line that Lonon mentioned in #17

Expected result:
When running (cd rl_coach; ipython rl_deepracer_coach_robomaker.py) my gpu is detected and training begins

Actual result:

algo-1-vrm2i_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-vrm2i_1  | Reporting training FAILURE
algo-1-vrm2i_1  | framework error:
algo-1-vrm2i_1  | Traceback (most recent call last):
algo-1-vrm2i_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 60, in train
algo-1-vrm2i_1  |     framework = importlib.import_module(framework_name)
algo-1-vrm2i_1  |   File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-vrm2i_1  |     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
algo-1-vrm2i_1  |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
algo-1-vrm2i_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/training.py", line 24, in <module>
algo-1-vrm2i_1  |     import tensorflow as tf
algo-1-vrm2i_1  | ModuleNotFoundError: No module named 'tensorflow'
algo-1-vrm2i_1  |
algo-1-vrm2i_1  | No module named 'tensorflow'

System info:
Ubuntu 18.04.2 LTS

$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
Mon Jun 17 22:24:56 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660M    Off  | 00000000:01:00.0 N/A |                  N/A |
| N/A   46C    P8    N/A /  N/A |    266MiB /  1999MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Will it work on Windows 10?

Any guides for Windows 10?

Cloning Git Repository

Having access error when trying to clone the repo.
git clone --recurse-submodules https://github.com/crr0004/deepracer.git
Cloning into 'deepracer'...
remote: Enumerating objects: 332, done.
remote: Counting objects: 100% (332/332), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 332 (delta 178), reused 319 (delta 170), pack-reused 0
Receiving objects: 100% (332/332), 203.89 KiB | 305.00 KiB/s, done.
Resolving deltas: 100% (178/178), done.
Submodule 'aws-robomaker-sample-application-deepracer' (https://github.com/crr0004/aws-robomaker-sample-application-deepracer) registered for path 'aws-robomaker-sample-application-deepracer'
Submodule 'deepracer_worlds' ([email protected]:crr0004/deepracer_worlds.git) registered for path 'deepracer_worlds'
Submodule 'sagemaker-containers' (https://github.com/crr0004/sagemaker-containers.git) registered for path 'sagemaker-containers'
Submodule 'sagemaker-python-sdk' (https://github.com/crr0004/sagemaker-python-sdk.git) registered for path 'sagemaker-python-sdk'
Submodule 'sagemaker-rl-container' (https://github.com/crr0004/sagemaker-rl-container) registered for path 'sagemaker-rl-container'
Submodule 'sagemaker-tensorflow-container' (https://github.com/crr0004/sagemaker-tensorflow-container) registered for path 'sagemaker-tensorflow-container'
Cloning into '/home/parallels/deepracer/aws-robomaker-sample-application-deepracer'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (115/115), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 904 (delta 59), reused 89 (delta 39), pack-reused 789
Receiving objects: 100% (904/904), 10.19 MiB | 2.36 MiB/s, done.
Resolving deltas: 100% (303/303), done.
Cloning into '/home/parallels/deepracer/deepracer_worlds'...
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of '[email protected]:crr0004/deepracer_worlds.git' into submodule path '/home/parallels/deepracer/deepracer_worlds' failed
Failed to clone 'deepracer_worlds'. Retry scheduled
Cloning into '/home/parallels/deepracer/sagemaker-containers'...
remote: Enumerating objects: 1711, done.
remote: Total 1711 (delta 0), reused 0 (delta 0), pack-reused 1711
Receiving objects: 100% (1711/1711), 547.29 KiB | 435.00 KiB/s, done.
Resolving deltas: 100% (1075/1075), done.
Cloning into '/home/parallels/deepracer/sagemaker-python-sdk'...
remote: Enumerating objects: 4673, done.
remote: Total 4673 (delta 0), reused 0 (delta 0), pack-reused 4673
Receiving objects: 100% (4673/4673), 52.55 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (3312/3312), done.
Cloning into '/home/parallels/deepracer/sagemaker-rl-container'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 206 (delta 8), reused 20 (delta 8), pack-reused 184
Receiving objects: 100% (206/206), 50.78 KiB | 175.00 KiB/s, done.
Resolving deltas: 100% (90/90), done.
Cloning into '/home/parallels/deepracer/sagemaker-tensorflow-container'...
remote: Enumerating objects: 1436, done.
remote: Total 1436 (delta 0), reused 0 (delta 0), pack-reused 1436
Receiving objects: 100% (1436/1436), 13.28 MiB | 2.63 MiB/s, done.
Resolving deltas: 100% (730/730), done.
Cloning into '/home/parallels/deepracer/deepracer_worlds'...
Warning: Permanently added the RSA host key for IP address '13.229.188.59' to the list of known hosts.
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

deepracer-for-dummies setup

https://github.com/alexschultz/deepracer-for-dummies

This project sets everything up needed to run local training via docker-compose on a Linux host with Nvidia/CUDA/CUDNN installed. The goal was to simplify the setup so all that is needed is to run the init.sh script. It checks out this repo (https://github.com/crr0004/deepracer) as well as the log analysis enhancements here (https://github.com/breadcentric/aws-deepracer-workshops)

Issue after updating to Docker 19.03

After updating to docker 19.03, like described in the Nvida-docker repo, running rl_deepracer_coach_robomaker.py gets stuck at "Using provided s3_client"

Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker Uploading to s3://bucket/rl-deepracer-sagemaker WARNING:sagemaker:Parameter image_nameis specified,toolkit, toolkit_version, frameworkare going to be ignored when choosing the image. s3.ServiceResource() Using provided s3_client

Tensorboard?

Is it possible to wire up tensorboard or something similar to monitor training in real-time?

exception while downloading checkpoint

Hi,

I've just attempted to get a local training environment set up using your instructions and images however I'm getting an error in ROS when trying to start the distributed training:

Received IP from SageMaker successfully: 172.18.0.2
Using custom preset file!
Waiting for checkpoint
Got exception while downloading checkpoint An error occurred (404) when calling the HeadObject operation: Not Found
Got exception while downloading checkpoint An error occurred (404) when calling the HeadObject operation: Not Found
Got exception while downloading checkpoint An error occurred (404) when calling the HeadObject operation: Not Found
Got exception while downloading checkpoint An error occurred (404) when calling the HeadObject operation: Not Found

Minio seems to be working ok and Sagemaker is running fine... It's also sitting there spamming the same message over and over:

algo-1-rrryt_1  | 64:M 23 Apr 2019 14:46:49.741 * Ready to accept connections
algo-1-rrryt_1  | Redis server started successfully!
algo-1-rrryt_1  | Uploaded IP address information to S3: 172.18.0.4
algo-1-rrryt_1  | Downloading pretrained model into ./pretrained_checkpoint from s3://bucket/rl-deepracer-pretrained
algo-1-rrryt_1  | Checkpoint not found. Waiting...
algo-1-rrryt_1  | Checkpoint not found. Waiting...
algo-1-rrryt_1  | Checkpoint not found. Waiting...
algo-1-rrryt_1  | Checkpoint not found. Waiting...

I can't see anything relating to checkpoints in the minio bucket at all but I'm not sure what file it's actually looking for.

It looks like maybe it's looking for a pretrained model in the bucket but that doesn't seem to exist... do I somehow need to create it?

Full logs from Sagemaker + ROS:
deepracer.log
sagemaker.log

Update wiki evaluation page.

Hi,

Good day.

Kudos for this work!

Add NUMBER_OF_TRIALS to command or it won't work. Thanks.

Regards.

Redis launches with Maxmemory 0

Redis is launching with Maxmemory of 0. Which tells Redis that it has unlimited memory to use. Eventually this will cause OOM error and cause the system to stop.

This will roughly hit machines with 8 gb of ram in 6 hours, and 24 hours for machines with 16GBs. memory usuage doubles roughly every 6 hours.

workarounds consist of setting a large page file, but taking a performance hit.

currently testing the command redis set maxmemory 5gb which limits it to 5gb. for my system that only has 8gbs

General usage issues

Do not know if these issues are universal. To promote automatability, I had to manually resolve the following issues:

Github submodule links bad and had to find links and clone individually. The deepracer worlds repo does not exist.
Had to create a bucket called 'bucket' in Minio
Bad urllib requirement <1.25 and >=1.21.1 but got 1.25.1
Had to pip install pandas
Had to pip install awscli
Got permission denied until sudo chmod 666 /var/run/docker.sock
Got an UnrecognizedClientException when calling the GetAuthorizationToken operation: The security token included in the request is invalid. The docker tag command should corrected to: docker tag nabcrr/sagemaker-rl-tensorflow:coach0.11-cpu-py3 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3
No instructions for config.yaml file for SageMaker as described in #3

This is great - a suggestion

Could I suggest that you integrate this with https://github.com/localstack/localstack, you'll get s3 and cloudwatch, lambda, etc. You'll also be adding sagemaker and robomaker to the project. Seems like a win/win!

When pretrained model is not found, sagemaker falls into an infinite silent loop

I mistakenly put my model in pretrained folder but outside the model subfolder. In such case an exception is caught silently and then a sleep is called only to retry the exact behaviour.
While I did not fix the issue, I added logging to make it verbose. I will try to upload a patch.

How to evaluate the model trained locally?

Can we run the evaluation locally or we need to upload the model back to DeepRacer console to evaluate it?

Starting position and yaw is not random

I find it incredibly difficult to train the car to only stay in the left lane. So I thought by modifying the reset_car function, it would help it start in new random spots and positions. However I seem to be modifying the wrong spot. Any clue where to plug this in to get random starting poses?

def getFuturePoint(x, y, radius, heading_radians):

        result = np.zeros(2)
        result[0] = radius * math.cos(heading_radians) + x
        result[1] = radius * math.sin(heading_radians) + y
        return result

    def reset_car(self, ndist, next_index):
        ''' Reset's the car on the track
            ndist - normalized track distance
            next_index - index of the next way point
        '''
        # Compute the starting position and heading
        start_point = self.center_line.interpolate(ndist, normalized=True)
        start_yaw = math.atan2(self.center_line.coords[next_index][1] - start_point.y,
                               self.center_line.coords[next_index][0] - start_point.x)
        random_angle = math.radians(np.random.normal(0.0,10.0))
        original_start_yaw = start_yaw
        start_yaw += random_angle 
        print("SOME RANDOM STARTING SECRET SAUCE")
        random_radius = np.random.normal(-0.02,0.07)
        offset_radians = original_start_yaw + math.radians(90)
        if random_radius > 0:
            offset_radians = original_start_yaw + math.radians(-90)
        random_radius = abs(random_radius)
        new_starting_point = getFuturePoint(start_point.x, start_point.y, random_radius, offset_radians)
        if start_yaw > math.pi:
            start_yaw -= math.pi * 2
        elif start_yaw <= math.pi:
            start_yaw += math.pi * 2
        start_quaternion = Rotation.from_euler('zyx', [start_yaw, 0, 0]).as_quat()

        # Construct the model state and send to Gazebo
        model_state = ModelState()
        model_state.model_name = 'racecar'
        model_state.pose.position.x = new_starting_point[0]
        model_state.pose.position.y = new_starting_point[1]

Update the wiki

As suggested in #28 I would like to suggest some modifications to the wiki. Unfortunately I cannot fork the wiki. While I could clone it, there is no way to raise a PR for it. Could you then add the following to the wiki pages?

Hyper parameters:

To override default hyperparameters, in `rl_deepracer_coach_robomaker.py` uncomment the fields that you wish to override and set their value.

Default values can be found in `sagemaker_graph_manager.py`

In Retraining a Model replace the first sentence with

To train based on an existing model, uncomment options
`pretrained_s3_bucket` and `pretrained_s3_prefix` in `rl_coach/rl_deepracer_coach_robomaker.py`.