Latest Update: 2022-11-29
Disclaimer: At the time of writing, I am employed by Amazon as an SSA (Specialist Solutions Architect) in Beijing, China. However, the notes, opinions, and thoughts on DeepRacer modeling shared here are my own, not those of my employer. In cases where I borrow ideas or methods from other people, I try to make that clear with appropriate references.
The idea to keep a DeepRacer journal comes from Scott Pletcher's awesome repo, where he documents some of the reward functions he has tried and the logic behind them.
I started by taking the (free) AWS DeepRacer: Driven By Reinforcement Learning course on AWS Skill Builder.
The course runs through the basics of setting up the physical DeepRacer car, using the DeepRacer console, and training and evaluating a model. It also explains what the model's hyperparameters are, and what parameters you can use in the reward function to reward (or punish) your car for taking certain actions.
During the course, I used a few of the example reward functions provided in the DeepRacer documentation and trained the car on the Jennens Family Speedway track, typically for 1 hour at a time.
After playing with some of the provided functions, I learned that:
- Complicated reward functions aren't always better (in fact, making a lot of assumptions about what the car "should" be doing seems to be a bad thing)
- Some reward functions can be trained more quickly than others (in general, following the centerline leads to fast convergence)
With these observations in mind, I opted to start with simpler reward functions, only adding new pieces as necessary to coax the model towards desired behaviors it had not acquired on its own.
Model | Purpose |
---|---|
Follow the line | Just follow the centerline |
Don't wiggle | Follow the centerline, but penalize the car for steering angles > 15 degrees |
Stay on track | Reward the car based on its ability to stay on the track and reasonably close to the centerline |
All three of these models were trained on the Jennens Family Speedway track: none of them achieved a time under 1 minute, and only the Don't Wiggle model managed to make it around the track without any resets.
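For reference, here is a minimal sketch of what a "Don't wiggle" style reward function might look like. It is modeled on the example functions in the DeepRacer documentation and is not the exact code behind the results above. The band widths (10%, 25%, and 50% of the track width) and the 0.8 steering penalty are assumptions; `track_width`, `distance_from_center`, and `steering_angle` are input parameters that DeepRacer passes to the reward function.

```python
def reward_function(params):
    """Reward staying near the centerline; penalize sharp steering."""
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    abs_steering = abs(params['steering_angle'])  # degrees

    # Assumed bands around the centerline (fractions of track width)
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely off track

    # "Don't wiggle": assumed penalty for steering angles > 15 degrees
    ABS_STEERING_THRESHOLD = 15.0
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
```

Dropping the steering block gives something close to a plain "Follow the line" function.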
The obvious next step was to try to get the car moving faster. To do this, I cloned my three models again, but with a modified reward function that penalized low speeds. Specifically, I penalized the car for speeds below 1 m/s, by reducing the reward by a factor of 0.8 (for the Don't wiggle model, the reward was reduced even further if the car's steering angle was > 15 degrees). I also updated the maximum allowed speed in the Action space settings, from 1 m/s to 2 m/s.
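As a rough sketch, the low speed penalty could be written as a small helper applied to whatever base reward the model already computes. The helper name `apply_low_speed_penalty` is hypothetical, and I am assuming "reducing by a factor of 0.8" means multiplying the reward by 0.8. The 1 m/s threshold comes from the description above, and `speed` is the car's speed in m/s as supplied in the reward function's input parameters.

```python
SPEED_THRESHOLD = 1.0   # m/s, threshold described above
LOW_SPEED_FACTOR = 0.8  # assumption: the reward is multiplied by 0.8


def apply_low_speed_penalty(reward, speed):
    """Scale the reward down when the car is slower than the threshold."""
    if speed < SPEED_THRESHOLD:
        reward *= LOW_SPEED_FACTOR
    return float(reward)


# Example: a base reward of 1.0 at two different speeds
slow = apply_low_speed_penalty(1.0, speed=0.5)  # penalized
fast = apply_low_speed_penalty(1.0, speed=1.5)  # unchanged
```

Because the penalty is multiplicative, it stacks with other multiplicative penalties, such as one for over-steering.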
Model | Purpose |
---|---|
Follow the line, fast | Same as Follow the line, but with a low speed penalty |
Don't wiggle, fast | Same as before (over-steering penalty), but with a low speed penalty added as well |
Stay on track, fast | Same as before, but with a low speed penalty |
The results were good: the car did speed up, and most models could complete the track in under a minute (under 40 seconds, in some cases).
Rather than punishing low speeds, I cloned my models again and tried directly adding a speed bonus to the reward, which grew as the car traveled faster. I also raised the car's maximum speed to 3 m/s.
The bonus looked something like this:
```python
# Give a bonus for high speeds
reward += speed / 3.0
```
I was expecting good results, so I was surprised when the models performed badly. Instead of speeding around the track, the models were flying off it! Perhaps the reward for speed was overwhelming the other rewards for staying on the track and/or staying near the center.
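A quick back-of-the-envelope check makes this hypothesis plausible, though it does not prove it. The sketch below assumes a banded centerline reward (1.0, 0.5, 0.1 by distance from center, as in the documentation's examples) plus the `speed / 3.0` bonus. The band values, the 1 m track width, and the two scenarios are illustrative assumptions, not measurements from my models.

```python
def centerline_reward(distance_from_center, track_width):
    """Assumed banded reward for staying near the centerline."""
    if distance_from_center <= 0.1 * track_width:
        return 1.0
    if distance_from_center <= 0.25 * track_width:
        return 0.5
    if distance_from_center <= 0.5 * track_width:
        return 0.1
    return 1e-3


def reward_with_speed_bonus(distance_from_center, track_width, speed):
    """Centerline reward plus the additive speed bonus."""
    return centerline_reward(distance_from_center, track_width) + speed / 3.0


# Slow but perfectly centered vs. fast but drifting off the centerline
slow_centered = reward_with_speed_bonus(0.0, 1.0, speed=1.0)
fast_drifting = reward_with_speed_bonus(0.2, 1.0, speed=3.0)
print(f"slow and centered: {slow_centered:.2f}")
print(f"fast and drifting: {fast_drifting:.2f}")
```

Under these assumptions the fast, drifting car earns more per step than the slow, centered one, so the model has little reason to slow down for a corner.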
I ended up throwing these models away to go back to the drawing board.
Since all the reward functions seemed to get better at navigating the track as I trained them more, I started to wonder: would any reward function work? What about a constant reward of 1? Or -1? What if I just fed the model a 0? What if I set the discount factor (a measure of how important future rewards are) to 0? I ran a few tests to try and tease out what would happen.
My previous experience told me that simpler models need more time to train, so I gave each of the models below a full 4 hours of training time.
Model | Purpose |
---|---|
Good dog | Constant reward of 1, regardless of what the car does |
Bad dog | Constant reward of -1, regardless of what the car does |
Existentialist dog | No reward of any kind, regardless of what the car does |
In the moment | Default centerline following model, but with discount factor set to 0 |
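
To see why these experiments might behave differently, it helps to look at the discounted return: the sum of rewards over an episode, with each later reward multiplied by one more factor of the discount. The sketch below is plain arithmetic, not DeepRacer code. The helper name `discounted_return`, the discount of 0.9, and the episode lengths are illustrative assumptions, and it assumes an episode ends when the car leaves the track.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the k-th future reward weighted by gamma**k."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


GAMMA = 0.9  # illustrative value only

# "Good dog": +1 per step, so surviving longer yields a higher return
good_short = discounted_return([1.0] * 5, GAMMA)
good_long = discounted_return([1.0] * 50, GAMMA)

# "Bad dog": -1 per step, so surviving longer yields a lower return
bad_short = discounted_return([-1.0] * 5, GAMMA)
bad_long = discounted_return([-1.0] * 50, GAMMA)

# "Existentialist dog": 0 per step, so every episode looks the same
nothing = discounted_return([0.0] * 50, GAMMA)

# "In the moment": with a discount factor of 0, only the first reward counts
myopic = discounted_return([1.0, 0.5, 0.1], 0.0)
```

In other words, even a constant reward of 1 carries some signal, since it implicitly rewards staying on the track longer, while a reward of 0 carries none.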