Description
For my first foray into reinforcement learning, I decided to try to solve the Peg solitaire puzzle. I have to admit that this has been a struggle, but in the end, I have come up with something that:
- works most of the time
- it doesn't always converge to a successful result
- is pretty simple
- I use a one-step temporal difference (TD) approach with a small neural network
However, I am still in the early stages of studying reinforcement learning so I shall continue to try to discover better ways to apply reinforcement learning to this problem.
Brute Force Approach
As a baseline, I wrote a naive bit of code to find a solution by brute force using a depth-first search - see bruteForce.mjs. This code is entirely deterministic. It finds the first solution after 3,730,510 episodes. On my MacBook Pro (Mid 2014), this takes about 5 minutes:
$ node console/bruteForce.mjs
...
solution: [7,13,0,4,20,11,22,29,32,0,30,32,40,3,45,40,66,47,50,5,35,58,8,65,57,74,41,72,74,63,68]
episodes: 3730510
elapsed time: 04:59
Reinforcement Learning Approach
Environment
I have implemented a SolitaireEnv
class with the following methods/getters:
- reset()
- returns:
observation
- returns:
- step(action)
- returns: [
observation
, reward, done]
- returns: [
- validActions()
- returns: array of valid actions
- done
- returns:
true
if no further moves are possible elsefalse
- returns:
- solved
- returns:
true
if the puzzle is solved elsefalse
- returns:
Actions
There are a total of 76 actions but only a subset of these are valid in any given state. For example, in the initial state, the following actions are valid:
- action 7 from 1,3 to 3,3 via 2,3
- action 31 from 3,1 to 3,3 via 3,2
- action 44 from 3,5 to 3,3 via 3,4
- action 68 from 5,3 to 3,3 via 4,3
Observation
An observation
is an array of 33 0
s/1
s - the board locations in reading order (left to right, top to bottom)
where a 0
represents an empty location and a 1
represents an occupied location.
- initial state:
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
- solved state:
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Reward
The value of the reward returned by step
is as follows:
- when the episode is not done yet:
- 0
- when the episode is done:
- the negative sum of manhattan distances (from the centre of the board) of each remaining piece
Thus, the biggest total reward at the end of an episode is 0 (when the puzzle is solved).
My thinking, regarding the design of the reward signal, is that we want to encourage:
- fewer remaining pieces
- remaining pieces to be closer to the centre
Neural Network
There are too many states for us to use a tabular implementation. Therefore, we will use a neural network to approximate the state value function.
The architecture of the neural network is as follows:
_________________________________________________________________
Layer (type) Output shape Param #
=================================================================
input-layer (Dense) [null,10] 340
_________________________________________________________________
output-layer (Dense) [null,1] 11
=================================================================
Total params: 351
Trainable params: 351
Non-trainable params: 0
_________________________________________________________________
There is one input value for each board location. There is a single output value which is interpreted as a state value.
Policy
The policy function takes a state and epsilon and returns an action. It uses an epsilon-greedy algorithm:
- with
epsilon
probability, it chooses a random valid action - with
1 - epsilon
probability, it chooses the best valid action
To choose the best valid action, it uses the neural network to predict the value of the next state that each valid action leads to. It then chooses the action that corresponds to the biggest next state value.
Training Loop
- Loop for
MAX_EPISODES
after which give up - For each episode:
- Reset the environment
- Keep stepping until the episode is done (no further moves are possible)
- For each step:
- Use the policy to choose an action
epsilon
decays fromEPSILON_START
toEPSILON_END
overEPSILON_DECAY_EPISODES
- Step the environment using the chosen action returning the new state, a reward and a done flag
- Calculate a new target value for the current state:
- If the episode is not done, target = reward + GAMMA * value of next state
- If the episode is done, target = reward
- Calculate the loss = MSE of difference between the current value of the current state and the new target value of the current state
- Use the optimizer to back propagate the loss
- Use the policy to choose an action
- Check whether to stop training (see next section)
Deciding when to stop training
I guess it depends on what we are trying to achieve.
If we just want to find a solution, then we can stop training as soon as solved
is true
.
If we keep track of the actions taken at each step, then will have our solution.
In this mode, training always seems to succeed. Here are some typical results:
Elapsed | Episodes | Successful ? |
---|---|---|
01:27 | 5099 | Yes |
01:28 | 5165 | Yes |
01:19 | 4551 | Yes |
00:56 | 3310 | Yes |
01:11 | 4164 | Yes |
Or, we may want to end up with a trained model that we can use to solve the puzzle when following a 100% greedy policy i.e. no random moves. In this mode, training doesn't always succeed. Here are some typical results:
Elapsed | Episodes | Successful ? |
---|---|---|
01:25 | 3730 | Yes |
01:01 | 2769 | Yes |
01:50 | 4776 | Yes |
02:32 | 6554 | Yes |
01:16 | 3374 | Yes |
These results compare favourably with the baseline results given above.
Running Locally
PUBLIC_URL=. npm start
Screenshots
Manual Play
Click a piece to select it. Only pieces with valid moves are selectable. When a piece is selected, it is highlighted along with its valid destinations. To actually make a move, click a valid destination.
Agent Play
Successful Training Run
Note that showing the chart during training slows down training considerably.
TODO
- Redraw board on window size change
- Redraw board on window orientation change
- Training view functionality:
- Add UI to allow configuration of hyperparameters
- Allow trained models to be saved (uploaded to server)
- Allow trained models to be given a name
- Store date/time too
- Store hyperparameters too
- Store training statistics too
- Agent Play view:
- Allow a named trained model to be chosen (once we have implemented saving of named trained models)
- Training algorithm improvements:
- Try improve the speed of training
- Try to improve the robustness of training
- Can we exploit the symmetry of the game ?
- Would be nice to produce a trained model that can handle a randomly chosen first move (not the case currently)
- Consider adding support for other board variants
- Would it be possible to train a single model that can handle different board variants ?
- At the very least, we would probably need to use a CNN to handle the different inputs