
official-stockfish / nnue-pytorch


Stockfish NNUE (Chess evaluation) trainer in Pytorch

License: GNU General Public License v3.0

Python 47.91% CMake 0.10% Batchfile 0.26% C++ 51.30% Shell 0.42%
chess deep-learning pytorch stockfish

nnue-pytorch's People

Contributors

cglemon, cj5716, ddobbelaere, dhbloo, disservin, fauziakram, glinscott, kennyfrc, linrock, naphthalin, niklasf, nodchip, nomoras, sergiovieri, sopel97, tomalard, uniqp, vondele


nnue-pytorch's Issues

training data format

Where can I find documentation for the binary format of the training data? My understanding is that it's just a bunch of (FEN, Result, Eval?) records, binary-compressed in some way that I have yet to understand.

I would like to change c-chess-cli's export format to produce it directly:
https://github.com/lucasart/c-chess-cli#sampling

UserWarning: this overload of addcmul_ is deprecated

I'm using Google Colab to train some nets. However, I get the following warning:

/content/drive/My Drive/nnue-pytorch/ranger.py:136: UserWarning: This overload of addcmul_ is deprecated:
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

It is not a big problem since I can train anyway, but maybe updating addcmul_ should be considered.

Is the training setup for the master network here

I don't know where else to ask, and this repository seems the most closely related. The main question is in the title; the motivation follows. I would treat source code as documentation only as a last resort, although it might also be the least noisy source; the problem is knowing where to look. If the questions below can be answered before the main one, that would be great.

Also, is there any documentation about the training setup of the master net: whether it uses SL or RL, and what kind of policy is used for self-learning, if that is what RL means for the master NNUE (can it also be called NNUE?). I ask in case there is no wiki in the master training repo.

I have a lot of non-programming questions about what the code actually does at a mathematical or algorithmic level.
If the master uses RL, then, leaving aside all the speed-up goodies of NNUE, my concerns are the input-output mapping, the self-play batch exploration process, and the involvement of a policy. I want to compare the master to lc0 not from an engine-competition point of view but from a machine-learning-framework point of view.

If it is SL, then the policy design question becomes one of database construction or selection, sub-sampling, and target definition for the global loss function. These questions are independent of how the training is implemented, which is why I would like to avoid using source code as documentation, if possible.

I am asking here because, on GitHub in the SF repo, the first code change introducing NNUE (2020, SF12) takes the existence of the master net as a premise and focuses on the transformers and their training with the master as an "oracle" over a huge position database that is not characterized there. (I would also like to know about those positions: are they related to the master training setup?) There is also talk there of highly imbalanced positions, from which I conclude that the training does not pre-select material-neutral positions when the database is set up.

Can anyone course-correct my line of inquiry here? Thank you very much in advance.

A less important question, in case it is easy to clarify:
If NNUE can also handle high material imbalance, why is it only used for neutral positions? I thought it was an accuracy-versus-speed compromise, because of the cost of an NNUE evaluation per position compared to the classical static eval. But the final eval printout in debug mode puzzles me, as it does not seem to be either the NNUE or the classical score printout, with tree search or not.

Training on a system with no GPU

Hello,

Thank you for creating a nice project for nnue training in pytorch!

I am trying to use your project to create a network for Igel. I wanted to ask whether it is possible to run the trainer in "CPU mode only", as I am renting some "bare metal hardware" which has powerful CPUs but no GPU. When I run the trainer on Ubuntu 20.04 I get this:

python train.py --smart-fen-skipping --random-fen-skipping 10 --batch-size 16384 --threads 8 /home/volodymyr/training/sharpen_data/total_759m_d12.bin /home/volodymyr/training/total_30m_d14.bin
Feature set: HalfKP^
Num real features: 41024
Num virtual features: 704
Num features: 41728
Training with /home/volodymyr/training/sharpen_data/total_759m_d12.bin validating with /home/volodymyr/training/total_30m_d14.bin
Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 10
limiting torch to 8 threads.
Using log dir logs/
/home/volodymyr/nnue-pytorch/env/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Traceback (most recent call last):
  File "train.py", line 105, in <module>
    main()
  File "train.py", line 93, in main
    main_device = trainer.root_device if trainer.root_gpu is None else 'cuda:' + str(trainer.root_gpu)
AttributeError: 'Trainer' object has no attribute 'root_device'

On some discussions on talkchess I saw some people were saying they managed to use the trainer in CPU mode, but no specifics on how.

I thought my request is common enough as other people may end up in the same situation as well, please let me know if it is possible.

Thank you very much,
With best regards,
Volodymyr

Detect vanishing gradients

It might be desirable to monitor/detect vanishing gradients during training. Note that I of course mean "stochastic gradient" here, as estimated by the training samples used in the current epoch (maybe the current batch size is too small to excite all king/piece positions, so preferably the mean or max abs over a window of multiple epochs).

This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3, see #53.

Note that with GC (gradient centralization), we cannot resort to investigating a mere difference of two checkpoints, as the centralized gradient by definition contains a contribution equal to the mean of all gradient vectors over all neurons of a layer (see equation (1) of https://arxiv.org/pdf/2004.01461v2).

As a "work-around", continued training without GC (use_gc=False in Ranger) on a checkpoint and then comparing/visualizing the difference between a later checkpoint should also do the trick I think.

See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019
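
As a rough illustration of the "max abs over a window" idea above, here is a minimal sketch in plain Python. The class name `GradWindow` and the flat list of per-weight absolute gradients are hypothetical; in the trainer the recorded values would presumably come from the gradient tensors (for example a per-row reduction of the input layer gradient), which is an assumption and not something the repository provides.

```python
from collections import deque


class GradWindow:
    """Keeps the last `window` snapshots of absolute gradient values."""

    def __init__(self, window=8):
        # Old snapshots fall out automatically once the window is full.
        self.history = deque(maxlen=window)

    def record(self, abs_grads):
        # abs_grads: one non-negative number per monitored weight (or weight group).
        self.history.append(list(abs_grads))

    def dead_indices(self, threshold=1e-12):
        # A weight counts as "dead" if its largest absolute gradient over the
        # whole window never exceeded the threshold.
        if not self.history:
            return []
        n = len(self.history[0])
        return [i for i in range(n)
                if max(snapshot[i] for snapshot in self.history) <= threshold]


window = GradWindow(window=4)
window.record([0.0, 0.1, 0.0])
window.record([0.0, 0.0, 0.2])
print(window.dead_indices())
```

Using the maximum over a window rather than a single batch avoids flagging weights that simply were not excited by one particular batch.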

Idea for reducing quantization error

This evening I had some fun improving the visualizer. I'll try to create a PR if I find the time for it.

One of the things I investigated are the weights of the fully-connected layers after the input layer. Also, as all hidden neurons are "arbitrarily ordered" (a random permutation leads to an equivalent net), I ordered the input weights by their L1-norm (sum of absolute values).

Here are the plots for master net nn-62ef826d1a6d.nnue (an offspring from sergiovieri's run) and nn-0f63c1539914.nnue (latest vdv net on fishtest).

[Figures: master, master_fc, master_hist, vdv, vdv_fc, vdv_hist]

Here's the main "observation": the (quantized) FC1 weights are sharply peaked around zero (even more so for the vdv net than for master). This is not the case for the FC2 weights (not shown here, but already "visually apparent" from the other figures). I suspect that the quantization error leads to some Elo loss (how much is the question :)).

Would it be a good idea to rescale the input weights such that the FC1 weights become higher in magnitude, such that their quantization error decreases? Denote the output of neuron j at layer i as $x^{(i)}_j$, the bias term $b_j$ and its input weights $a_k$. With a non-linear activation function $\sigma$, this leads to the familiar equation:

$$x^{(i)}_j = \sigma\Big(\sum_k a_k\, x^{(i-1)}_k + b_j\Big)$$

If sigma is a clamped ReLU and q_j>0, it holds that

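$$\sigma\Big(q_j \Big(\sum_k a_k\, x^{(i-1)}_k + b_j\Big)\Big) = q_j\, \sigma\Big(\sum_k a_k\, x^{(i-1)}_k + b_j\Big)$$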

if and only if sigma "is not clamping" (to the max. saturation value).

Maybe it's an idea to multiply the input weights and bias of neuron j by a factor q_j<1 and at the same time divide the FC1 weights connected to this neuron by q_j. The resulting unquantized net performance will be equivalent if no neuron clamping/saturation occurs, but the quantized net might have better performance.

Some clarifications:

  • The shown FC1 weights are 32×512 in size.
  • There is a q_j scaling factor we can "play with" for each (own, opponent) pair of columns (so 256 degrees of freedom in total).
  • It is visually apparent from the "vertical lines" that some FC1 weights already have an optimal dynamic range (so no real or only small q_j correction needed), but most have not.
  • This rescaling (in the other direction, so for q_j > 1) might also be a solution to prevent FC1 weights clipping.

So, in summary, I argue for a dynamic rescaling of both the input and FC1 weights such that the relative quantization error of the latter weights dramatically decreases (actually, the total relative quantization errors, also taking into account input weights, might be even better). This rescaling is "exact" (leads to an equivalent unquantized net) if no input layer neuron clamping occurs during play. It is expected that quantized net performance will be (much?) better in that case.
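
The claimed equivalence can be checked on a toy network. The sketch below is plain Python and only an illustration: the function names, the tiny weight values and the single scaling factor per hidden neuron are all assumptions (the real net has one factor per (own, opponent) pair of columns, as noted above). It shows that scaling the input weights and bias of each hidden neuron by q_j and dividing the matching FC1 weights by q_j leaves the output unchanged as long as the clamped ReLU does not saturate, and that the outputs diverge once it does.

```python
def crelu(x, hi=1.0):
    # Clamped ReLU: clip to the range [0, hi].
    return min(max(x, 0.0), hi)


def forward(in_w, in_b, fc1_w, features):
    # in_w[j][k] is the weight from input k to hidden neuron j.
    hidden = [crelu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(in_w, in_b)]
    # fc1_w[o][j] is the weight from hidden neuron j to output o.
    return [sum(w * h for w, h in zip(row, hidden)) for row in fc1_w]


def rescale(in_w, in_b, fc1_w, q):
    # Multiply input weights and bias of neuron j by q[j],
    # divide the FC1 weights attached to neuron j by q[j].
    new_in_w = [[w * qj for w in row] for row, qj in zip(in_w, q)]
    new_in_b = [b * qj for b, qj in zip(in_b, q)]
    new_fc1_w = [[w / qj for w, qj in zip(row, q)] for row in fc1_w]
    return new_in_w, new_in_b, new_fc1_w


in_w = [[0.5, -0.25], [0.125, 0.25]]
in_b = [0.125, 0.0]
fc1_w = [[0.0625, -0.03125]]
q = [0.5, 0.25]
scaled = rescale(in_w, in_b, fc1_w, q)

quiet = [1.0, 1.0]      # no hidden neuron saturates
loud = [4.0, 0.0]       # hidden neuron 0 saturates in the original net

print(forward(in_w, in_b, fc1_w, quiet), forward(*scaled, quiet))
print(forward(in_w, in_b, fc1_w, loud), forward(*scaled, loud))
```

Note that the rescaled FC1 weights are larger in magnitude (0.125 instead of 0.0625 and 0.03125), which is the property the proposal relies on to reduce the relative quantization error.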

train_data.bin?

To start with, I know this is a very starter-y question. But when I run the program it asks for train and val data, how/where do I get these?

error on startup: Trainer.root_gpu was deprecated

I have followed the setup steps in the README, and installed CuPy version 11.7. I am getting this error on startup:


Traceback (most recent call last):
  File "train.py", line 154, in <module>
    main()
  File "train.py", line 129, in main
    main_device = trainer.root_device if trainer.root_gpu is None else 'cuda:' + str(trainer.root_gpu)
  File "/home/jdart/.local/lib/python3.8/site-packages/pytorch_lightning/_graveyard/trainer.py", line 53, in _root_gpu
    raise AttributeError(
AttributeError: `Trainer.root_gpu` was deprecated in v1.6 and is no longer accessible as of v1.8. Please use `Trainer.strategy.root_device.index` instead.

System is Ubuntu 20.04
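
Following the hint in the error message, the device lookup could be written so that it does not touch `Trainer.root_gpu` at all. The helper below is a sketch, not the repository's code: the name `main_device` is made up, the fallback to `"cpu"` when no device is found is an assumption, and it only relies on the device object exposing `.type` and `.index`, as `torch.device` does.

```python
def main_device(trainer):
    # Newer PyTorch Lightning: the device lives on trainer.strategy.root_device.
    strategy = getattr(trainer, "strategy", None)
    root_device = getattr(strategy, "root_device", None)
    # Older versions may expose trainer.root_device directly.
    if root_device is None:
        root_device = getattr(trainer, "root_device", None)
    # Assumption: with no device information at all, train on the CPU.
    if root_device is None:
        return "cpu"
    if root_device.type == "cuda":
        index = root_device.index if root_device.index is not None else 0
        return f"cuda:{index}"
    return root_device.type
```

In `train.py` this would replace the line shown in the traceback, for example `main_device = main_device(trainer)` under a different local name.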

Pretrained checkpoints

Are there any pre-trained checkpoints available?
Preferably ones used by official stockfish release?

All the best,
Aðalsteinn

Piece heat maps discrepancies

I did some research on so-called "piece heat maps" (based on a hacked visualizer branch) and found something interesting. Define a piece heat map as the sum of all absolute values of all weights in the input layer that are connected to some (piece position, king position) pair.

Then we get the following heat maps for master/latest vdv on fishtest:

[Figures: master, vdv]

Note that the following two observations are much more pronounced for master:

  1. Little "energy" for own pieces on the square where the own king is located (left side).
  2. Little "energy" for other pieces on squares from which the own king can be put in check (only from adjacent squares for bishops, rooks and queens).

I suppose 2. is due to training data filtering (to yield only "quiet" positions where the king is not in check)? But why is it so much less pronounced for the vdv net?

Changes to model and compatibility with Stockfish

Seeking advice

I'm experimenting with various NNUE architectures, e.g. without psqt and layer stack buckets, and with a feature set other than the one currently implemented in Stockfish, while leaving the layer stack untouched and only changing the hyperparameters.

The goal is to benchmark the result with c-chess-cli (similarly to what is done in run_games.py).
I'm wondering if I'm missing some low-hanging-fruit way of doing that, other than making a considerable number of changes to Stockfish.

All the best,

cdecl warning

On Linux I get these warnings. Obviously I can define CDECL to an empty string, but I'm not sure whether that's correct, or on which systems we need the cdecl. Is that WIN32_ only?

[ 50%] Building CXX object CMakeFiles/training_data_loader.dir/training_data_loader.cpp.o
/home/vondele/chess/vondele/nnue-pytorch/training_data_loader.cpp:361:159: warning: ‘cdecl’ attribute ignored [-Wattributes]
  361 |     EXPORT Stream<SparseBatch>* CDECL create_sparse_batch_stream(const char* feature_set_c, int concurrency, const char* filename, int batch_size, bool cyclic)
      |                                                                                                                                                               ^
/home/vondele/chess/vondele/nnue-pytorch/training_data_loader.cpp:376:78: warning: ‘cdecl’ attribute ignored [-Wattributes]
  376 |     EXPORT void CDECL destroy_sparse_batch_stream(Stream<SparseBatch>* stream)
      |                                                                              ^
/home/vondele/chess/vondele/nnue-pytorch/training_data_loader.cpp:381:82: warning: ‘cdecl’ attribute ignored [-Wattributes]
  381 |     EXPORT SparseBatch* CDECL fetch_next_sparse_batch(Stream<SparseBatch>* stream)
      |                                                                                  ^
/home/vondele/chess/vondele/nnue-pytorch/training_data_loader.cpp:386:58: warning: ‘cdecl’ attribute ignored [-Wattributes]
  386 |     EXPORT void CDECL destroy_sparse_batch(SparseBatch* e)
      |                                                          ^
[100%] Linking CXX shared library libtraining_data_loader.so

"RuntimeError: shape '[41728, 256]' is invalid for input of size 10510996" when converting SF network

Hello,

Following the example from the main page, I am converting the latest, strongest SF net into a model using the command:

python serialize.py nn-62ef826d1a6d.nnue nn-62ef826d1a6d.pt

and it is failing with:

(env) C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch>python serialize.py nn-62ef826d1a6d.nnue nn-62ef826d1a6d.pt
Converting nn-62ef826d1a6d.nnue to nn-62ef826d1a6d.pt
Traceback (most recent call last):
  File "C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch\serialize.py", line 187, in <module>
    main()
  File "C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch\serialize.py", line 181, in main
    reader = NNUEReader(f, feature_set)
  File "C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch\serialize.py", line 112, in __init__
    self.read_feature_transformer(self.model.input)
  File "C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch\serialize.py", line 133, in read_feature_transformer
    weights = self.tensor(numpy.int16, layer.weight.shape[::-1])
  File "C:\Users\vshcherbyna\Documents\Tools\other\nnue-pytorch\serialize.py", line 127, in tensor
    d = d.reshape(shape)
RuntimeError: shape '[41728, 256]' is invalid for input of size 10510996

Any clues on the error?
Thank you,
With best regards,
Volodymyr
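
A quick arithmetic check of the numbers in the error message, as a sketch. The feature counts are taken from the trainer log quoted in another issue on this page (41024 real features, 704 virtual features, 41728 in total for HalfKP^); the interpretation that the .nnue file stores only the real features is an assumption suggested by the numbers, not something confirmed in this issue.

```python
REAL_FEATURES = 41024       # "Num real features" in the trainer log
VIRTUAL_FEATURES = 704      # "Num virtual features" in the trainer log
HIDDEN = 256                # second dimension of the requested shape
AVAILABLE = 10510996        # "input of size" from the RuntimeError

factorized = (REAL_FEATURES + VIRTUAL_FEATURES) * HIDDEN
plain = REAL_FEATURES * HIDDEN

# The factorized shape needs more values than the file has left,
# while the non-factorized shape fits.
print(factorized, factorized <= AVAILABLE)
print(plain, plain <= AVAILABLE)
```

If that reading is right, reading the net with the non-factorized feature set (the serializer accepts a `--features` option, as used in another issue on this page) would be the first thing to try.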

Where to find training data?

This is not really an issue, but still information that I have yet to find.
Where can I find suggestions on training data (train_data.bin val_data.bin) to train the model?

All the best,

Weird discrepancy in layer stats between this and nodchip learners.

I started adding some layer statistics to logging, and while doing this I noticed that the values are quite weird. Possible causes that I cannot confirm right now would be 1. different quantization, 2. different initialization.

For example these are the logs from the pytorch trainer and nodchip trainer (both near the beginning of the training):

layer_0:avg_abs_weight tensor(0.0029, device='cuda:0') -- nodchip 0.0136633 WAT
layer_0:avg_abs_bias tensor(0.0391, device='cuda:0') -- nodchip 0.363736 WAT
layer_0:clipped_pct tensor(0.2302, device='cuda:0') -- nodchip 0.4696 WAT
layer_1:avg_abs_weight tensor(0.0323, device='cuda:0') -- nodchip 0.0465173 OK
layer_1:avg_abs_bias tensor(0.0420, device='cuda:0') -- nodchip 0.488686 WAT
layer_2:clipped_pct tensor(0.3352, device='cuda:0') -- nodchip 0.6184 WAT
layer_3:avg_abs_weight tensor(0.1148, device='cuda:0') -- nodchip 0.161746 OK?
layer_3:avg_abs_bias tensor(0.0888, device='cuda:0') -- nodchip 0.520152 WAT
layer_4:clipped_pct tensor(0.2703, device='cuda:0') -- nodchip 0.8391 WAT
layer_5:avg_abs_weight tensor(0.2141, device='cuda:0') -- nodchip 0.442272 WAT
layer_5:avg_abs_bias tensor(0.0650, device='cuda:0') -- nodchip 1.21253 this varies like crazy anyway
Epoch 0:   8%|▌      | 999/12331 [02:51<32:24,  5.83it/s, loss=0.019, v_num=17]C

the values in both trainers are reproducible with very small variance over runs (apart from output bias)

Any possible explanations?

Stuff that showed good results for me but is not in master

Remove the python data loader?

Maybe we should consider removing the python data loader? It's not usable in practical terms, and it adds the burden of keeping it in sync with the C++ data loader.

Add license to repository

  • I would like to fork the repository and modify some code for my own experiments. Ideally, I would like to put the result of these experiments in a different Github repository.
  • Please add a license to make this valid. I suggest to use the GPL 3.0 license as for Stockfish.
  • Reason: "If a repository has no license, then all rights are reserved and it is not Open Source or Free. You cannot modify or redistribute this code without explicit permission from the copyright holder."

Exception when converting from .nnue to .pt with serialize.py "Exception: Expected: 7af32f20, got 6d74683c"

I have tried to convert the two most recent master nets, "nn-3475407dc199.nnue" and "nn-190f102a22c3.nnue" into .pt files with the two following commands:

 python3 serialize.py nn-3475407dc199.nnue nn-3475407dc199.pt --features=HalfKAv2

and

 python3 serialize.py nn-3475407dc199.nnue nn-3475407dc199.pt --features=HalfKAv2^

both failed with the following error:

Converting nn-3475407dc199.nnue to nn-3475407dc199.pt
Traceback (most recent call last):
  File "serialize.py", line 242, in <module>
    main()
  File "serialize.py", line 225, in main
    reader = NNUEReader(f, feature_set)
  File "serialize.py", line 139, in __init__
    self.read_header(feature_set, fc_hash)
  File "serialize.py", line 158, in read_header
    self.read_int32(VERSION) # version
  File "serialize.py", line 203, in read_int32
    raise Exception("Expected: %x, got %x" % (expected, v))
Exception: Expected: 7af32f20, got 6d74683c

When I try to convert .nnue files to .pt generated on my local machine, it is successful with --features=HalfKAv2. Here is a link to the .nnue file that worked for me: https://drive.google.com/file/d/1qBO_nEZEaucLlvdrhBkibPFbKv5ifHHR/view?usp=sharing

Option to stop training

Is there an option to automatically stop the training when the val loss has not improved over the last 5 or so epochs?
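
The logic being asked for is usually called patience-based early stopping. PyTorch Lightning ships an `EarlyStopping` callback with `monitor` and `patience` arguments that implements it; whether and how the trainer here wires it up is not covered in this issue. The plain-Python sketch below only illustrates the rule itself, with the hypothetical class name `PatienceStopper`.

```python
class PatienceStopper:
    """Signals a stop after `patience` epochs without a new best val loss."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        # An epoch only counts as an improvement if it beats the best
        # loss seen so far by more than min_delta.
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = PatienceStopper(patience=5)
for epoch, loss in enumerate([0.020, 0.019, 0.0195, 0.0191, 0.0192, 0.020, 0.021]):
    if stopper.update(loss):
        print("stop after epoch", epoch)
        break
```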

small syntax change in c-chess-cli

You just need to correct this line:
https://github.com/glinscott/nnue-pytorch/blob/master/run_games.py#L76

  • old syntax: -resign 3 700 -draw 8 10
  • new syntax: -resign count=3 score=700 -draw count=8 score=10

Note that you can use number=N in both the resign and draw adjudication rules. For example, -draw number=40 count=8 score=10 adds the constraint that a draw cannot be adjudicated before 40 moves have been played. This replicates cutechess-cli behavior (and extends it to -resign while at it).
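
For scripts that build the c-chess-cli command line, the new key=value syntax could be generated as sketched below. The helper name `adjudication_args` and its parameters are hypothetical and are not taken from run_games.py; only the option syntax comes from the description above.

```python
def adjudication_args(resign=(3, 700), draw=(8, 10), draw_number=None):
    """Build -resign/-draw arguments in the new key=value syntax.

    resign and draw are (count, score) pairs; draw_number is the optional
    minimum move number before a draw may be adjudicated.
    """
    args = ["-resign", f"count={resign[0]}", f"score={resign[1]}", "-draw"]
    if draw_number is not None:
        args.append(f"number={draw_number}")
    args += [f"count={draw[0]}", f"score={draw[1]}"]
    return args


print(" ".join(adjudication_args()))
print(" ".join(adjudication_args(draw_number=40)))
```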

Question about step in the console output

Input data

Training pos: 500m depth 5
Validation pos: 1m depth 10

Command line

python train.py --flush_logs_every_n_steps 1 --max_epochs 200 --smart-fen-skipping --batch-size 8192 --threads 4 --num-workers 4 --gpu 1 ncsf_2021-02_pos500000000_d5_train_tn20.binpack ncsf_2021-02_pos1000000_d10_val_tn10.binpack

Console output

Epoch 1:  39%|█████████▍              | 4846/12331 [05:59<09:15, 13.47it/s, loss=0.0136, v_num=1]

There are a total of 12331 steps. How is this step count related to the 500m training positions, minus the positions skipped by the --smart-fen-skipping flag?
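
One reading that happens to match the number, offered as an assumption rather than a confirmed answer: if an epoch is defined as a fixed number of positions (here assumed to be 100 million for training and 1 million for validation) instead of one pass over the file, then the step count depends only on the batch size and not on the file size or on skipping. The sketch below just does that arithmetic; the two position counts are assumptions.

```python
def steps_per_epoch(train_positions, val_positions, batch_size):
    # Integer ceiling division: a partially filled last batch still counts as a step.
    train_steps = -(-train_positions // batch_size)
    val_steps = -(-val_positions // batch_size)
    return train_steps + val_steps


# Assumed epoch sizes (not taken from the issue): 100M training, 1M validation positions.
print(steps_per_epoch(100_000_000, 1_000_000, 8192))
```

With a batch size of 8192 this gives 12208 training steps plus 123 validation steps, which is the 12331 shown in the progress bar.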

Support final LR annealing phase

See also official-stockfish/Stockfish#3274

The optimizer Ranger states the following in its documentation:

Best training results - use a 75% flat lr, then step down and run lower lr for 25%, or cosine descend last 25%.

Per extensive testing - It's important to note that simply running one learning rate the entire time will not produce optimal results.
Effectively Ranger will end up 'hovering' around the optimal zone, but can't descend into it unless it has some additional run time at a lower rate to drop down into the optimal valley.

Adding support for this to the trainer seems very interesting. See https://github.com/lessw2020/Ranger-Mish-ImageWoof-5/blob/b0aa73508870de072329d058f0add165da462d6d/train.py#L56 for a sample implementation.
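
The "75% flat, then cosine descend for the last 25%" recipe quoted above can be sketched as a plain function of the step number. This is an illustration only: the function name, the `final_lr` parameter and the choice of counting in steps rather than epochs are assumptions, and it is not the linked sample implementation.

```python
import math


def flat_then_cosine(step, total_steps, base_lr, flat_fraction=0.75, final_lr=0.0):
    """Learning rate at `step`: constant at first, then cosine-annealed."""
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return base_lr
    # Progress through the annealing phase, from 0.0 to 1.0.
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 74, 75, 88, 100):
    print(step, flat_then_cosine(step, 100, 1e-3))
```

The returned value could be fed to the optimizer directly or, divided by `base_lr`, used as the multiplier for a `torch.optim.lr_scheduler.LambdaLR` schedule.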

How to make the trainer work on CPU only mode now?

I found out that #87 managed to run the trainer in CPU-only mode.
However, I notice that at some point (0764091) a custom kernel was applied to the feature transformer,
which means more changes are needed to make it work on the CPU.

What should I do to the rawKernel code and other parts to make it work on the CPU?

Deprecation Warnings

I get the following deprecation warnings with the Pytorch nightly version. However, they seem to come from Pytorch Lightning, so not sure if the Pytorch version matters. I don't really understand the difference anyway. :D

[...]/env/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:396: LightningDeprecationWarning: Argument `period` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5. Please use `every_n_val_epochs` instead.
rank_zero_deprecation(
[...]/env/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:338: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
rank_zero_warn(
[...]/env/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:610: LightningDeprecationWarning: Relying on `self.log('val_loss', ...)` to set the ModelCheckpoint monitor is deprecated in v1.2 and will be removed in v1.4. Please, create your own `mc = ModelCheckpoint(monitor='your_monitor')` and use it as `Trainer(callbacks=[mc])`.

model.py issue

Do L1, L2, and L3 have any effect on the network? Can these values be modified at will?

TODO for nnue.md

  • some results regarding net size (https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.p7t1qfux64dc exps 84, 85, 86, 89)
  • about data used to train the nets (filtering, mirroring and stuff)
  • fix header links in doc
  • make headings understandable without context
  • consistent formatting (for feature set names, net arch names, etc...)
  • backprop math
  • linear layer with block sparse outputs forward (no results yet but it's an interesting direction)
  • linear layer with sparse inputs forward (put some data about sparsity after CReLU)
  • ways to reduce the network size with Half* (less king buckets, perspective mirror)
  • either write something about weight pruning or remove the mention that it will be described later in the document
  • revise implementation for sparse input FC (pad nnz indices so that only one loop is needed. use maddubs -> madd(ones, ...) implementation that processes 4 inputs at a time, with 8 bit weights). See https://pastebin.com/zDa7PbJh
  • a better nnz indices implementation for the above, see syzygy1/Cfish#204

Problem when trying to run train.py

When I try to run current train.py with python train.py 10m_d3_2.binpack d8_100000.binpack --num-workers=3 --threads=1 --batch-size=512 I get the following stack trace:

Traceback (most recent call last):
  File "train.py", line 73, in <module>
    main()
  File "train.py", line 68, in main
    checkpoint_callback = pl.callbacks.ModelCheckpoint(save_last=True)
  File "C:\Programy\Python36\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 173, in __init__
    self.__validate_init_configuration()
  File "C:\Programy\Python36\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 251, in __validate_init_configuration
    'ModelCheckpoint(save_last=True, monitor=None) is not a valid configuration.'
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(save_last=True, monitor=None) is not a valid configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None)

train.py issue

python train.py --smart-fen-skipping --random-fen-skipping 3 --batch-size 16384 --threads 2 --num-workers 2 --gpus 1 trainingdata validationdata
Traceback (most recent call last):
  File "C:\Users\bsryc\Desktop\nnue-pytorch-master\train.py", line 149, in <module>
    main()
  File "C:\Users\bsryc\Desktop\nnue-pytorch-master\train.py", line 63, in main
    raise Exception('{0} does not exist'.format(args.train))
Exception: trainingdata does not exist

Slow speed on NVIDIA Quadro RTX 4000 using latest commit (3532d8de4f8c3559c7a6f123a1b21bab69a66d63)

Hello there,

Using nnue-pytorch to train networks on old architecture (halfkp_256x2-32-32) I used to see speeds of roughly 30 iterations per second on my NVIDIA Quadro RTX 4000.

Using the same machine/GPU after upgrading to latest commit on nnue-pytorch (3532d8d) I see speeds of 2.6 iterations per second:

image

Can you please tell if this is expected? When I start the training I see a big spike of GPU usage for a few seconds (30-100%) but then it calms down.

I am running Windows 11 x64 with NVIDIA Quadro RTX 4000 and latest CUDA toolkit.

I made sure I am not using CPU training by modifying the following line in train.py file:

main_device = 'cuda:' + str(trainer.strategy.root_device.index)

I run training in the following way:

python train.py --smart-fen-skipping --random-fen-skipping 3 --batch-size 16384 --threads 2 --num-workers 2 --gpus 1 "data.binpack" "val.binpack"

I tried various values for threads and num-workers, but it does not change a thing.

Thank you for any hint/help on the issue!

Best regards, Volodymyr.

RuntimeError: Pinned memory requires CUDA

Hello,

Thanks for very interesting project and contributing to NNUE training.

I am trying to use the trainer for Igel and when running the test command:

python train.py total_3m_d14.bin total_3m_d16_nnue.bin --lambda 1.0 --val_check_interval 2000 --threads 2 --batch-size 16384 --progress_bar_refresh_rate 20

I got an error:

RuntimeError: Pinned memory requires CUDA. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -INCLUDE:?warp_size@cuda@at@@YAHXZ in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using link on your binary to see if there is a dependency on *_cuda.dll library.

I am running Windows 10 and I installed CUDA. Full output of the command is below:

(env) C:\Users\volodymyr\Documents\Sources\nnue-pytorch>python train.py total_3m_d14.bin total_3m_d16_nnue.bin --lambda 1.0 --val_check_interval 2000 --threads 2 --batch-size 16384 --progress_bar_refresh_rate 20
Feature set: HalfKP^
Num real features: 41024
Num virtual features: 704
Num features: 41728
Training with total_3m_d14.bin validating with total_3m_d16_nnue.bin
Seed 42
Using batch size 16384
Smart fen skipping: False
Random fen skipping: 0
limiting torch to 2 threads.
Using log dir logs/
C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\utilities\distributed.py:49: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: None, using: 0 TPU cores
Using c++ data loader
Ranger optimizer loaded.
Gradient Centralization usage = True
GC applied to both conv and fc layers

  | Name   | Type   | Params
----------------------------------
0 | input  | Linear | 10.7 M
1 | l1     | Linear | 16.4 K
2 | l2     | Linear | 1.1 K
3 | output | Linear | 33
----------------------------------
10.7 M    Trainable params
0         Non-trainable params
10.7 M    Total params
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "train.py", line 99, in <module>
    main()
  File "train.py", line 96, in main
    trainer.fit(nnue, train, val)
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
    results = self.accelerator_backend.train()
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\accelerators\cpu_accelerator.py", line 62, in train
    results = self.train_or_test()
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 593, in run_evaluation
    for batch_idx, batch in enumerate(dataloader):
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
    data = self._next_data()
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\torch\utils\data\dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\env\lib\site-packages\torch\utils\data\_utils\fetch.py", line 46, in fetch
    data = self.dataset[possibly_batched_index]
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\nnue_dataset.py", line 151, in __getitem__
    return next(self.iter)
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\nnue_dataset.py", line 89, in __next__
    tensors = v.contents.get_tensors(self.device)
  File "C:\Users\volodymyr\Documents\Sources\nnue-pytorch\nnue_dataset.py", line 32, in get_tensors
    white_values = torch.from_numpy(np.ctypeslib.as_array(self.white_values, shape=(self.num_active_white_features,))).pin_memory().to(device=device, non_blocking=True)
RuntimeError: Pinned memory requires CUDA. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -INCLUDE:?warp_size@cuda@at@@YAHXZ in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using link on your binary to see if there is a dependency on *_cuda.dll library.

Thanks for any help or hint for the issue.
With best regards,
Volodymyr
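
The traceback above ends in `.pin_memory()` being called on a CPU-only setup, which is what raises "Pinned memory requires CUDA". One possible workaround, sketched here under assumptions, is to pin host memory only when CUDA is actually usable. The helper `to_device` and the class `_StandInTensor` are hypothetical names and not part of nnue-pytorch; with real PyTorch you would pass a `torch.Tensor` and `torch.cuda.is_available()` as `cuda_available`.

```python
def to_device(tensor, device, cuda_available):
    """Move `tensor` to `device`, pinning host memory only when CUDA is usable.

    `tensor` is anything with `pin_memory()` and `to()` methods (e.g. torch.Tensor).
    """
    if cuda_available:
        tensor = tensor.pin_memory()
    return tensor.to(device=device, non_blocking=cuda_available)


class _StandInTensor:
    """Hypothetical stand-in so the sketch runs without PyTorch installed."""

    def __init__(self, pinned=False, device="cpu"):
        self.pinned, self.device = pinned, device

    def pin_memory(self):
        return _StandInTensor(True, self.device)

    def to(self, device, non_blocking=False):
        return _StandInTensor(self.pinned, device)


cpu_result = to_device(_StandInTensor(), "cpu", cuda_available=False)
gpu_result = to_device(_StandInTensor(), "cuda:0", cuda_available=True)
print(cpu_result.pinned, gpu_result.pinned)
```

On a CPU-only machine the pinning step is skipped entirely, so the code path that raised the error is never reached.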

NNUE

Hi Admin

The NNUE file size is too large.

Large input weights fluctuations for different king positions

I'm trying to get some more grip onto the input layer and decided to change the order of piece and king indices (along with undoing the king index flipping, in order to not let my brain explode in interpreting the plots). Here's the diff w.r.t. master:

--- a/visualize.py
+++ b/visualize.py

-                inpos = [(7-kipos[0])+pipos[0]*8,
-                         kipos[1]+(7-pipos[1])*8]
-                d = - 8 if piece < 2 else 48 + (piece // 2 - 1) * 64
+                inpos = [8*kipos[0]+pipos[0],
+                         8*(7-kipos[1])+(7-pipos[1])]
+                d = -2*(7-kipos[1]) - 1 if piece < 2 else 48 + \
+                    (piece // 2 - 1) * 64

The plots become:

reordered nn-62ef826d1a6d nnue_input-weights
reordered epoch=427 ckpt_input-weights

Some strange things/observations:

  • There seemed to be something special going on with the king on the first rank for the weights of a queen not on the first rank, both in master and in the vondele net (example hidden input neuron; there are a lot more of them). EDIT: this was a BUG in my code, I forgot to update the offset...

  • Look at this weird pattern in vondele net:
    image
    The only explanation I can give for it is that the hidden input neuron serves the purpose of detecting two "hidden features" at once. If the king is on the back rank (or on a6, a7, d7, h6 or h7), something completely different is happening/detected. Either that, or some gap in training data?

Correlation of input weights for different king positions/nets

I have been investigating the correlation between nets today and I would like to share my results (I don't really know a better place to put them than here :)).

First, denote the (coalesced) relevant input weights of a net as $w_{k, s, f}$, which is a real number for each king index k (from 0 to 63), piece square index s (from 0 to 623), and (hidden) feature index f (from 0 to 255).

In the sequel, we consider master net nn-62ef826d1a6d.nnue (an offspring of sergiovieri's run) and nn-0f63c1539914.nnue (the latest vdv net on fishtest).

Let's first investigate the correlation coefficient between the weights of the same net as a function of king index, defined as:

[image: definition of the correlation coefficient as a function of king index]

where we adopt the summation convention that <.> averages over all non-primed indices, so that the correlation coefficient corresponds to the usual definition and, if defined, leads to a positive semi-definite matrix (namely the covariance matrix of the input weights, normalized per king position).
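
As a rough illustration of the quantity described above, here is a minimal sketch that assumes the usual Pearson correlation, with the average taken over all piece square indices s and feature indices f for each pair of king indices. The exact formula from the original post is not reproduced here; the function names and the toy data are hypothetical.

```python
import math
import random


def flatten(weights_for_king):
    """Flatten the (s, f) grid of one king index into a single vector."""
    return [x for row in weights_for_king for x in row]


def pearson(a, b):
    """Usual Pearson correlation of two equally long vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / math.sqrt(var_a * var_b)


def king_correlation(w):
    """Correlation matrix indexed by (king index, king index)."""
    vectors = [flatten(w_k) for w_k in w]
    return [[pearson(u, v) for v in vectors] for u in vectors]


# Toy data: 4 king indices, 6 piece square indices, 3 features.
rng = random.Random(0)
w = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)] for _ in range(4)]
corr = king_correlation(w)
```

With real weights, `w` would have 64 king indices, and `corr` would be the 64x64 matrix that the plots visualize.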

Visually this leads to:
master_king
vdv_king

Observations:

  • The "piece" weights are pretty correlated for different king indices (note the scale!).
  • Both master and vdv net have lower correlation for indices 1 (square b1) and 6 (square g1). I cannot explain this, any ideas? EDIT: g1 corresponds to 0-0, b1 to 0-0-0 followed by prophylactic Kb1 :). Hmm, maybe it's just "black castles short" due to weird rotational symmetry atm.
  • There are some other differences (other patterns) between master and vdv net that I cannot explain. Maybe related to coalescing (factorizer on)?

In the same way, let's define the correlation coefficient between the weights of two nets i and j (can be the same), as a function of the hidden feature index, as:

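[image: definition of the correlation coefficient between two nets as a function of the hidden feature index]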

where again <.> averages over all non-primed subindices.

Then we have the following correlation coefficient matrix between master and vdv features:

cross

Now, let's reorder the features so that they come in pairs, so as to maximize the correlation. Then we get the following. Note that the diagonal of the off-diagonal block matrices (the cross-terms $N^{(i,j)}_{f'}$ between nets, i.e., for different i and j) becomes visible, and all four block matrices look visually the same:

cross_reordered
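
The post does not say how the pairs were chosen. One simple way to do such a reordering, shown here only as an assumption, is a greedy matching on the cross-correlation matrix: repeatedly take the largest remaining entry and pair its row (a feature of one net) with its column (a feature of the other). The name `greedy_pairing` and the toy matrix are hypothetical.

```python
def greedy_pairing(cross):
    """Pair rows (features of net i) with columns (features of net j).

    Repeatedly picks the largest remaining entry of `cross`, so that the most
    correlated pairs come first. Returns a list of (row, col, value) tuples.
    """
    entries = sorted(
        ((value, r, c) for r, row in enumerate(cross) for c, value in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for value, r, c in entries:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            pairs.append((r, c, value))
    return pairs


# Toy cross-correlation matrix between 3 features of each net.
cross = [
    [0.10, 0.80, 0.05],
    [0.70, 0.20, 0.10],
    [0.00, 0.30, 0.60],
]
pairs = greedy_pairing(cross)
print(pairs)
```

Greedy matching is not guaranteed to maximize the total correlation; an assignment solver would be the choice if that guarantee matters.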

Now, let's plot the input weights of the reordered nets (such that the most correlated features appear from left to right, top to bottom):

master
vdv

Note that I have normalized the weights w.r.t. their max. value for each feature!
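
A minimal sketch of such a per-feature normalization, assuming "max. value" means the largest absolute weight of each feature (the post does not specify this). The function name and the toy weights are hypothetical.

```python
def normalize_per_feature(weights):
    """Scale each feature's weights by that feature's largest absolute weight.

    `weights[f]` is the list of input weights feeding hidden feature `f`.
    Features whose weights are all zero are left unchanged.
    """
    normalized = []
    for feature_weights in weights:
        peak = max(abs(x) for x in feature_weights)
        scale = peak if peak > 0 else 1.0
        normalized.append([x / scale for x in feature_weights])
    return normalized


weights = [[2.0, -4.0, 1.0], [0.5, 0.25, -0.125], [0.0, 0.0, 0.0]]
normalized = normalize_per_feature(weights)
print(normalized)
```

After this step every feature spans at most the range [-1, 1], so patterns in weak and strong features are equally visible in a plot.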

Visually, some of the same "hidden feature recognizers" are distinguishable (same patterns), which is pretty amazing, considering the nets originate from totally different independent training runs (nodchip vs pytorch, but also other training data and parameters I guess).

If someone is interested, the code can be found here: https://github.com/ddobbelaere/nnue-pytorch/tree/correlation-research

Console output interpretation

Input data

Training pos: 500m depth 5
Validation pos: 1m depth 10

Command line

python train.py --flush_logs_every_n_steps 1 --max_epochs 200 --smart-fen-skipping --batch-size 8192 --threads 4 --num-workers 4 --gpu 1 ncsf_2021-02_pos500000000_d5_train_tn20.binpack ncsf_2021-02_pos1000000_d10_val_tn10.binpack

Console output

Epoch 1:  39%|█████████▍              | 4846/12331 [05:59<09:15, 13.47it/s, loss=0.0136, v_num=1]

What do the following fields mean?
4846/12331 [05:59<09:15, 13.47it/s, loss=0.0136, v_num=1]
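
This looks like a standard tqdm-style progress bar as printed by PyTorch Lightning. Under that assumption, `4846/12331` is the number of batches done out of the total the bar expects for the epoch, `05:59` is the elapsed time, `09:15` is the estimated time remaining, `13.47it/s` is batches per second, `loss` is the running training loss, and `v_num` is the logger's version number for the run. The sketch below parses the line and checks that the remaining time is consistent with the other fields.

```python
import re

line = "4846/12331 [05:59<09:15, 13.47it/s, loss=0.0136, v_num=1]"

pattern = re.compile(
    r"(?P<done>\d+)/(?P<total>\d+) "
    r"\[(?P<elapsed>[\d:]+)<(?P<remaining>[\d:]+), "
    r"(?P<rate>[\d.]+)it/s"
)
fields = pattern.search(line).groupdict()

done, total = int(fields["done"]), int(fields["total"])
rate = float(fields["rate"])

# Remaining time = batches left / batches per second.
eta_seconds = (total - done) / rate
eta = f"{int(eta_seconds // 60):02d}:{int(eta_seconds % 60):02d}"
print(fields, eta)
```

The computed value matches the `09:15` shown in the bar, which supports reading the third time field as an estimate derived from the current rate.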

Network layer sizes are global

This means that the sizes in .ckpt cannot be inferred and must match the original script. This makes it harder to explore different sizes. The sizes of the layers should be a parameter to NNUE.

The deserialization from .nnue cannot infer the layer sizes sadly, but that's not a big limitation as this use case is not common.
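
A rough sketch of the idea, not the project's actual code: pass the sizes to the model constructor and store them next to the weights, so that loading a checkpoint no longer depends on module-level globals. `LayerSizes`, `NNUESketch`, the field names, the default values and the checkpoint layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LayerSizes:
    """Hypothetical container for the sizes that are currently globals."""
    l1: int = 256
    l2: int = 32
    l3: int = 32


class NNUESketch:
    """Stand-in for the model: takes the sizes as a constructor argument."""

    def __init__(self, sizes=LayerSizes()):
        self.sizes = sizes

    def to_checkpoint(self):
        # Store the sizes next to the weights so loading does not rely on globals.
        return {"hyper_parameters": asdict(self.sizes), "state_dict": {}}

    @classmethod
    def from_checkpoint(cls, checkpoint):
        return cls(LayerSizes(**checkpoint["hyper_parameters"]))


wide = NNUESketch(LayerSizes(l1=512, l2=16, l3=32))
restored = NNUESketch.from_checkpoint(wide.to_checkpoint())
print(restored.sizes)
```

In PyTorch Lightning, calling `save_hyperparameters()` in the module's constructor serves a similar purpose for `.ckpt` files.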

Ranger GC when the weights are transposed.

With the introduction of the new kernel the weights of the feature transformer are now transposed, which means ranger's GC is applied on a different axis. Maybe that's the source of the regression? Will try to disable GC and will have more data tomorrow.

This investigation follows a weird regression from vondele and weird evaluations favoring black even in startpos in my recent nets.
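
To illustrate why the layout matters, here is a toy sketch. It assumes gradient centralization (GC) subtracts the mean over every axis except the first, which is how the technique is usually described. The helper names and the toy gradient are hypothetical.

```python
def centralize(grad):
    """Subtract from each row its own mean (mean over every axis but the first)."""
    return [[x - sum(row) / len(row) for x in row] for row in grad]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Toy gradient with shape (outputs=2, inputs=3).
grad = [[1.0, 2.0, 3.0],
        [4.0, 6.0, 8.0]]

# Layout (outputs, inputs): each output neuron's gradient is centred.
per_output = centralize(grad)

# Layout (inputs, outputs): the same code now centres across output neurons.
per_input = transpose(centralize(transpose(grad)))

print(per_output)
print(per_input)
```

The two results differ, so an optimizer that applies GC along a fixed axis produces a different update once the stored weight matrix is transposed.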
