


lczero-training's Issues

Model cannot be saved in the SavedModel format

Loading tfprocess.py with a YAML config and then calling model.save('somepath') or the tf.keras.models.save_model() function does not work and returns the error:
TypeError: Unable to serialize 0.319471538066864 to JSON. Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.
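
The error is raised while JSON-serializing part of the model configuration, so any EagerTensor that ends up in the config has to be converted to a plain Python number first. A minimal, self-contained illustration of the same failure mode (the scale value is just the number from the error message above, not anything taken from tfprocess.py):

import json
import tensorflow as tf

scale = tf.constant(0.319471538066864)        # an EagerTensor
print(json.dumps({"scale": float(scale)}))    # works once converted to a plain Python float
try:
    json.dumps({"scale": scale})
except TypeError as e:
    print(e)                                  # EagerTensor is not JSON serializable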

Training script crashes after hitting a checkpoint

Hi,
I tried discussing this on Discord, but I didn't see a clear response. The issue I am currently facing is that after every checkpoint, or in some cases after every 1000 steps, an assertion error is thrown.
message.txt
I have tried various fixes for this problem (changing the parser, changing the CUDA version, the Python version, the TF version, etc.), but it continues to persist.
I referred to the following link and tried to fix it, but it still didn't work:
tensorflow/tensorflow#35100
Please help me rectify this issue.

My current PC configuration:
8 GB RAM
Intel i7-4790K
NVIDIA RTX 2070 SUPER
1TB SSD

My current requirement setup:
CUDA 11.3
CUDNN 8.2.1
Python 3.9.5
TF-Nightly GPU (2.7.0 dev)

Thank you.

404 at training data link

I am getting a 404 error at the link where the training data ".tar.gz" file exists. Any update on where to find the training data and how to use it?

ChunkParser fails with more than 1 worker in Python 3.7

I've been trying to mess around with AdamWOptimizer instead of MomentumOptimizer using tensorflow 1.13.1, but ChunkParser crashes because in Python 3.7 Process isn't picklable.

Using 10 worker processes.
Traceback (most recent call last):
  File "train.py", line 159, in <module>
    main(argparser.parse_args())
  File "train.py", line 109, in main
    shuffle_size=shuffle_size, sample=SKIP, batch_size=ChunkParser.BATCH_SIZE)
  File "C:\Users\Ryan\PycharmProjects\lczero-training\tf\chunkparser.py", line 96, in __init__
    p.start()
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle weakref objects
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Ryan\Anaconda3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I believe it's related to this: https://bugs.python.org/issue34034

I can work around it by just forcing workers to 1, but obviously that's not ideal.

I've never written any Python mp code before, but if the lc0 devs don't want to make changes to this, could you perhaps suggest where I might look at how to best change this to play friendly with 3.7? Thanks!
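
For reference, a generic spawn-safe pattern (a rough sketch, not the actual ChunkParser fix): under the "spawn" start method that Windows and newer Python versions use, everything handed to Process must be picklable, so the worker target is a module-level function that receives only plain data and a pipe endpoint rather than the parser object itself.

import multiprocessing as mp

def worker(conn, chunk_filenames, batch_size):
    # Stand-in for the real chunk parsing: everything received here is plain,
    # picklable data, which is what the "spawn" start method requires.
    for name in chunk_filenames:
        conn.send((name, batch_size))
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe(duplex=False)
    p = mp.Process(target=worker, args=(child_conn, ["a.gz", "b.gz"], 1024))
    p.start()
    child_conn.close()          # parent keeps only the receiving end
    while True:
        try:
            print(parent_conn.recv())
        except EOFError:
            break
    p.join()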

question about training

Would it be possible to train a net of LC0 without queens for one leg and then a second leg with queens back?

Discrepancy of batch norm to A0 paper

In A0, 64 TPU workers are used to train the network. That means that with their batch size of 4096, each worker independently processes a subset of every minibatch containing 64 samples. Thus, the batch statistics of the batch norms that each worker computes is actually over these 64 samples, not over the whole minibatch. Our current training code computes batch norm statistics over the whole minibatch.

In this paper, "ghost batch normalisation" is discussed, which is a scheme equivalent to splitting the batch to multiple workers as was done in A0. They find that this improves generalisation performance (see Table 1 "+GBN" column), which should translate to better playing strength for Leela.

My suggestion is to change the training code to include ghost batch normalisation in order to match what is described in the A0 paper. Fortunately, this is simple to implement because tf.layers.batch_normalization supports it through the virtual_batch_size argument. In the three places in tfprocess.py where it is called, it simply needs another argument virtual_batch_size=64 passed to it. No other changes should be needed.
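
A sketch of what that one-argument change would look like (the surrounding arguments such as axis and scale are illustrative, not copied from tfprocess.py):

import tensorflow as tf

def batch_norm(x, training):
    # Ghost batch normalisation: batch statistics are computed over virtual
    # sub-batches of 64 samples, matching the per-worker batch size in A0.
    return tf.layers.batch_normalization(
        x, axis=1, center=True, scale=False,
        virtual_batch_size=64, training=training)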

Contributing self-play games

How can I join in running self-play games? The latest code I have (10256) does not support the existing game records (8356). I have an idle 3090 GPU and can help generate games.

assertion error

My config:

%YAML 1.2
---
    
    #EARLY RELEASE###########################################################################
    # If you are reading this, this guided-example.yaml is still a work in progress,
    # please read it over and add corrections/improvements/explanations/recommended 
    # parameter values and share it back via PM to 12341234 on discord.  Eventually this .yaml
    # will help jumpstart new trainers and reduce the amount of redundant work answering the same
    # questions on discord.
    #########################################################################################

    #CREDITS#################################################################################
    # This .yaml was produced by the hard work of many contributors from the Leela Chess Discord Channel
    #########################################################################################
    
    #----------------------------------------------------------------------------------------#
    #----------------------------------------------------------------------------------------#

    #WELCOME#################################################################################
    # Welcome to the training configuration example.yaml!  Here you can experiment with various 
    # training parameters.  Each parameter is commented to help you better understand the params
    # and values. If you are just starting, you can leave most of the values as they are (just change
    # "YOURUSERNAME" and YOURPATH). Try and familiarize yourself with the options available below.  
    # Happy training!
    #########################################################################################

    #INITIAL SETUP###########################################################################
    # Choose a name which properly reflects the parameters of your training.  In this example
    # I use "ex" for example, separated by a dash (ideally use no spaces in the name)
    # and 64x6 for the network size, which is defined at the end of this configuration file.
    # I added se4 to show the parameter "squeeze/excite" is at a ratio of 4
    # Of course you can come up with your own logical convention or fun ones such as terminator
    # or Cyberdyne Systems Model 101, but try to add meta information to help identify specific
    # training parameters, such as Schwarzenegger-64x6-se10, etc...
name: 'asd-ok'
    #----------------------------------------------------------------------------------------
    # gpu: 0, sets the GPU ID to use for training. Leela can only use 1 GPU for training
gpu: 0
    ##########################################################################################
        

# dataset section is a group of parameters which control the input data for training and testing
dataset: 

    #TRAINING/TESTING DATA#####################################################################
    # input Edit directory path to reflect your prepared data
    input: '/home/france1/t60data/prepareddata/'
    #----------------------------------------------------------------------------------------
    # train_ratio value is the proportion of data divided between training and testing.
    # This example.yaml is setup for a one-shot run with all data in one directory.
    # .65=65% training /35% testing, .70=70/30, .80=80/20, .90=90/10, etc...
    train_ratio: 0.90                   
    #----------------------------------------------------------------------------------------
    # For manually setting up separate test and train data, uncomment input_train and input_test below
    # and comment out train_ratio and train above
    # You can randomly choose what data goes into what folder, but it should be in the proportion 
    # you desire (see training set ratio for example proportions)
    # input_train: 'C:\Users\YOURUSERNAME\lczero-training\t60data\t60preppeddata\train\' # supports glob
    # input_test: 'C:\Users\YOURUSERNAME\lczero-training\t60data\t60preppeddata\test\'  # supports glob  
    # Manually setting up test and train directories allows you to have more control over what data is
    # used for testing and training.
    ############################################################################################
    
    
    #CHUNKS###################################################################################
    # Chunks are input samples needed for training.  For Leela these samples come from chunk files
    # which have at least one and sometimes many games.  Samples are positions selected randomly
    # from the games in the chunk files.  
    #----------------------------------------------------------------------------------------
    # num_chunks specifies the number of chunks from which to select samples. It uses only the X most
    # recent chunks to parse for rolling training that the RL (reinforcement learning) method uses.
    # For instance, if you have 10 million games in the training path, it only uses the X most recent ones.
    # Advanced info: It is quite common to train with hundreds of thousands of chunk files.
    # When starting training, the code will create a list of all the files and then sort them by
    # date/time stamp.  This can take a considerable amount of time.  Consuming the files in order 
    # is important for reinforcement learning (RL) and less so for supervised learning (SL).   
    # So, if you are doing SL, you can just comment out the sort in train.py currently at lines 51 
    # and 62 [chunks.sort(key=os.path.getmtime, reverse=True)] and it will run much faster.  
    # Also, Windows typically slows down with more than about 30-50,000 files in a directory, so use
    # a multi-tier directory structure for a large number of files.  If you get the not enough chunks
    # message with zero, it usually means there is an error in the input: parameter string pointing to
    # the top chunk file directory.  Because syntax errors are common (at least for me), I like to comment
    # out the sort to make sure the chunk files are being found first, and then restart training with 
    # the sort if doing RL.
    num_chunks: 100000                  
    # -----------------------------------------------------------------------------------------
    # allow_less_chunks: true allows the training to proceed even if there are fewer chunk files than 
    # num_chunks, as will almost always be the case. Earlier versions of the training code did not
    # have this option and training would fail with a not enough chunks message. 
    allow_less_chunks: true
    # num_chunks_train sets the number of chunks to be used for training data.
    # This is not needed if you are using train_ratio.
    #num_chunks_train: 10000000
    # num_chunks_test sets the number of chunks to be used for test data.
    # Also not needed if you are using train_ratio.
    #num_chunks_test: 500000
    ########################################################################################### 

    
    # ADVANCED DATA OPTIONS ###################################################################
    # The games in PGN format are converted into input bit planes in the chunk files during self-play
    # (controlled by the client) or using the trainingdata-tool (typically for SL).  During training
    # the input bit planes are converted into tensor format for Tensorflow processing.  This is done
    # one of two ways.  The most recent training code uses protobufs. 
    # experimental_v5_only_dataset automatically creates workers for reading and converting input. 
    # The older way uses the chunkparser.py code, which is part of a Python generator to provide the 
    # samples and uses additional processes called workers.  Reading and converting the input chunk files
    # can constrain training performance, so more workers is generally good.  In addition, it is best to
    # keep them on a fast SSD disk.  More workers can be created than physical cores, depending on your
    # system, so experiment. First get training working using the simplest options, then experiment
    # and try different things to train faster.  Training speed for smaller nets (10b size) will often be
    # limited by the disk reads or chunk processing in the CPU.  Larger nets will quickly become limited
    # by the GPU speed. However, if you use the newer format the workers are created automatically.
    # So, start with experimental_v5_only_dataset: true 
    experimental_v5_only_dataset: true
    #----------------------------------------------------------------------------------------
    # train_workers manually specifies the number of workers to create for reading and converting the input
    # for training
    #train_workers=x
    #----------------------------------------------------------------------------------------
    # test_workers manually specifies the number of workers to create for reading and converting the input 
    # for testing
    #test_workers=x
    #----------------------------------------------------------------------------------------
    #input_validation: 'C:\YOURDIRECTORY\validate_v5\'
    ############################################################################################   
   
   
# training section is a group of parameters which control the testing and training parameters   
training:


    #PATH SETUP###############################################################################
    # path specifies where to save your networks and checkpoints
    path: '/home/france1/t60data/trainednet/'
    ########################################################################################## 
    
    
    #TEST POSITIONS###########################################################################
    # num_test_positions is the number of test positions used when evaluating the net
    num_test_positions: 10000           
    ##########################################################################################
    
    
    #BATCH SETTINGS##################################################################################    
    # batch_size: value specifies the number of training positions per step sent in parallel to the GPU.  
    # (A step is the fundamental training unit.)  My usual is 4096.
    # A good rule of thumb is for every 1 Million training games, use 10k steps with 4096 batch,
    # which is 40,960,000 positions sampled or just over 40 positions per game.  In the same way, with
    # a 2048 batch use 20k steps. More sampling than this may lead to overfitting from oversampling the
    # same games.
    # Each training.tar from http://data.lczero.org/files/ contains about 10k games, so every 100
    # training.tar files is roughly 1 million games.
    batch_size: 4096                   
    # num_batch_splits is a "technical trick" to avoid overflowing the GPU memory.  It divides the 
    # batch_size into "splits".  Make sure num_batch_splits divides evenly into batch_size.  Although the
    # training speed decrease is small, it's best to use the smallest split your VRAM can handle.  A split
    # of 4 (compared to 8) leaves 2x the VRAM headroom with little speed difference.  If the net
    # is very small, num_batch_splits can be 1.  If the net is any decent size it needs to be higher.
    num_batch_splits: 8                   
    ##########################################################################################


    #STOCHASTIC WEIGHT AVERAGING############################################################
    # swa is a weight averaging method that is usually a good idea to apply, most nets use this.
    # It averages weights over some number of recent sets of nets, with net sampling determined by 
    # the parameters.  This “average net” will tend to be a little better than just taking the final net
    # https://arxiv.org/pdf/1803.05407.pdf
    swa: true
    #----------------------------------------------------------------------------------------    
    # swa_output turns on the stochastic weight averaging for weights outside of the optimization, averaging 
    # the (non-active) weights over a number of defined steps.  This takes quite the computational resources
    # but is generally desired and stronger than the weights created without this feature.  
    swa_output: true                   
    #----------------------------------------------------------------------------------------    
    # swa_steps specifies how many steps apart consecutive weight samples for averaging are taken
    swa_steps: 20                       
    #----------------------------------------------------------------------------------------    
    # swa_max_n is the cap on n used in the 1/n multiplier for the current weights when merging them with 
    # the current SWA weights.  First it's 1/1 - all of it comes from the current weights, then 1/2 - half 
    # and half - then 1/3, etc...  When it reaches n the network is the average of the last n sample points,
    # but once n becomes capped it's a weighted average of all previous sample points. This is biased towards 
    # more recent samples - but still theoretically has tiny contributions from samples all the way back. 
    # In practice, floating point accuracy means that the old contributions can be ignored at some point.
    # So for a swa_max_n of 10 it means networks farther back than the last 10 networks will not be 
    # multiplied by 1/11, 1/12, 1/13 etc, but instead, 1/10.  The larger n is the less past averages
    # affect the current swa network.  The smaller n is the more past averages affect the new swa average.
    swa_max_n: 10                       
    ##########################################################################################
      
      
    #CONFIGURE STEP TRIGGERS##################################################################
    # eval test set values after this many steps
    test_steps: 2000                    
    #---------------------------------------------------------------------------------------- 
    # training reports values after this many steps.
    train_avg_report_steps: 200         
    #---------------------------------------------------------------------------------------- 
    # validation_steps specifies how long the training will run.  
    # Each step consists of a number of input samples.  This is the batch_size.  So, if the batch_size
    # is 1024 and the number of steps is 100,000, then 102,400,000 input samples will be used.  
    # Note that this is the number of sample positions, not games.  Each game has roughly 150 positions;
    # even so, a large number of games is needed.  Not every position is used in a game.  In fact only
    # every 32nd position generally is.  Generally, with machine learning (ML) the input samples are
    # divided into several groups: training, testing, and validation.  The training samples are the ones
    # used to do the actual learning.  The test samples are used every so often (test_steps parameter)
    # to evaluate how well the net is doing against samples it has not seen before.  Sometimes the
    # validation samples are also used. Like test_steps, validation_steps specifies how often the
    # validation data get used.  
    # Also, note that the validation samples should be totally separate from any of the train and test
    # data.  Tensorflow can produce graphs (using tensorboard) that show how the learning is progressing.
    # The train, test and optional validation trends help to know when to change things to improve
    # training, or not.  The validation_steps parameter is relatively new and can certainly be omitted
    # until experimenting with more advanced training.  
    # Finally, note that in general ML training is also measured in epochs.  An epoch is one run through
    # all of the train/test/validation number of steps.  Then, training starts again with the somewhat
    # smarter net using the same collection of input samples.  Most ML training runs for many epochs.
    # With Leela, there is so much data that the epoch concept is not used.  So, when reading the papers
    # about ML that give results after many epochs, keep in mind that the technique used may not work for
    # Leela (which effectively is using just one epoch).
    validation_steps: 2000              
    #----------------------------------------------------------------------------------------     
    # terminate (total) steps, for batch 4096 <10k steps per million games
    total_steps: 140000                 
    #----------------------------------------------------------------------------------------     
    # optional frequency for checkpoints before finish
    checkpoint_steps: 10000             
    ##########################################################################################


    #RENORMALIZATION##########################################################################
    # Renormalizing outputs for residual layers, basically when activation values (not weights)
    # for each layer are computed, it redistributes them to match some preferred distribution 
    # (renormalizes). These should start low and gradually increase.
    renorm: false
    #---------------------------------------------------------------------------------------- 
    #Start at renorm_max_r=1.0 renorm_max_d=0.0
    #---------------------------------
    #renorm_max_r=1.1 renorm_max_d=0.1
    #renorm_max_r=1.2 renorm_max_d=0.2
    #renorm_max_r=1.3 renorm_max_d=0.3
    #renorm_max_r=1.4 renorm_max_d=0.4
    #renorm_max_r=1.5 renorm_max_d=0.5
    #(all of the above can happen during the 1st LR (0.2))
    #---------------------------------
    #renorm_max_r=1.7 renorm_max_d=0.7
    #renorm_max_r=2.0 renorm_max_d=1.0
    #renorm_max_r=3.0 renorm_max_d=2.5
    #renorm_max_r=4.0 renorm_max_d=5.0
    #(all of the above can happen during the 2nd LR (0.02))
    #---------------------------------
    #If continuing training on a relatively mature net, you can set these to final values
    #-------------------------------------------------------------------------------
    #renorm_max_r: 4.0                  # gradually raise from 1.0 to 4.0
    #renorm_max_d: 5.0                  # gradually raise from 0.0 to 5.0 
    #max_grad_norm: 5.0                 # can be much higher than 2.0 after first LR  NEEDS MORE EXPLANATION
    ##########################################################################################

  
    #LEARNING RATE############################################################################    
    # The learning rate (LR) is a critical hyper-parameter in machine learning (ML).  This term is
    # multiplied in the calculation to determine how much to change things after a step.  If the LR
    # is too big, the steps will be too large and not converge to a good value (local or global minimum)
    # as the net "bounces" around the solution space.  If the LR is too little, the net will converge,
    # but very slowly.  And, it may get "stuck" at a local minimum and not be able to see past that to
    # find a better minimum.  There are many papers and techniques about how to pick a starting learning
    # rate, and how and when to change it during training.
    #-------------------------------------------------------------------------------
    # list of LR values
    lr_values:                         
        - 0.02
        - 0.002
        - 0.0005
    #-------------------------------------------------------------------------------
    # list of boundaries in steps
    lr_boundaries:                     
        - 100000                        # steps until 1st LR drop
        - 130000                        # steps until 2nd LR drop
    #-------------------------------------------------------------------------------    
    # warmup_steps specifies how many steps are used to ramp up to the first value in the lr_values list,
    # which is then used until the number of steps exceeds the first value in the lr_boundaries list.  Since
    # small LR values will head in a relatively good direction, small values are initially used during a
    # "warm up" period to get the net started.  Use the default value before attempting more advanced training.
    warmup_steps: 250                  
    ##########################################################################################
 

    #LOSS WEIGHTS##############################################################################
    # Each loss term measures how far the current net is from matching the training data,
    # there is policy loss, value loss, moves left loss and (optionally) Q loss
    # for each of those loss terms, a "backpropagation" method changes the net weights to better 
    # match the training data
    # the measurement is repeated on batch_size random positions from training data and the average
    # losses are used for backpropagation
    # Each such batch is a "step"
    # it isn't all that clear whether Q loss is useful, sometimes I use it and sometimes I don't
    policy_loss_weight: 1.0            # weight of policy loss, value range: 0-1
    #-------------------------------------------------------------------------------     
    value_loss_weight:  1.0            # weight of value loss, values range: 0-1
    #-------------------------------------------------------------------------------     
    moves_left_loss_weight: 0.0        # weight of moves_left loss, values range: 0-1
    ##########################################################################################

   
    #Q-RATIO##################################################################################
    # The loss_weight of q ratio + z outcome = 1.
    # Q = W-L (no draw)
    # (ex: q_ratio of .35, sets z_outcome_loss_weight to .65)
    q_ratio: 0.00                      
    ##########################################################################################

    
    #SHUFFLE#############################################################################
    #This controls the "stochastic" part of training: basically loading large sets of positions and randomizing their order.
    # As positions get used in training, it replaces those with new random positions.
    shuffle_size: 200000                # typically 500k or more, but you can get away with a lot less
    ##########################################################################################
    
    
    #MISCELLANEOUS###NEEDS SORTED TO PROPER CATEGORY IN .YAML, IF ONE EXISTS, OTHERWISE STAYS IN MISC##
    # lookahead optimizer is not supported by Lczero published training scripts, it has not shown any
    # significant improvements.
    #lookahead_optimizer: false         
    mask_legal_moves: true             # Filters out illegal moves.    NEEDS BETTER EXPLANATION
    #-------------------------------------------------------------------------------
    # precision: 'single' or 'half' specifies the floating point precision used in the training calculations.
    # Single is tf.float32 and half is tf.float16.  Some GPUs can use different precision formats
    # faster than others.  Even with tf.float16, which is not as accurate as tf.float32, the key internal
    # calculations requiring the most precision are done with tf.float32.
    # Start with single and once training is working see how your GPU does with half.  
    # It may not be supported, or it may be slower or faster.
    precision: 'single'
    #-------------------------------------------------------------------------------

# model section is a group of parameters which control the architecture of the network
model:
    
    
    #NETWORK ARCHITECTURE#####################################################################
    filters: 64                         # Number of filters
    residual_blocks: 6                  # Number of blocks
    #-------------------------------------------------------------------------------
    se_ratio: 4                         #Squeeze Excite structural network architecture.
    # (SE provides significant improvement with minimal speed costs.)
    # The se-ratio should be based on the net size.  
    #(Not all values are supported with the backend optimizations.)  A common/default value is 8
    # You want the se_ratio to divide into your filters evenly (at least a multiple of 8; 32 and 64
    # are popular target outcomes).
    # A recent update to the codebase means that for 320 filters, SE ratios 10 and 5 are optimized in the
    # chess fp16 flows using a fused SE kernel.
    ##########################################################################################
    ##########################################################################################
    
    #POLICY##################################################################################  
    # policy: the default is convolution with classical option.  
    # policy: convolution allows skipping blocks.
    policy: 'convolution'             
    #policy: classical option does not have block skipping                
    # policy_channels are only used with "policy: classical" architecture.  Default value is 32.
    # The value is the number of outputs of a specific layer, in this case the policy head
    # policy_channels: 80              
    ##########################################################################################
    ##########################################################################################
    
    #MISCELLANEOUS###NEEDS SORTED TO PROPER CATEGORY IN .YAML, IF ONE EXISTS, OTHERWISE STAYS IN MISC##    
    # value: 'wdl' stands for (wins/draws/losses); the default is value: 'wdl', and it can optionally be false. 
    value: 'wdl'                          
    # moves_left is a recent third type of head, in addition to policy and value.  It tries to predict
    # game length to reduce endgame shuffling.  Default value is moves_left: 'v1', optional value is false.
    moves_left: 'none'                    
    # input_type default value is classic
    # input_type: 'frc_castling' changes input format for FRC (Fischer Random Chess)
    # input_type: 'canonical' changes input format so that enpassant is passed to the net, rather
    # than inferred from history.  History is clipped at last 50 move reset or castling move.  If
    # no castling rights, board is flipped to ensure your king is on the righthand side.  If no pawns
    # (and no castling rights) board is flipped/mirrored/transposed to ensure king is in bottom right
    # octant towards the middle.  If king  is on the diagonal, other rules apply to decide whether to
    # transpose or not, to ensure a consistent view of the board is always given to the net.
    # Also, the castling information is encoded for the net to understand FRC castling.
    # input_type: 'canonical_100' is the same as 'canonical' but divides the 50 move rule counter by 100 instead.
    # input_type: 'canonical_armageddon' changes the input format for the armageddon chess variant
    # More on input_type: 
    # Self-play games are saved in PGN and bitplane format which is used for training.  There have been
    # many bitplane versions.  These are the samples (positions) that are randomly picked from chunk file
    # games.  After being generated by self-play, the chunk files are typically rescored.  The rescorer is
    # a custom version of Lc0 that checks syzygy and Gaviota tablebases (TBs) to correct the results
    # using the perfect information from the TBs.  After rescoring the chunks are finally ready for 
    # training.  The versions should match. Rescorer is here: https://github.com/Tilps/lc0/tree/rescore_tb
    # The rescorer 'up converts' versions not input types, 3 to 4 or 4 to 5 (with classical input type),
    # 5 (classical) to 5 (canonical).  I would have to study the code more to see the difference between an input "type" and a "version".
    input_type: 'canonical'               
    ##########################################################################################
    ##########################################################################################

And the output:

python train.py --cfg ../../asd.yaml                                            
dataset:
  allow_less_chunks: true
  experimental_v5_only_dataset: true
  input: /home/france1/t60data/prepareddata/
  num_chunks: 100000
  train_ratio: 0.9
gpu: 0
model:
  filters: 64
  input_type: canonical
  moves_left: none
  policy: convolution
  residual_blocks: 6
  se_ratio: 4
  value: wdl
name: asd-ok
training:
  batch_size: 4096
  checkpoint_steps: 10000
  lr_boundaries:
  - 100000
  - 130000
  lr_values:
  - 0.02
  - 0.002
  - 0.0005
  mask_legal_moves: true
  moves_left_loss_weight: 0.0
  num_batch_splits: 8
  num_test_positions: 10000
  path: /home/france1/t60data/trainednet/
  policy_loss_weight: 1.0
  precision: single
  q_ratio: 0.0
  renorm: false
  shuffle_size: 200000
  swa: true
  swa_max_n: 10
  swa_output: true
  swa_steps: 20
  test_steps: 2000
  total_steps: 140000
  train_avg_report_steps: 200
  validation_steps: 2000
  value_loss_weight: 1.0
  warmup_steps: 250

got 20673 chunks for /home/france1/t60data/prepareddata/
sorting 20673 chunks...[done]
training.283088654.gz - training.283109356.gz
Using 30 worker processes.
Using 30 worker processes.
2022-02-27 11:22:38.905570: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-27 11:22:39.533747: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-02-27 11:22:39.552706: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.552831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:07:00.0 name: NVIDIA GeForce GTX 1660 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 22 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 312.97GiB/s
2022-02-27 11:22:39.552844: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-27 11:22:39.554756: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-02-27 11:22:39.554778: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-02-27 11:22:39.569353: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-02-27 11:22:39.569492: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-02-27 11:22:39.569837: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-02-27 11:22:39.571197: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-02-27 11:22:39.571279: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-02-27 11:22:39.571334: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.571466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.571564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2022-02-27 11:22:39.572156: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-27 11:22:39.573005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.573113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:07:00.0 name: NVIDIA GeForce GTX 1660 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 22 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 312.97GiB/s
2022-02-27 11:22:39.573144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.573255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.573351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-02-27 11:22:39.573368: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-27 11:22:39.852428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-27 11:22:39.852457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2022-02-27 11:22:39.852462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2022-02-27 11:22:39.852602: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.852737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.852840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-27 11:22:39.852941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1893 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1660 SUPER, pci bus id: 0000:07:00.0, compute capability: 7.5)
2022-02-27 11:22:39.982744: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2022-02-27 11:22:39.996678: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3401190000 Hz
Using 19 evaluation batches
2022-02-27 11:22:40.993055: W tensorflow/core/framework/op_kernel.cc:1755] Unknown: AssertionError: 
Traceback (most recent call last):

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
    ret = func(*args)

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 961, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 565, in parse
    for b in gen:

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 551, in batch_gen
    s = list(itertools.islice(gen, self.batch_size))

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 542, in tuple_gen
    yield self.convert_v6_to_tuple(r)

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 335, in convert_v6_to_tuple
    assert input_format == self.expected_input_format

AssertionError


Traceback (most recent call last):
  File "train.py", line 257, in <module>
    main(argparser.parse_args())
  File "train.py", line 234, in main
    batch_splits=batch_splits)
  File "/home/france1/smr-hdd/asd/lczero-training/tf/tfprocess.py", line 625, in process_loop
    self.process(batch_size, test_batches, batch_splits=batch_splits)
  File "/home/france1/smr-hdd/asd/lczero-training/tf/tfprocess.py", line 811, in process
    self.calculate_test_summaries(test_batches, steps + 1)
  File "/home/france1/smr-hdd/asd/lczero-training/tf/tfprocess.py", line 933, in calculate_test_summaries
    x, y, z, q, m = next(self.test_iter)
  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 761, in __next__
    return self._next_internal()
  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 747, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2728, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: AssertionError: 
Traceback (most recent call last):

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
    ret = func(*args)

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "/home/france1/anaconda3/envs/lc0-training/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 961, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 565, in parse
    for b in gen:

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 551, in batch_gen
    s = list(itertools.islice(gen, self.batch_size))

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 542, in tuple_gen
    yield self.convert_v6_to_tuple(r)

  File "/home/france1/smr-hdd/asd/lczero-training/tf/chunkparser.py", line 335, in convert_v6_to_tuple
    assert input_format == self.expected_input_format

AssertionError


         [[{{node PyFunc}}]] [Op:IteratorGetNext]

I started it with python train.py --cfg ../../asd.yaml
I had to install tensorflow-gpu, not tensorflow; the requirements file is pretty broken.
Also, is input_validation: 'C:\YOURDIRECTORY\validate_v5\' required? What is it?
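
The assertion that fails is assert input_format == self.expected_input_format in chunkparser.py, which suggests the input format stored in the training chunks does not match what the input_type: 'canonical' setting expects. A rough way to check what the chunks actually contain (a sketch assuming v6 records, which start with a 4-byte version followed by a 4-byte input_format per the struct layout in chunkparser.py):

import gzip
import struct

# Filename taken from the log above; only the first record is inspected.
with gzip.open("training.283088654.gz", "rb") as f:
    version, input_format = struct.unpack("<ii", f.read(8))
print("record version:", version, "input format:", input_format)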

Softmax Policy Target

I discovered this afternoon that if you give a non-zero policy training weight with data where the policy doesn't add up to 1, the reg term goes absolutely berserk (I've seen reg losses of 5000). I think this happens because the net is trying to reach an impossible policy distribution. Would it be a significant slowdown to either re-normalize the policy target or to have a warning if the sum of your policy head target isn't approximately 1?
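
A minimal sketch of the re-normalization being suggested (illustrative names, not the actual tfprocess.py code):

import tensorflow as tf

def normalize_policy_target(policy_target):
    # policy_target: (batch, 1858). Dividing by the per-position sum makes
    # every target a proper probability distribution before the policy loss
    # is computed.
    total = tf.reduce_sum(policy_target, axis=-1, keepdims=True)
    return policy_target / tf.maximum(total, 1e-8)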

Training cfg error

I am getting the error "load() missing 1 required positional argument: 'Loader'" when trying to run train.py. That is the whole error message; I am not getting any line number or module name that is causing this error.
I am using this code on Python version 3.6.
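
A likely cause (an assumption, not something stated in the report): PyYAML 6 made the Loader argument to yaml.load() mandatory, so an older script that calls yaml.load(f) fails with exactly this message. The usual fix is either to pin an older PyYAML or to pass a loader:

import yaml

with open("config.yaml") as f:                    # illustrative path
    cfg = yaml.load(f, Loader=yaml.SafeLoader)    # or simply: yaml.safe_load(f)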

Input pipeline

Hello,

I profiled the training and it seems that GPU operations and CPU operations are not done in parallel, as they should be with the line "dataset.prefetch(4)" in train.py.
Here is a screenshot of what I mean:
https://imgur.com/a/tHHQ3OK

So I tried something simple. I converted the dataset into a TFRecordDataset and read from that instead of the current pipeline. The resulting pipeline was 1) much faster and 2) executed in parallel; here is the new profile:
https://imgur.com/a/Vlh2VmG

On a K80 it roughly doubled (EDIT: after removing the profiler it is x4.7) the pos/s on a 6x64 network.

Here is some code to transform into a TFRecordDataset:

import time

import numpy as np
import tensorflow as tf


def _bytes_feature(value):
    value = tf.compat.as_bytes(value.tostring())
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def write_dataset(dataset, train_iterator, test_iterator, train_ratio):
    session = tf.Session()
    handle = tf.placeholder(tf.string, shape=[])
    iterator = tf.data.Iterator.from_string_handle(
        handle, dataset.output_types, dataset.output_shapes)
    next_batch = iterator.get_next()
    handles = {'train': session.run(train_iterator.string_handle()),
               'test': session.run(test_iterator.string_handle())}
    x = next_batch[0]  # tf.placeholder(tf.float32, [None, 112, 8*8])
    y_ = next_batch[1]  # tf.placeholder(tf.float32, [None, 1858])
    z_ = next_batch[2]  # tf.placeholder(tf.float32, [None, 1])

    filenames = {'train': 'train_bytes2', 'test': 'test_bytes2'}

    options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
    writers = {key: tf.python_io.TFRecordWriter(filenames[key], options=options)
               for key in filenames}
    train_every = int(train_ratio / (1 - train_ratio)) + 1
    for i in range(200):
        t = time.time()
        key = 'train'
        if not i % train_every:
            key = 'test'
        datas = session.run([tf.reshape(x, [-1, 112 * 8 * 8]), y_, z_],
                            feed_dict={handle: handles[key]})
        assert datas[0].shape[0] == datas[1].shape[0] == datas[2].shape[0]
        batch_size = datas[0].shape[0]
        for k in range(batch_size):
            x_raw = np.array(datas[0][k])
            _y_raw = np.array(datas[1][k])
            _z_raw = np.array(datas[2][k])

            example_bytes = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        'x': _bytes_feature(x_raw),
                        '_y': _bytes_feature(_y_raw),
                        '_z': _bytes_feature(_z_raw)
                    }))
            writers[key].write(example_bytes.SerializeToString())
        print(key, (time.time() - t) / batch_size, batch_size / (time.time() - t))

    for key in writers:
        writers[key].close()

And here to read it :

    def extract(example):
        features = {
            'x': tf.FixedLenFeature((), tf.string),
            '_y': tf.FixedLenFeature((), tf.string),
            '_z': tf.FixedLenFeature((), tf.string)
        }
        parsed_example = tf.parse_single_example(example, features)
        x = tf.decode_raw(parsed_example['x'], tf.float32)
        _y = tf.decode_raw(parsed_example['_y'], tf.float32)
        _z = tf.decode_raw(parsed_example['_z'], tf.float32)
        x.set_shape([112 * 64])
        _y.set_shape([1858])
        _z.set_shape([1])
        x = tf.reshape(x, [112, 64])
        return x, _y, _z

    filenames = {'train': 'test_bytes', 'test': 'test_bytes'}

    dataset = tf.data.TFRecordDataset(filenames=[filenames['train']],
                                      compression_type='GZIP')
    dataset = dataset.map(extract)
    dataset = dataset.batch(total_batch_size)
    dataset = dataset.prefetch(4)
    train_iterator = dataset.make_one_shot_iterator()

    dataset = tf.data.TFRecordDataset(filenames=[filenames['test']],
                                      compression_type='GZIP')
    dataset = dataset.map(extract)
    dataset = dataset.batch(total_batch_size)
    dataset = dataset.prefetch(4)
    test_iterator = dataset.make_one_shot_iterator()

The reason I am interested in this is that I would like to train very small networks to try different architectures but the bottleneck becomes the input pipeline for small networks.

I think pre-reading the data to write it in TFRecord format is worthwhile. Or is there a simpler solution?
Do you have any thoughts on that? I have no idea how the current pipeline works.

Thanks

EDIT: Actually when removing the profiler the gain was even bigger.
With current pipeline: (773.638 pos/s)
With TFRecordDataset: (3661.37 pos/s)
This is still with K80, 6x64 network, batch_size: 1024 and no batch split.

requirements file is broken

tensorflow requires a specific numpy version; numpy can be removed from that list since it's a dependency of tensorflow, and usually tensorflow-gpu is what is actually required.

(Small) Possibility of Duplicated Training Games

As the title indicates: It is possible that, despite current temperature settings, there may be duplicate games in the training data. If this is indeed the case, then any duplicate games should probably be removed.

How to load the weight pb file to tensorflow model?

Hi,
I followed the readme and put the downloaded pb file "192x15-2022_0418_1738_54_779.pb" into the example.yaml training: path.

It always raises an error:
This is the tensorflow code reading the proto model; however, I always get a file decoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 1: invalid start byte

May I know how to load the pb file correctly?

Thank you!
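
For what it's worth, the distributed .pb networks are lc0 weight protobufs (often gzip-compressed), not TensorFlow SavedModels, which is why UTF-8/Keras readers choke on them. A rough sketch of opening one with the protobuf bindings that ./init.sh generates into tf/proto (the field accessed at the end is an assumption about net.proto, so treat the whole thing as illustrative):

import gzip

from proto import net_pb2          # generated by ./init.sh into tf/proto

raw = open("192x15-2022_0418_1738_54_779.pb", "rb").read()
try:
    raw = gzip.decompress(raw)     # networks from the training server are usually gzipped
except OSError:
    pass                           # already uncompressed
net = net_pb2.Net()
net.ParseFromString(raw)
print(net.min_version)             # assumed field; see proto/net.proto for the full schema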

Question about the meaning of input shape and output shape

The input plane shape is (batch_size, 112, 8, 8) and the output shape is (batch_size, 1858). What is the meaning of the 112 and the 1858? In other words, what should I input to the model and what should I expect the model to output? Thank you!

Update: here is the answer from the chat:

About the input

Input planes are encoded using this function https://github.com/LeelaChessZero/lc0/blob/master/src/neural/encoder.cc#L134
It's quite complicated, but recent networks use the INPUT_CLASSICAL_112_PLANE encoding, which is simple, and it's also essentially the same encoding that AlphaZero used -- you may check the AlphaZero paper.

Different from the AZ paper (https://arxiv.org/pdf/1712.01815.pdf, page 13): their total number of input channels seems to be 119 rather than 112.
In lc0, the "repetitions" is set to be 1 and there is one more feature which is all one to help conv to detect edge.

(6 + 6 + 1) x 8 + 1 + 1 + 2 + 2 + 1 + 1 = 112
which is (P1 pieces + P2 pieces + repetition) x history_len + colour + total move count + P1 castling (2) + P2 castling (2) + no-progress count + [one more all-ones feature plane to help the convolutions detect the edge]

Board Representation:

N × N × (MT + L) Image Stack: The state of the game is represented by a three-dimensional array, where:
N × N represents the chessboard dimensions (8x8 for chess). T = 8.
MT + L refers to the depth of the stack, with different layers representing various game features.

T Sets of M Planes:

These sets represent the game state at different time-steps, providing the network with a history of the game. Each set contains M planes, with each plane being an N × N grid.

Binary Feature Planes:

These planes indicate the presence of each player's pieces on the board. For example, there could be separate planes for the player's pawns, knights, bishops, etc., and similar planes for the opponent's pieces.

Board Orientation:

The board is oriented from the perspective of the current player, meaning the network always 'sees' the board as if it were making the next move.

Additional Features:

L Constant-valued Input Planes: These planes provide additional contextual information about the game:
Player's Color: Which player (white or black) the neural network is analyzing the position for.
Total Move Count: The number of moves made in the game so far.
Special Rules: Information about game-specific rules, such as castling rights in chess, repetition counts, and no-progress counts.

Chess-Specific Implementation:

From Table S1, we can see how these features are quantified for chess:

P1 piece and P2 piece (6 each): Indicates the presence of each type of piece (pawn, knight, bishop, rook, queen, king) for both players.
Repetitions (2): Tracks the repetition of positions, important for threefold repetition rules.
Other Features (7 in total): Includes color, total move count, castling rights for both players (kingside and queenside separately), and no-progress count.

About the output

output[0]

https://github.com/LeelaChessZero/lc0/blob/master/src/chess/bitboard.cc#L36

Basic Moves: The majority of these moves are in the format 'e2e4', 'd7d5', etc. This is standard algebraic notation, where the first two characters represent the starting square (file and rank) and the next two characters represent the ending square. For example, 'e2e4' means moving a piece from square e2 to e4.

Promotions: Towards the end of the list, there are moves like 'a7a8q', 'b7b8r', etc. These represent pawn promotion moves. In chess, when a pawn reaches the opposite end of the board, it must be promoted to a queen, rook, bishop, or knight. In these notations, the first four characters denote the move (e.g., a pawn moving from a7 to a8), and the fifth character denotes the piece the pawn is promoted to ('q' for queen, 'r' for rook, 'b' for bishop, and 'n' for knight).

output[1] and output[2]

there are two additional outputs: the "value", typically (batch_size, 3), which gives the win, draw, and loss probabilities for the current position, and the optional "moves left", which is (batch_size, 1), giving an estimate of the number of moves left in the game.
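
To make the shapes concrete, a small numpy summary of the tensors involved (dummy zero arrays standing in for real data; no model is actually called here):

import numpy as np

planes = np.zeros((1, 112, 8, 8), dtype=np.float32)   # one position: 112 planes over the 8x8 board
policy = np.zeros((1, 1858), dtype=np.float32)        # one logit per encodable move/promotion
value = np.zeros((1, 3), dtype=np.float32)            # win/draw/loss probabilities ('wdl' head)
moves_left = np.zeros((1, 1), dtype=np.float32)       # optional moves-left estimate
print(planes.shape, policy.shape, value.shape, moves_left.shape)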

Make batchnorm gammas trainable for reset

jio on the Discord suggested this. He never got around to making an issue for it, as far as I saw. I figured I would make one, lest it be forgotten.

[2:36 PM] Error323: Why weren't they I wonder. Was this not required in leelago @jio ?
[2:44 PM] jio: @Error323 It's an old oversight in LZ network architecture
[2:44 PM] jio: they should be trainable in LZ too
[2:45 PM] jio: however they can't be enabled while maintaining backwards compatibility so it's a little tricky to enable them
[2:45 PM] Error323: Yeah, so before a new run is an excellent idea. Thanks.
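
For context, the "gammas" are the batch-norm scale parameters; in code the change boils down to enabling the scale term (a sketch, not the actual tfprocess.py call):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(
    center=True,   # beta (offset), already trainable
    scale=True)    # gamma, the trainable scale this issue proposes to enable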

No proto files after git clone execution

admin@a4000-21bn12:/mnt$ git clone https://github.com/LeelaChessZero/lczero-training.git
Cloning into 'lczero-training'...
remote: Enumerating objects: 1238, done.
remote: Counting objects: 100% (139/139), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 1238 (delta 93), reused 78 (delta 43), pack-reused 1099
Receiving objects: 100% (1238/1238), 473.95 KiB | 1.66 MiB/s, done.
Resolving deltas: 100% (681/681), done.
admin@a4000-21bn12:/mnt$ cd lczero-training
admin@a4000-21bn12:/mnt/lczero-training$ ls
README.md  init.sh  libs  scripts  tf
admin@a4000-21bn12:/mnt/lczero-training$ ./init.sh
libs/lczero-common/proto/net.proto: No such file or directory
libs/lczero-common/proto/chunk.proto: No such file or directory
touch: cannot touch 'tf/proto/__init__.py': No such file or directory
admin@a4000-21bn12:/mnt/lczero-training$ 

Issue is fixed via downloading files manually from

https://github.com/LeelaChessZero/lczero-common/tree/4dfa4ce8339357819f7de01517e6297d4c768cdf
(download zip)

The proto folder is missing there and git clone does not fetch it; it even looks different in the repository view. It appears as a link to the Arcturai branch; the proto files are indeed there, but clone does not bring them down, and a physical folder with the files is needed. Uploading the files as a PR is not working either; it says uploads are disabled:
File uploads require push access to this repository.

admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ wget https://github.com/LeelaChessZero/lczero-common/archive/4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
--2022-04-28 09:35:54--  https://github.com/LeelaChessZero/lczero-common/archive/4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/LeelaChessZero/lczero-common/zip/4dfa4ce8339357819f7de01517e6297d4c768cdf [following]
--2022-04-28 09:35:54--  https://codeload.github.com/LeelaChessZero/lczero-common/zip/4dfa4ce8339357819f7de01517e6297d4c768cdf
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘4dfa4ce8339357819f7de01517e6297d4c768cdf.zip’

4dfa4ce8339357819f7de01517e6297d4c     [ <=>                                                            ]   4.22K  --.-KB/s    in 0s      

2022-04-28 09:35:54 (83.7 MB/s) - ‘4dfa4ce8339357819f7de01517e6297d4c768cdf.zip’ saved [4318]

admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ ls
4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ unzip 4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
Archive:  4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
4dfa4ce8339357819f7de01517e6297d4c768cdf
   creating: lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf/
   creating: lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf/proto/
  inflating: lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf/proto/chunk.proto  
  inflating: lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf/proto/net.proto  
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ ls
4dfa4ce8339357819f7de01517e6297d4c768cdf.zip  lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ cd lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common/lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf$ ls
proto
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common/lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf$ mv proto /mnt/lczero-training/libs/lczero-common/
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common/lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf$ ls
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common/lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf$ cd ..
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ ls
4dfa4ce8339357819f7de01517e6297d4c768cdf.zip  lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf  proto
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ rm 4dfa4ce8339357819f7de01517e6297d4c768cdf.zip
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ rm -rf lczero-common-4dfa4ce8339357819f7de01517e6297d4c768cdf
admin@a4000-21bn12:/mnt/lczero-training/libs/lczero-common$ ls
proto
admin@a4000-21bn12:/mnt/lczero-training$ ./initpr.sh
cd libs/lczero-common/proto && ls
chunk.proto  net.proto
proto files are successfully patched
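
For reference, libs/lczero-common is a git submodule, so the usual fix is simply to clone with submodules included, e.g. git clone --recurse-submodules https://github.com/LeelaChessZero/lczero-training.git, or in an existing checkout to run git submodule update --init, and then re-run ./init.sh. The manual download above achieves the same end result.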

Analyze blunders

From @mooskagh on May 8, 2018 7:23

Important!

When reporting positions to analyze, please use the following form. It makes it easier to see what's problematic with the position:

  • Id: Optional unique ID. Come up with something. :) number/word, just to make it easier to refer this position in further comments.
  • Game: Preferably link to lichess.org (use lichess.org/paste), or at least PGN text.
  • Bad move: Bad move with number, and optionally stockfish eval for that move
  • Correct move: Good move to play, optionally with stockfish eval
  • Screenshot: optional, screenshot of the position, pasted right into the message (not as link!). Helps grasping the problem without following links
  • Configuration: Configuration used, including lc0/lczero version, operating system, and non-default parameters (number of threads, batch size, fpu reduction, etc).
  • Network ID: Network ID, very important
  • Time control: Time control used, and if known how much time/nodes was spent thinking this move
  • Comments: any comments that you may have, e.g. free word explanation what's happening in position.

(old text below)

There are many reports on forums asking about blunders, and the answers so far have been something along the lines of "it's fine, it will learn eventually, we don't know exactly why it happens".

I think at this point it makes sense to actually look into them, to confirm that there are no blind spots in training. For that we need to:

  • Open position in engine and check counters
  • Try several times to evaluate that move with the training configuration "800 playouts, with Dirichlet noise and temperature (--temperature=1.0 --noise)" to see how the training data would look for this position.

Eventually all of this would be nice to have as a single command, but we can start manually.

For lc0, that can be done this way: --verbose-move-stats -t 1 --minibatch-size=1 --no-smart-pruning (unless you want to debug specifically with other settings).

Then run UCI interface, do command:

position startpos moves e2e4 ....

(PGN move to UCI notation can be converted using pgn-extract -Wuci)

Then do:

go nodes 10

see results, add some more nodes by running:

go nodes 20
go nodes 100
go nodes 800
go nodes 5000
go nodes 10000
and so on

And look how counters change.

Counters:

e2e4 N: 329 (+ 4) (V: -12.34%) (P:38.12%) (Q: -0.2325) (U: 0.2394) (Q+U: 0.0069)
 ^      ^    ^      ^           ^          ^            ^           ^
 |      |    |      |           |          |            |           Q+U, see below
 |      |    |      |           |          |           U from PUCT formula,
 |      |    |      |           |          |           see below.
 |      |    |      |           |         Average value of V in a subtree
 |      |    |      |          Probability of this move, from NN, but if Dirichlet
 |      |    |      |          node is on, it's also added here, 0%..100%
 |      |    |     Expected outcome for this position, directly from NN, -100%..100%
 |      |   How many visits are processed by other threads when this is printed.
 |     Number of visits. The move with maximum visits is chosen for play.
Move

* U = P * Cpuct * sqrt(sum of N of all moves) / (N + 1)
  CPuct is a search parameter, can be changed with a command line flag.
* The move with largest Q+U will be visited next
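
As a toy illustration of those two bullet points (not lc0's actual search code), selecting the next move to visit could look like this:

import math

def select_next(children, cpuct=1.5):
    """children: list of dicts with keys 'move', 'N', 'P', 'Q'."""
    total_n = sum(c['N'] for c in children)
    def q_plus_u(c):
        u = c['P'] * cpuct * math.sqrt(total_n) / (c['N'] + 1)
        return c['Q'] + u
    return max(children, key=q_plus_u)['move']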

Help wanted:

  • Feel free to post positions that you think need analyzing (don't forget to also mention network Id used, and also all other settings are nice to know)
  • Feel free to analyze what other people posted

Copied from original issue: glinscott/leela-chess#558

Material Evaluation by LC0

Hi
A recent article published by the AlphaZero team gave a material evaluation of the chess pieces.
Is it possible to obtain the same evaluation with the current LC0?

UnboundLocalError: 'mse_loss' is used before assignment when using `value: classical` in yaml file

Hi There,

I'm doing some supervised learning with the use of engine scores as policy data. Because of that, I'm using value: classical in my yaml file.

Once I begin the training, I get this error in my tfprocess.py file:

UnboundLocalError: 'mse_loss' is used before assignment

This is triggered by the code below. I noticed that mse_loss is not defined if I choose a value that is not wdl.

    def process_inner_loop(self, x, y, z, q, m):
        with tf.GradientTape() as tape:
            outputs = self.model(x, training=True)
            policy = outputs[0]
            value = outputs[1]
            policy_loss = self.policy_loss_fn(y, policy)
            reg_term = sum(self.model.losses)
            if self.wdl:
                value_ce_loss = self.value_loss_fn(self.qMix(z, q), value)
                value_loss = value_ce_loss
            else:
                value_mse_loss = self.mse_loss_fn(self.qMix(z, q), value)
                value_loss = value_mse_loss
            if self.moves_left:
                moves_left = outputs[2]
                moves_left_loss = self.moves_left_loss_fn(m, moves_left)
            else:
                moves_left_loss = tf.constant(0.)

            total_loss = self.lossMix(policy_loss, value_loss,
                                      moves_left_loss) + reg_term
            if self.loss_scale != 1:
                total_loss = self.optimizer.get_scaled_loss(total_loss)
        if self.wdl:
            mse_loss = self.mse_loss_fn(self.qMix(z, q), value)
        else:
            value_loss = self.value_loss_fn(self.qMix(z, q), value)
        return policy_loss, value_loss, mse_loss, moves_left_loss, reg_term, tape.gradient(
            total_loss, self.model.trainable_weights)

To patch this, I simply assigned mse_loss = value_mse_loss like below:

        if self.wdl:
            mse_loss = self.mse_loss_fn(self.qMix(z, q), value)
        else:
            mse_loss = value_mse_loss
            value_loss = self.value_loss_fn(self.qMix(z, q), value)

After this, the code has worked. I'm not exactly very knowledgeable about machine learning, so let me know if the above makes sense. If it does, I'll submit a pull request.

Split script trim assumes that test and train are correct ratios before starting.

The trim step always removes a constant number of the oldest entries from test and a different constant number from train. If the initial state is unbalanced, this imbalance continues to exist.

Probably, as a first step before running diff to calculate which games need to be brought over, there should be a balancing trim: instead of trimming by a constant amount, it should trim to ensure both test and train are no larger than the intended total, and that even if smaller, they have the correct ratio, as sketched below.
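
One possible reading of that balancing trim, as a sketch (the function name and exact policy are made up, not taken from the split script):

def balanced_targets(n_train, n_test, total, train_ratio=0.9):
    """Largest counts we can keep that respect both the overall cap and the ratio."""
    t = min(total,
            int(n_train / train_ratio),
            int(n_test / (1.0 - train_ratio)))
    return int(t * train_ratio), int(t * (1.0 - train_ratio))

# An unbalanced start gets trimmed back to a 9:1 ratio before the diff/copy step tops it up.
print(balanced_targets(n_train=540_000, n_test=200_000, total=1_000_000))  # (540000, 60000)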

net_to_model.py - ValueError: Number of filters in YAML doesn't match the network

Using t40.yml from the branch on github, using network 40800 (tried others, same result). There are no games in the input_test / train directories, as I'm just trying to generate the model at this point. If I pass in a t30 network with the t40.yml I get much further (scroll down)

input
(base) D:\Chess\Training\lczero-training\tf>python net_to_model.py --cfg=../configs/t40.yml 40800

output

dataset:
  input_test: D:\\Chess\\Training\\lczero-training\\games\test\b001
  input_train: D:\\Chess\\Training\\lczero-training\\games\\train\b001
  num_chunks: 500000
  train_ratio: 0.9
gpu: 0
model:
  filters: 256
  policy_channels: 80
  residual_blocks: 20
  se_ratio: 8
name: 256x20-t40
training:
  batch_size: 4096
  checkpoint_steps: 10000
  lr_boundaries:
  - 100
  lr_values:
  - 0.02
  - 0.02
  max_grad_norm: 2
  num_batch_splits: 8
  path: D:\\Chess\\Training\\lczero-training\\networks
  policy_loss_weight: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_steps: 25
  test_steps: 125
  total_steps: 250
  train_avg_report_steps: 25
  value_loss_weight: 1.0
  warmup_steps: 125

Traceback (most recent call last):
  File "net_to_model.py", line 25, in <module>
    raise ValueError("Number of filters in YAML doesn't match the network")
ValueError: Number of filters in YAML doesn't match the network

YAML for reference

%YAML 1.2
---
name: '256x20-t40'                  # ideally no spaces
gpu: 0                                 # gpu id to process on

dataset: 
  num_chunks: 500000                   # newest nof chunks to parse
  train_ratio: 0.90                    # trainingset ratio
  # For separated test and train data.
  input_train: 'D:\\Chess\\Training\\lczero-training\\games\\train\b001' # supports glob
  input_test: 'D:\\Chess\\Training\\lczero-training\\games\test\b001'  # supports glob
  # For a one-shot run with all data in one directory.
  #input: '/work/lc0/data/'

training:
    swa: true
    swa_steps: 25
    swa_max_n: 10
    max_grad_norm: 2
    batch_size: 4096                   # training batch
    num_batch_splits: 8
    test_steps: 125                    # eval test set values after this many steps
    train_avg_report_steps: 25        # training reports its average values after this many steps.
    total_steps: 250                  # terminate after these steps
    warmup_steps: 125
    checkpoint_steps: 10000          # optional frequency for checkpointing before finish
    shuffle_size: 500000               # size of the shuffle buffer
    lr_values:                         # list of learning rates
        - 0.02
        - 0.02
    lr_boundaries:                     # list of boundaries
        - 100
    policy_loss_weight: 1.0            # weight of policy loss
    value_loss_weight: 1.0            # weight of value loss
    path: 'D:\\Chess\\Training\\lczero-training\\networks'          # network storage dir

model:
  filters: 256
  residual_blocks: 20
  se_ratio: 8
  policy_channels: 80
...

T30 attempt

D:\Chess\Training\lczero-training\tf>python net_to_model.py --cfg=../configs/t40.yml 32890

output

dataset:
  input_test: D:\\Chess\\Training\\lczero-training\\games\test\\b001
  input_train: D:\\Chess\\Training\\lczero-training\\games\\train\\b001
  num_chunks: 500000
  train_ratio: 0.9
gpu: 0
model:
  filters: 256
  policy_channels: 80
  residual_blocks: 20
  se_ratio: 8
name: 256x20-t40
training:
  batch_size: 4096
  checkpoint_steps: 10000
  lr_boundaries:
  - 100
  lr_values:
  - 0.02
  - 0.02
  max_grad_norm: 2
  num_batch_splits: 8
  path: D:\\Chess\\Training\\lczero-training\\networks
  policy_loss_weight: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_steps: 25
  test_steps: 125
  total_steps: 250
  train_avg_report_steps: 25
  value_loss_weight: 1.0
  warmup_steps: 125

2019-02-16 13:31:07.643530: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2019-02-16 13:31:07.918932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2019-02-16 13:31:07.927027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-16 13:31:08.403139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-16 13:31:08.407630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-16 13:31:08.410846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-16 13:31:08.414238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10137 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From D:\Chess\Training\lczero-training\tf\tfprocess.py:144: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.


Traceback (most recent call last):
  File "net_to_model.py", line 39, in <module>
    tfp.replace_weights(weights)
  File "D:\Chess\Training\lczero-training\tf\tfprocess.py", line 302, in replace_weights
    new_weight = tf.constant(new_weights[e], shape=weights.shape)
  File "D:\Users\brandon\Miniconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "D:\Users\brandon\Miniconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 497, in make_tensor_proto
    (shape_size, nparray.size))
ValueError: Too many elements provided. Needed at most 256, but received 589824

Training without GPU

Is it possible to train the model without using a GPU? I know it's not efficient, but I'm running some experiments in my local computer before moving the experiments to a machine with GPU.

The gpus arg in the configuration file is mandatory, so a value must be specified, my question is which value to specify if I don't want to use GPU for training.

Thanks!
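
Not an answer for the gpu key itself, but a generic TensorFlow-level workaround (not verified against this repo's train.py) is to hide all GPUs before the model is built, so everything falls back to CPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # must be set before TensorFlow initializes CUDA

import tensorflow as tf
tf.config.set_visible_devices([], "GPU")  # no GPUs visible to TF
print(tf.config.get_visible_devices("GPU"))  # -> []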

which python version is required

ERROR: Could not find a version that satisfies the requirement tensorflow==2.5.1
ERROR: No matching distribution found for tensorflow==2.5.1

I could fix it by using conda, but since I don't know which Python version I need, I'd have to google that first.
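
For what it's worth, the tensorflow==2.5.1 wheels on PyPI were only published for Python 3.6 through 3.9, so this error usually means the interpreter is newer (for example 3.10+); creating a 3.8 or 3.9 environment with conda or venv should let pip find a matching distribution.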

ZeroDivisionError in squeeze_excitation function due to SE_ratio being set to zero

Description
I encountered a ZeroDivisionError during the execution of the net_to_model.py script in the lczero-training project. The error occurs in the squeeze_excitation function within the tfprocess.py file, specifically at the line where it asserts that the number of channels is evenly divisible by self.SE_ratio. The modulo in that assertion divides by zero, indicating that self.SE_ratio is inadvertently set to zero.

Steps to Reproduce

  1. Clone the lczero-training repository with submodules.
  2. Install necessary Python packages: numpy, tensorflow, protobuf.
  3. Download specific network weights and configuration files.
  4. Initialize and run the training setup as per the provided instructions.
  5. The error occurs during the execution of the net_to_model.py script, specifically when the squeeze_excitation function is called.

Expected Behavior
The squeeze_excitation function should execute without errors, processing the input tensor by applying squeeze and excitation operations based on a non-zero SE_ratio.

Actual Behavior
The execution fails with a ZeroDivisionError, indicating that self.SE_ratio is set to zero, which is not expected. The traceback points to the squeeze_excitation function in tfprocess.py.

Environment
https://colab.research.google.com/drive/1a3lkH1IUG-P_Y7scNjenmTmRdJ0RF_5R?usp=sharing

Additional Context
The error suggests a misconfiguration or an oversight in the initialization of the SE_ratio. This parameter is crucial for the squeeze-excitation operation, and it should be a positive integer that divides the number of channels without remainder. It's possible that this is either a code bug or a configuration issue.

Here's the relevant portion of the error message for quick reference:

Traceback (most recent call last):
  File "/content/lczero-training/tf/net_to_model.py", line 28, in <module>
    tfp.init_net()
  File "/content/lczero-training/tf/tfprocess.py", line 383, in init_net
    outputs = self.construct_net(input_var)
  File "/content/lczero-training/tf/tfprocess.py", line 1529, in construct_net
    flow = self.create_residual_body(inputs)
  File "/content/lczero-training/tf/tfprocess.py", line 1424, in create_residual_body
    flow = self.residual_block(flow,
  File "/content/lczero-training/tf/tfprocess.py", line 1248, in residual_block
    out2 = self.squeeze_excitation(self.batch_norm(conv2,
  File "/content/lczero-training/tf/tfprocess.py", line 1196, in squeeze_excitation
    assert channels % self.SE_ratio == 0
ZeroDivisionError: integer division or modulo by zero

I would appreciate any insights into this issue or suggestions on how to properly configure the SE_ratio to avoid this error.
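
One thing worth checking (an assumption, not verified against the notebook): tfprocess reads the SE ratio from the model section of the yaml passed via --cfg, so if that section omits se_ratio or sets it to 0, the assert ends up doing a modulo by zero. Setting it to a positive divisor of the filter count, for example se_ratio: 8 with filters: 256 as in the t40.yml shown earlier on this page, should avoid the ZeroDivisionError, provided it matches the network being converted.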

EDIT

Report a 'train' number before step 1 like for 'test'.

I'm not sure whether this could be done easily without being considered a risk to training quality, or whether it would be worth the probably small payoff. Currently, with test/train separation, the test line gives a good measure of the movement caused by a data change, but the train line does not, since it is just a trailing average.

This could be done by running the same logic that generates the test output, but feeding it the train side of the split data instead, then outputting its results named as train with a step count of 1. To be truly useful, the last 'train' value would probably need to be generated the same way; otherwise it is just a trailing average, and the difference between that value and the first value of the next run would not be a pure 'data shift only' measure.

The potential value of having train values also show a data shift is that a difference in the size of the data shift between test and train could be informative about the amount of overfit that leaves the window without being resolved first.

Raise error on training data from different format versions

Perhaps an error could be raised to warn if training data of different formats is mixed? It took me a bit to figure out why it was failing until I realized that the record sizes being read were varying.


EDIT: Hmm, using any non-v6 training data runs into problems since parse has v6_gen hardcoded into it:

    def parse(self):
        """
        Read data from child workers and yield batches of unpacked records
        """
        gen = self.v6_gen()        # read from workers
        gen = self.tuple_gen(gen)  # convert v6->tuple
        gen = self.batch_gen(gen)  # assemble into batches
        for b in gen:
            yield b
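
A sketch of the suggested guard; the function and parameter names here are illustrative, not the parser's actual ones (record_size would be something like the parser's v6 struct size):

def check_chunk(chunkdata: bytes, record_size: int):
    if len(chunkdata) % record_size != 0:
        raise ValueError(
            "chunk length %d is not a multiple of the expected v6 record size %d; "
            "was this chunk written with a different training-data version?"
            % (len(chunkdata), record_size))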

Training MLH off of result MLHZ and prediction MLHR as MLHRZ_RATIO

EDIT: I've edited the proposed function to include a limit convergence rate for the sigmoid function.
The idea is to include searched-M-at-root and actual MLH related to the result in a ratio for training MLH, much like how Q-ratio currently takes Z and Q to train W,D,L.

One issue is that a pure, unfettered ratio between the two may be asking for disaster: as the end of the game nears, the actual result should be more certain and accurate (at least early in training) than the searched-M-at-root.

This might be a tough 4-multivariable tune, so some of these factors can be eliminated if need be--however, each of the factors should improve the fidelity of the end product. (I consider ω and θ of highest priority.)

We can use a sigmoid function of MLHZ ≥ 1 (with optionally negative parameters ω, ψ) to create the moves-left based ratio:
MLHRZ_RATIO(ω,C,ψ,θ)
MLHRZ_RATIO = ω / (1 + e^(C(MLHZ-ψ))) + θ

where the limits satisfy
0 < lim MLHRZ_RATIO < 1   as MLHZ -> +∞ and as MLHZ -> 0
or, for simplicity (ignoring the extra freedom allowed by C and ψ):
0 < ω + θ < 1

C is a convergence factor that adjusts the MLHZ-related convergence to the limits of the sigmoid output.

ω is a coefficient bounded by -1+θ < ω < 1-θ that modulates the height of the sigmoid curve.

ψ (chosen as it is a high-level pun on left-right "psichology") is a coefficient that shifts the sigmoid curve right if positive and left if negative.

θ shifts the sigmoid function up or down, bounded by 0 < θ < 1 (not strictly bounded, because a lower/upper portion of the sigmoid may be cut off before MLHZ ≥ 1).
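
Transcribing the proposed function into Python just to make the parameter roles concrete (the parameter values below are arbitrary examples, not tuned suggestions):

import math

def mlhrz_ratio(mlhz, omega, c, psi, theta):
    return omega / (1.0 + math.exp(c * (mlhz - psi))) + theta

# With omega + theta kept inside (0, 1), the output stays a valid mixing weight:
for mlhz in (1, 20, 80, 200):
    print(mlhz, round(mlhrz_ratio(mlhz, omega=0.6, c=0.05, psi=40, theta=0.2), 3))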

VarianceScaling Initializer Is Unseeded

@Tilps ran into this issue after upgrading TensorFlow:

/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py:120: UserWarning: The initializer VarianceScaling is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.

From Tilps:

I think its a bug in our code - I think you are supposed to construct a separate instance for every use case, so that 'call' is only invoked at most once.

I assume it is this code the above warning references.

@masterkni6, can you look at this when you get a chance?
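
A minimal sketch of what likely triggers the warning and what the fix looks like; this is not the repo's code, and the exact warning behaviour depends on the Keras version:

import tensorflow as tf

# Problematic pattern: one VarianceScaling instance shared by several layers.
shared_init = tf.keras.initializers.VarianceScaling()
layer_a = tf.keras.layers.Dense(32, kernel_initializer=shared_init)
layer_b = tf.keras.layers.Dense(32, kernel_initializer=shared_init)

x = tf.zeros((1, 16))
_ = layer_a(x)
_ = layer_b(x)  # the second use of the same instance is what the warning complains about

# Fix: construct a fresh instance (or pass the string name) per layer.
layer_c = tf.keras.layers.Dense(32, kernel_initializer=tf.keras.initializers.VarianceScaling())
layer_d = tf.keras.layers.Dense(32, kernel_initializer="variance_scaling")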

leelalogs writing in train_input folder bug on colab

I encountered a strange bug related to leelalogs: train.py created a leelalogs folder in the folder where the training chunks are! I saw it happen to another person who tried colab training, and then ran into it myself...
My train.py and lczero-training folder are on Google Drive. The unpacked train and test chunk folders are on the colab HDD. That worked fine: the script used chunks from the colab HDD and saved leelalogs to the lczero/tf/leelalogs folder on Google Drive. As far as I know this path is hardcoded inside train.py, right? I suspect that the Google Drive mount can sometimes fail, and when train.py then wants to write somewhere, it writes into the input folder, the worst possible place. It then tries to parse that as chunks and training fails... It should write the logs somewhere else :)
